Optimizing Spark Performance with the Native Execution Engine (NEE) in Microsoft Fabric

Spark tuning often starts with the usual suspects (shuffle volume, skew, join strategy, caching)… but sometimes the biggest win is simply executing the same logical plan on a faster engine.

Microsoft Fabric’s Native Execution Engine (NEE) does exactly that: it keeps Spark’s APIs and control plane, but runs a large portion of Spark SQL / DataFrame execution on a vectorized C++ engine.

What NEE is (and why it’s fast)

NEE is a vectorized native engine that integrates into Fabric Spark and can accelerate many SQL/DataFrame operators without you rewriting your code.

  • You still write Spark SQL / DataFrames.
  • Spark still handles distributed execution and scheduling.
  • For supported operators, compute is offloaded to a native engine (reducing JVM overhead and using columnar/vectorized execution).

Fabric documentation calls out NEE as being based on Apache Gluten (the Spark-to-native glue layer) and Velox (the native execution library).

When NEE tends to help the most

NEE shines when your workload is:

  • SQL-heavy (joins, aggregates, projections, filters)
  • CPU-bound (compute dominates I/O)
  • Primarily on Parquet / Delta

You’ll see less benefit (or fallback) when you rely on features NEE doesn’t support yet.

How to enable NEE (3 practical options)

1) Environment-level toggle (recommended for teams)

In your Fabric Environment settings, go to Acceleration and enable the native execution engine, then Save + Publish.

Benefit: notebooks and Spark Job Definitions that use that environment inherit the setting automatically.

2) Enable for a single notebook / job via Spark config

In a notebook cell:

%%configure
{
  "conf": {
    "spark.native.enabled": "true"
  }
}

For Spark Job Definitions, add the same Spark property.

3) Disable/enable per-query when you hit unsupported features

If a specific query uses an unsupported operator/expression and you want to force JVM Spark for that query:

SET spark.native.enabled=FALSE;
-- run the query
SET spark.native.enabled=TRUE;

How to confirm NEE is actually being used

Two low-friction checks:

  1. Spark UI / History Server: look for plan nodes ending with Transformer or nodes like *NativeFileScan / VeloxColumnarToRowExec.
  2. df.explain(): the same Transformer / NativeFileScan / Velox… hints should appear in the plan.

Fabric also exposes a dedicated view (“Gluten SQL / DataFrame”) to help spot which queries ran on the native engine vs. fell back.
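As a quick sketch of check #2 (the table and column names below are placeholders, not from the Fabric docs): run a simple aggregation and inspect the physical plan.

df = spark.read.table("fact_sales")   # placeholder Delta table
agg = df.groupBy("region").count()    # placeholder column

# With NEE active, supported operators appear with native names (a "Transformer"
# suffix, NativeFileScan, or Velox-prefixed nodes). If a subtree shows only the
# regular JVM operator names, that part of the plan fell back.
agg.explain()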

Fallback is a feature (but you should know the common triggers)

NEE includes an automatic fallback mechanism: if the plan contains unsupported features, Spark will run that portion on the JVM engine.

A few notable limitations called out in Fabric documentation:

  • UDFs aren’t supported (fallback)
  • Structured streaming isn’t supported (fallback)
  • File formats like CSV/JSON/XML aren’t accelerated
  • ANSI mode isn’t supported

There are also some behavioral differences worth remembering (rounding/casting edge cases) if you have strict numeric expectations.

A pragmatic “NEE-first” optimization workflow

  1. Turn NEE on for the environment (or your job) and rerun the workload.
  2. If it’s still slow, open the plan and answer: is the slow part running on the native engine, or did it fall back?
  3. If it fell back, make the smallest possible change to keep the query on the native path (e.g., avoid UDFs; prefer built-in expressions; standardize on Parquet/Delta).
  4. Once the plan stays mostly native, go back to classic Spark tuning: reduce shuffle volume, fix skew, sane partitioning, and confirm broadcast joins.


This post was written with help from ChatGPT 5.2

The Best Thing That Ever Happened to Your Spark Pipeline Is a SQL Database

Here’s a counterintuitive claim: the most important announcement for Fabric Spark teams in early 2026 has nothing to do with Spark.

It’s a SQL database.

Specifically, it’s the rapid adoption of SQL database in Microsoft Fabric—a fully managed, SaaS-native transactional database that went GA in November 2025 and has been quietly reshaping how production data flows into lakehouse architectures ever since. If you’re a data engineer running Spark workloads in Fabric, this changes more than you think.

The ETL Pipeline You Can Delete

Most Spark data engineers have a familiar pain point: getting operational data from transactional systems into the lakehouse. You build ingestion pipelines. You schedule nightly batch loads. You wrestle with CDC (change data capture) configurations, watermark columns, and retry logic. You maintain all of it, forever.

SQL database in Fabric eliminates that entire layer.

When data lands in a Fabric SQL database, it’s automatically replicated to OneLake as Delta tables in near real-time. No pipelines. No Spark ingestion jobs. No orchestration. The data just appears, already in the open Delta format your notebooks and Spark jobs expect.

This isn’t a minor convenience—it’s an architectural shift. Every ingestion pipeline you don’t write is a pipeline you don’t debug at 2 AM.

What This Actually Looks Like in Practice

Let’s say you’re building an analytics layer on top of an operational SaaS application. Today, your architecture probably looks something like this:

  1. Application writes to Azure SQL or Cosmos DB
  2. ADF or Spark job pulls data on a schedule
  3. Data lands in a lakehouse as Delta tables
  4. Downstream Spark jobs transform and aggregate

With SQL database in Fabric, steps 2 and 3 vanish. Your application writes directly to the Fabric SQL database, and the mirrored Delta tables are immediately available for Spark processing. Here’s what your downstream notebook looks like now:

# Read operational data directly — no ingestion pipeline needed
# The SQL database auto-mirrors to OneLake as Delta tables
orders_df = spark.read.format("delta").load(
    "abfss://your-workspace@onelake.dfs.fabric.microsoft.com/your-sqldb.SQLDatabase/dbo.Orders"
)

# Your transformation logic stays the same
from pyspark.sql import functions as F

daily_revenue = (
    orders_df
    .filter(F.col("order_date") >= F.date_sub(F.current_date(), 7))
    .groupBy("product_category")
    .agg(
        F.sum("total_amount").alias("revenue"),
        F.countDistinct("customer_id").alias("unique_customers")
    )
    .orderBy(F.desc("revenue"))
)

daily_revenue.write.format("delta").mode("overwrite").saveAsTable("gold.weekly_revenue_by_category")

The Spark code doesn’t change. What changes is everything upstream of it.

The Migration Risk Nobody’s Talking About

Here’s where it gets interesting—and where Malcolm Gladwell would lean forward in his chair. The biggest risk of SQL database in Fabric isn’t technical. It’s organizational.

Teams that have invested heavily in ingestion infrastructure will face a classic innovator’s dilemma: the new path is simpler, but the old path already works. The temptation is to keep running your existing ADF pipelines alongside the new auto-mirroring capability, creating a hybrid architecture that’s worse than either approach alone.

My recommendation: don’t hybrid. Pick a workload, migrate it end-to-end, and measure. Here’s a concrete rollout checklist:

  1. Identify a candidate workload — Look for Spark jobs whose primary purpose is pulling data from a SQL source into Delta tables. These are your highest-value migration targets.
  2. Provision a Fabric SQL database — It takes seconds. You provide a name; Fabric handles the rest. Autoscaling and auto-pause are built in.
  3. Redirect your application writes — Point your operational application to the new Fabric SQL database. The engine is the same SQL Database Engine as Azure SQL, so T-SQL compatibility is high.
  4. Validate the Delta mirror — Confirm that your data is appearing in OneLake. Check schema fidelity, latency, and row counts:
# In your Spark notebook, validate the mirrored data
spark.sql("""
    SELECT COUNT(*) as row_count,
           MAX(modified_date) as latest_record,
           MIN(modified_date) as earliest_record
    FROM your_sqldb.dbo.Orders
""").show()
  5. Decommission the ingestion pipeline — Once validated, turn off the ADF or Spark ingestion job. Don’t just disable it—delete it. Zombie pipelines are how technical debt accumulates.
  6. Update your monitoring — Your existing data quality checks should still work since the Delta tables have the same schema. But update your alerting to watch for mirror latency instead of pipeline run failures.

The AI Angle Matters for Spark Teams Too

There’s a second dimension to this announcement that Spark engineers should pay attention to: the native vector data type in SQL database supports semantic search and RAG patterns directly in the transactional layer.

Why does that matter for Spark teams? Because it means your embedding pipelines can write vectors back to the same database your application reads from—closing the loop between batch ML processing in Spark and real-time serving in SQL. Instead of maintaining a separate vector store (Pinecone, Qdrant, etc.), you use the same SQL database that’s already mirrored into your lakehouse.
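As a rough sketch of that write-back loop (hedged: the table names, connection string, and authentication details are placeholders, and converting the JSON array into the native vector type happens on the T-SQL side per the SQL database vector documentation):

from pyspark.sql import functions as F

# Hypothetical table of precomputed embeddings: (doc_id, embedding array<float>)
embeddings_df = spark.read.table("gold.document_embeddings")

jdbc_url = "<JDBC connection string for your Fabric SQL database>"  # from the database settings

(
    embeddings_df
    # Serialize each vector as a JSON array string; convert it to the native
    # vector type on the T-SQL side (see the vector data type docs)
    .withColumn("embedding_json", F.to_json("embedding"))
    .drop("embedding")
    .write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.DocumentEmbeddings")   # hypothetical target table
    # add driver/authentication options as required by your environment
    .mode("append")
    .save()
)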

This is the kind of architectural simplification that compounds over time. Fewer systems means fewer failure modes, fewer credentials to manage, and fewer things to explain to your successor.

The Rollout Checklist

  • This week: Inventory your existing ingestion pipelines. How many just move data from SQL sources to Delta?
  • This sprint: Provision a Fabric SQL database and test the auto-mirror with a non-critical workload.
  • This quarter: Migrate your highest-volume ingestion pipeline and measure CU savings.
  • Track: Mirror latency, CU consumption before/after, and pipeline maintenance hours eliminated.

SQL database in Fabric went GA in November 2025 with enterprise features including row-level security, customer-managed keys, and private endpoints. For the full list of GA capabilities, see the official announcement. To understand how this fits into the broader Microsoft database + Fabric integration strategy, read Microsoft Databases and Microsoft Fabric: Your unified and AI-powered data estate. For Spark-specific Delta Lake concepts, the Delta Lake documentation remains the authoritative reference.

The best thing about this announcement isn’t any single feature. It’s that it makes your Spark architecture simpler by removing the parts that shouldn’t have been there in the first place.

This post was written with help from Claude Opus 4.6

Monitoring Spark Jobs in Real Time in Microsoft Fabric

If Spark performance work is surgery, monitoring is your live telemetry.

Microsoft Fabric gives you multiple monitoring entry points for Spark workloads: Monitor hub for cross-item visibility, item Recent runs for focused context, and application detail pages for deep investigation. This post is a practical playbook for using those together.

Why this matters

When a notebook or Spark job definition slows down, “run it again” is the most expensive way to debug. Real-time monitoring helps you:

  • spot bottlenecks while jobs are still running
  • isolate failures quickly
  • compare behavior across submitters and workspaces

1) Start at the Monitoring hub for cross-workspace triage

Use Monitoring in the Fabric navigation pane as your control tower.

  1. Filter by item type (Notebook, Spark job definition, Pipeline)
  2. Narrow by start time and workspace
  3. Sort by duration or status to surface outliers

For broad triage, this is faster than jumping directly into individual notebooks.

2) Pivot to Spark application details for root-cause analysis

Once you identify a problematic run, open the Spark application detail page and work through tabs in order:

  • Jobs: status, stages, tasks, duration, and processed/read/written data
  • Resources: executor allocation and utilization in near real time
  • Logs: inspect Livy, Prelaunch, and Driver logs; download when needed
  • Item snapshots: confirm exactly what code/parameters/settings were used at execution time

This sequence prevents false fixes where you tune the wrong layer.

3) Use notebook contextual monitoring while developing

For iterative tuning, notebook contextual monitoring keeps authoring, execution, and debugging in one place.

  1. Run a target cell/workload
  2. Watch job/stage/task progress and executor behavior
  3. Jump to Spark UI or detail monitoring for deeper traces
  4. Adjust code or config and rerun

4) A lightweight real-time runbook

  • Confirm scope in the Monitoring hub (single run or systemic pattern)
  • Open application details for the failing/slower run
  • Check Jobs for stage/task imbalance and long-running segments
  • Check Resources for executor pressure
  • Check Logs for explicit failure signals
  • Verify snapshots so you debug the exact submitted artifact

Common mistakes to avoid

  • Debugging from memory instead of snapshots
  • Looking only at notebook cell output and skipping Logs/Resources
  • Treating one anomalous run as a global trend without Monitor hub filtering


This post was written with help from ChatGPT 5.3

Running OpenClaw in Production: Reliability, Alerts, and Runbooks That Actually Work

Agents are fun when they’re clever. They’re useful when they’re boring.

If you’re running OpenClaw as an always-on assistant (cron jobs, health checks, publishing pipelines, internal dashboards), the failure mode usually isn’t “it breaks once.” It’s that it flakes intermittently and you can’t tell whether the problem is upstream, your network, your config, or the agent.

This post is the operational playbook that moved my setup from “cool demo” to “production-ish”: fewer false alarms, faster debugging, clearer artifacts, and tighter cost control.

The production baseline (don’t skip this)

Before you add features, lock the boring stuff:

  • One source of truth for cron/job definitions.
  • A consistent deliverables folder (so outputs don’t vanish into chat history).
  • A minimal runbook per job (purpose, dependencies, failure modes, disable/rollback).

Observability: prove what happened

When something fails, you want receipts — not vibes.

Minimum viable run-level observability:

  • job_name, job_id, run_id
  • start/end timestamp (with timezone)
  • what the job tried to do (high level)
  • what it produced (file paths, URLs)
  • what it depended on (network/API/tool)
  • the error and the evidence (HTTP status, latency, exception type)
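Here’s a minimal sketch of what that can look like as an append-only JSONL run log (the field names are just the ones from the list above; adapt them to your setup):

import json, uuid
from datetime import datetime, timezone
from pathlib import Path

def write_run_record(job_name: str, status: str, outputs: list[str],
                     dependencies: list[str], error: str | None = None,
                     log_dir: str = "deliverables/run-logs") -> None:
    """Append one JSON line per run so every failure comes with receipts."""
    record = {
        "job_name": job_name,
        "run_id": str(uuid.uuid4()),
        "ts": datetime.now(timezone.utc).isoformat(),
        "status": status,              # "ok" or "failed"
        "outputs": outputs,            # file paths / URLs produced
        "dependencies": dependencies,  # network/API/tool the run relied on
        "error": error,                # exception type, HTTP status, latency evidence
    }
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    with open(Path(log_dir) / f"{job_name}.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")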

Split latency: upstream vs internal

If Telegram is “slow,” is that Telegram API RTT/network jitter, internal queueing, or a slow tool call? Instrument enough to separate those — otherwise you’ll waste hours fixing the wrong layer.

Alert-only health checks (silence is success)

If a health check is healthy 99.9% of the time, it should not message you 99.9% of the time. A good health check:

  • prints NO_REPLY when healthy
  • emits one high-signal alert line when broken
  • includes evidence (what failed, how, and where to look)

Example alert shape:

⚠️ health-rollup: telegram_rtt_p95=3.2s (threshold=2.0s) curl=https://api.telegram.org/ ts=2026-02-10T03:12:00-08:00

Cron hygiene: stop self-inflicted outages

  • Idempotency: re-runs don’t duplicate deliverables.
  • Concurrency control: don’t let overlapping runs pile up.
  • Deterministic first phase: validate dependencies before doing expensive work.
  • Deadman checks: alert if a job hasn’t run (or hasn’t delivered) in N hours.
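A deadman check can piggyback on the run records sketched earlier: read the newest entry for a job and alert only when it’s too old. A minimal sketch, assuming the same JSONL run log:

import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def deadman_alert(job_name: str, max_age_hours: int,
                  log_dir: str = "deliverables/run-logs") -> str | None:
    """Return one alert line if the job hasn't delivered recently; None when healthy."""
    log_file = Path(log_dir) / f"{job_name}.jsonl"
    if not log_file.exists():
        return f"⚠️ deadman: {job_name} has no run records at all"
    last = json.loads(log_file.read_text(encoding="utf-8").strip().splitlines()[-1])
    age = datetime.now(timezone.utc) - datetime.fromisoformat(last["ts"])
    if age > timedelta(hours=max_age_hours):
        return f"⚠️ deadman: {job_name} last delivered {age} ago (threshold={max_age_hours}h)"
    return None  # healthy: stay silent (NO_REPLY)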

Evidence-based alerts: pages should come with receipts

A useful alert answers: (1) what failed, (2) where is the evidence (log path / file path / URL), and (3) what’s the next action. Anything else is notification spam.

Cost visibility: make it measurable

  • batch work; avoid polling
  • cap retries
  • route routine work to cheaper models
  • log model selection per run
  • track token usage from local transcripts (not just “current session model”)

Deliverables: put outputs somewhere that syncs

Chat is not a file system. Every meaningful workflow should write artifacts to a synced folder (e.g., OneDrive): primary output, supporting evidence, and run metadata.

Secure-by-default: treat inputs as hostile

  • Separate read (summarize) from act (send/delete/post).
  • Require explicit confirmation for destructive/external actions.
  • Prefer allowlists over arbitrary shell.

Runbooks: make 2am fixes boring

  • purpose
  • schedule
  • dependencies
  • what “healthy” looks like
  • what “broken” looks like
  • how to disable
  • how to recover

What we changed (the short version)

  • Consolidated multiple probes into one evidence-based rollup.
  • Converted recurring checks to alert-only.
  • Standardized artifacts into a synced deliverables folder.
  • Added a lightweight incident runbook.
  • Put internal dashboards behind Tailscale on separate ports.

This post was written with help from ChatGPT 5.2

Lakehouse Table Optimization: VACUUM, OPTIMIZE, and Z-ORDER

If your Lakehouse tables are getting slower (or more expensive) over time, it’s often not “Spark is slow.” It’s usually table layout drift: too many small files, suboptimal clustering, and old files piling up.

In Fabric Lakehouse, the three table-maintenance levers you’ll reach for most are:

  • OPTIMIZE: compacts many small files into fewer, larger files (and can apply clustering)
  • Z-ORDER: co-locates related values to improve data skipping for common filters
  • VACUUM: deletes old files that are no longer referenced by the Delta transaction log (after a retention window)

Practical note: in Fabric, run these as Spark SQL in a notebook or Spark job definition (or use the Lakehouse maintenance UI). Don’t try to run them in the SQL Analytics Endpoint.

1) Start with the symptom: “small files” vs “bad clustering”

Before you reach for maintenance, quickly sanity-check what you’re fighting:

  • Many small files → queries spend time opening/reading lots of tiny Parquet files.
  • Poor clustering for your most common predicates (date, tenantId, customerId, region, etc.) → queries scan more data than they need.
  • Heavy UPDATE/DELETE/MERGE patterns → lots of new files + tombstones + time travel files.

If you only have small files, OPTIMIZE is usually your first win.

2) OPTIMIZE: bin-packing for fewer, bigger files

Basic compaction

OPTIMIZE my_table;

Target a subset (example: recent partitions)

OPTIMIZE my_table WHERE date >= date_sub(current_date(), 7);

A useful mental model: OPTIMIZE is rewriting file layout (not changing table results). It’s maintenance, not transformation.

3) Z-ORDER: make your filters cheaper

Z-Ordering is for the case where you frequently query:

  • WHERE tenantId = ...
  • WHERE customerId = ...
  • WHERE deviceId = ... AND eventTime BETWEEN ...

Example:

OPTIMIZE my_table ZORDER BY (tenantId, eventDate);

Pick 1–3 columns that dominate your interactive workloads. If you try to z-order on everything, you’ll mostly burn compute for little benefit.

4) VACUUM: clean up old, unreferenced files (carefully)

VACUUM is about storage hygiene. Delta keeps old files around to support time travel and concurrent readers. VACUUM deletes files that are no longer referenced and older than the configured retention threshold.

VACUUM my_table;

Two practical rules:

  1. Don’t VACUUM aggressively unless you understand the impact on time travel / rollback.
  2. Treat the retention window as a governance decision (what rollback window do you want?) not just a cost optimization.
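A minimal sketch in a Fabric notebook (the table name is a placeholder; both statements are standard Delta Lake SQL):

# Preview which files VACUUM would delete, without removing anything
spark.sql("VACUUM my_table DRY RUN").show(truncate=False)

# State the retention window explicitly; 168 hours matches the default 7 days
spark.sql("VACUUM my_table RETAIN 168 HOURS")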

5) Fabric-specific gotchas (the ones that actually bite)

Where you can run these commands

These are Spark SQL maintenance commands. In Fabric, that means notebooks / Spark job definitions (or the Lakehouse maintenance UI), not the SQL Analytics Endpoint.

V-Order and OPTIMIZE

Fabric also has V-Order, which is a Parquet layout optimization aimed at faster reads across Fabric engines. If you’re primarily optimizing for downstream read performance (Power BI/SQL/Spark), it’s worth understanding whether V-Order is enabled for your workspace and table writes.
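One quick way to check from a notebook. Note that the session config name below is an assumption based on current Fabric documentation and has changed across runtime versions, so verify it for your runtime before relying on it:

# Session-level writer setting for V-Order (config name per current Fabric
# docs; confirm for your runtime version)
print(spark.conf.get("spark.sql.parquet.vorder.enabled", "<not set>"))

# Table properties can also indicate whether V-Order was applied at write time
spark.sql("SHOW TBLPROPERTIES my_table").show(truncate=False)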

A lightweight maintenance pattern that scales

  • Nightly/weekly: OPTIMIZE high-value tables (or recent partitions)
  • Weekly/monthly: Z-ORDER tables with stable query patterns
  • Monthly: VACUUM tables where your org’s time travel policy is clear

Treat it like index maintenance: regular, boring, measurable.


This post was written with help from ChatGPT 5.2

OneLake catalog in Microsoft Fabric: Explore, Govern, and Secure

If your Fabric tenant has grown past “a handful of workspaces,” the problem isn’t just storage or compute—it’s finding the right items, understanding what they are, and making governance actionable.

That’s the motivation behind the OneLake catalog: a central hub to discover and manage Fabric content, with dedicated experiences for discovery (Explore), governance posture (Govern), and security administration (Secure).

This post is a practical walk-through of what’s available today, with extra focus on what Fabric admins get in the Govern experience.

What is the OneLake catalog?

Microsoft describes the OneLake catalog as a centralized place to find, explore, and use Fabric items—and to govern the data you own.

You open it from the Fabric navigation pane by selecting the OneLake icon.

Explore tab: tenant-wide discovery without losing context

The Explore tab is the “inventory + details” experience:

  • An items list of Fabric content you can access (and in some cases, content you can request access to).
  • An in-context details pane so you can inspect an item without navigating away from your filtered list.
  • Filters and selectors to narrow scope (for example: workspace, item-type categories, endorsement, and tags).

A key pattern here is fast triage: filter down to a domain/workspace, then click through items to answer:

  • Who owns this?
  • Where does it live?
  • When was it refreshed?
  • Is it endorsed/certified?
  • Does it have sensitivity labeling?

Tip for data engineers

If your tenant uses domains, scoping the catalog to a domain/subdomain is often the quickest way to keep the item list meaningful—especially when teams create similar notebooks/pipelines across many workspaces.

Govern tab: governance posture + recommended actions

The Govern tab is where the catalog becomes more than “a directory.” It combines:

  • Insights (high-level indicators you can drill into)
  • Recommended actions (with step-by-step remediation guidance)
  • Links to relevant tools and learning resources

Admin view vs. data owner view

The Govern tab behaves differently depending on who you are:

  • Fabric admins see insights based on tenant metadata (items, workspaces, capacities, domains).
  • Data owners see insights scoped to items they own (using the My items concept).

The Fabric blog also calls out a preview experience that extends the OneLake catalog governance view for Fabric admins, providing consolidated indicators and deeper drill-down reporting.

What admins see on the Govern tab

From the Fabric admin perspective, the Govern experience is designed to answer:

  • What does our data estate look like (inventory, distribution, usage)?
  • Where are we under-labeled or non-compliant (sensitivity coverage, policy posture)?
  • What content is hard to trust or reuse (freshness, endorsement/description/tag coverage, sharing patterns)?

When admins choose View more, Learn documentation describes an expanded report with three areas:

  1. Manage your data estate (inventory, capacities/domains, feature usage)
  2. Protect, secure & comply (sensitivity label coverage and data loss prevention policy posture)
  3. Discover, trust, and reuse (freshness, curation signals such as endorsement/description coverage, sharing)

A detail worth knowing: refresh cadence differs for admins

Per Microsoft Learn, admin insights and actions are based on Admin Monitoring Storage data and refresh automatically every day, so there can be a lag between changes you make and what the Govern insights reflect.

Secure tab: centralized security role management

The OneLake catalog Secure tab is a security administration surface that centralizes:

  • Workspace roles and permissions (for auditing access)
  • OneLake security roles across workspaces and item types

From the Secure tab, admins can create, edit, or delete OneLake security roles from a single location.

A practical workflow to adopt (teams + admins)

Here’s a lightweight approach that scales better than “ask around on Teams”:

  1. Explore: Use domain/workspace scoping + filters to find candidate items.
  2. Inspect: Use the in-context details pane to sanity-check ownership, endorsement, sensitivity, and freshness.
  3. Govern: Use the recommended actions cards to drive a small number of measurable improvements:
    • increase sensitivity label coverage
    • improve endorsement/certification where appropriate
    • standardize descriptions/tags for key assets
  4. Secure: Audit role sprawl and standardize how OneLake security roles are managed across items.

Considerations and limitations to keep in mind

A few constraints called out in Learn documentation (useful when you’re setting expectations):

  • The Govern tab doesn’t support cross-tenant scenarios or guest users.
  • The Govern tab isn’t available when Private Link is activated.
  • Govern insights for admins can be up to a day behind due to daily refresh of admin monitoring storage.


This post was written with help from ChatGPT 5.2

Understanding Spark Execution in Microsoft Fabric

Spark performance work is mostly execution work: understanding where the DAG splits into stages, where shuffles happen, and why a handful of tasks can dominate runtime.

This post is a quick, practical refresher on the Spark execution model — with Fabric-specific pointers on where to observe jobs, stages, and tasks.

1) The execution hierarchy: Application → Job → Stage → Task

In Spark, your code runs as a Spark application. When you run an action (for example, count(), collect(), or writing a table), Spark submits a job. Each job is broken into stages, and each stage runs a set of tasks (often one task per partition).

A useful mental model:

  • Tasks are the unit of parallel work.
  • Stages group tasks that can run together without needing data from another stage.
  • Stage boundaries often show up where a shuffle is required (wide dependencies like joins and aggregations).

2) Lazy evaluation: why “nothing happens” until an action

Most DataFrame / Spark SQL transformations are lazy. Spark builds a plan and only executes when an action forces it.

Example (PySpark):

from pyspark.sql.functions import col

df = spark.read.table("fact_sales")
# Transformations (lazy)
filtered = df.filter(col("sale_date") >= "2026-01-01")

# Action (executes)
print(filtered.count())


This matters in Fabric notebooks because a single cell can trigger multiple jobs (for example, one job to materialize a cache and another to write output).

3) Shuffles: the moment your DAG turns expensive

A shuffle is when data must be redistributed across executors (typically by key). Shuffles introduce:

  • network transfer
  • disk I/O (shuffle files)
  • spill risk (memory pressure)
  • skew/stragglers (a few hot partitions dominate)

If you’re diagnosing a slow pipeline, assume a shuffle is the culprit until proven otherwise.
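A quick way to see where the shuffle boundaries are before you run anything: check the physical plan for Exchange nodes. A minimal sketch (the column names are placeholders):

from pyspark.sql import functions as F

df = spark.read.table("fact_sales")
agg = df.groupBy("store_id").agg(F.sum("amount").alias("total"))  # placeholder columns

# Exchange nodes in the physical plan mark shuffles; at runtime each one
# typically corresponds to a stage boundary.
agg.explain()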

4) What to check in Fabric: jobs, stages, tasks

Fabric gives you multiple ways to see execution progress:

  • Notebook contextual monitoring: a progress indicator for notebook cells, with stage/task progress.
  • Spark monitoring / detail monitoring: drill into a Spark application and see jobs, stages, tasks, and duration breakdowns.

When looking at a slow run, focus on:

  • stages with large shuffle read/write
  • long-tail tasks (stragglers)
  • spill metrics (memory → disk)
  • skew indicators (a few tasks far slower than the median)

5) A repeatable debugging workflow (that scales)

  1. Start with the plan: run df.explain(True) for DataFrame/Spark SQL
    • Look for Exchange operators (shuffle) and join strategies (broadcast vs shuffle join)
  2. Run once, then open monitoring: identify the longest stage(s)
    • Confirm whether it’s CPU-bound, shuffle-bound, or spill-bound
  3. Apply the common fixes in order (see the sketch after this list):
    • Avoid the shuffle (broadcast small dims)
    • Reduce shuffle volume (filter early, select only needed columns)
    • Fix partitioning (repartition by join keys; avoid extreme partition counts)
    • Turn on AQE (spark.sql.adaptive.enabled=true) to let Spark coalesce shuffle partitions and mitigate skew
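For the AQE step, here’s a minimal sketch of how to confirm it’s actually taking effect (the checklist below asks exactly that). With AQE enabled, the physical plan is wrapped in an AdaptiveSparkPlan node, and re-running explain() after an action shows the final, runtime-adjusted plan. Table and column names are placeholders:

spark.conf.set("spark.sql.adaptive.enabled", "true")

df = spark.read.table("fact_sales")
result = df.groupBy("sale_date").count()   # placeholder aggregation

result.explain()   # AdaptiveSparkPlan with isFinalPlan=false (not yet executed)
result.count()     # run an action so AQE can re-optimize at runtime
result.explain()   # now shows isFinalPlan=true with the adjusted plan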

Quick checklist

  • Do I know which stage is dominating runtime?
  • Is there an Exchange / shuffle boundary causing it?
  • Are a few tasks straggling (skew), or are all tasks uniformly slow?
  • Am I broadcasting what should be broadcast?
  • Is AQE enabled, and is it actually taking effect?


This post was written with help from ChatGPT 5.2

Fabric Spark Shuffle Tuning: AQE + partitions for Faster Joins

Shuffles are where Spark jobs go to get expensive: a wide join or aggregation forces data to move across the network, materialize shuffle files, and often spill when memory pressure spikes.

In Microsoft Fabric Spark workloads, the fastest optimization is usually the boring one: avoid the shuffle when you can, and when you can’t, make it smaller and better balanced.

This post lays out a practical, repeatable approach you can apply in Fabric notebooks and Spark job definitions.

1) Start with the simplest win: avoid the shuffle

If one side of your join is genuinely small (think lookup/dimension tables), use a broadcast join so Spark ships the small table to executors and avoids a full shuffle.

In Fabric’s Spark best practices, Microsoft explicitly calls out broadcast joins for small lookup tables as a way to avoid shuffles entirely.

Example (PySpark):

from pyspark.sql.functions import broadcast

fact = spark.read.table("fact_sales")
dim  = spark.read.table("dim_product")

# If dim_product is small enough, broadcast it
joined = fact.join(broadcast(dim), on="product_id", how="left")

If you can’t broadcast safely, move to the next lever.

2) Make the shuffle less painful: tune shuffle parallelism

Spark controls the number of shuffle partitions for joins and aggregations with spark.sql.shuffle.partitions (default: 200 in Spark SQL).

  • Too few partitions → huge partitions → long tasks, spills, and stragglers.
  • Too many partitions → tiny tasks → scheduling overhead and excess shuffle metadata.

Example (session-level setting):

spark.conf.set("spark.sql.shuffle.partitions", "400")

A decent heuristic is to start with something proportional to total executor cores and then iterate using the Spark UI (watch stage task durations, shuffle read/write sizes, and spill metrics).

3) Let Spark fix itself (when it can): enable AQE

Adaptive Query Execution (AQE) uses runtime statistics to optimize a query as it runs.

Fabric’s Spark best practices recommend enabling AQE to dynamically optimize shuffle partitions and handle skewed data automatically.

AQE is particularly helpful when:

  • Your input data distribution changes day-to-day
  • A static spark.sql.shuffle.partitions value is right for some workloads but wrong for others
  • You hit skew where a small number of partitions do most of the work

Example:

spark.conf.set("spark.sql.adaptive.enabled", "true")

4) Diagnose like you mean it: what to look for in Spark UI

When a job is slow, treat it like a shuffle problem until proven otherwise.

Look for:

  • Stages where a handful of tasks take dramatically longer than the median (classic skew)
  • Large shuffle read/write sizes concentrated in a small number of partitions
  • Spill (memory → disk) spikes during joins/aggregations

When you see skew, your options are usually:

  • Broadcast (if feasible)
  • Repartition on a better key
  • Salt hot keys (advanced; see the sketch after this list)
  • Enable AQE and confirm it’s actually taking effect
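For the salting option, here’s a minimal sketch using the fact/dim DataFrames from the broadcast example above (the salt count is an assumption you’d tune to the observed skew):

from pyspark.sql import functions as F

NUM_SALTS = 8  # tune to the degree of skew

# Large, skewed side: assign each row a random salt
fact_salted = fact.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Small side: replicate each row once per salt value
salts = spark.range(NUM_SALTS).select(F.col("id").cast("int").alias("salt"))
dim_salted = dim.crossJoin(salts)

# Join on the original key plus the salt, then drop the helper column
joined = (
    fact_salted
    .join(dim_salted, on=["product_id", "salt"], how="left")
    .drop("salt")
)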

A minimal checklist for Fabric Spark teams

  1. Use DataFrame APIs (keep Catalyst in play).
  2. Broadcast small lookup tables to avoid shuffles.
  3. Set a sane baseline for spark.sql.shuffle.partitions.
  4. Enable AQE and validate in the query plan / UI.
  5. Iterate with the Spark UI: measure, change one thing, re-measure.


This post was written with help from ChatGPT 5.2

OneLake Shortcuts + Spark: Practical Patterns for a Single Virtual Lakehouse

If you’ve adopted Microsoft Fabric, there’s a good chance you’re trying to reduce the number of ‘copies’ of data that exist just so different teams and engines can access it.

OneLake shortcuts are one of the core primitives Fabric provides to unify data across domains, clouds, and accounts by making OneLake a single virtual data lake namespace.

For Spark users specifically, the big win is that shortcuts appear as folders in OneLake—so Spark can read them like any other folder—and Delta-format shortcuts in the Lakehouse Tables area can be surfaced as tables.

What a OneLake shortcut is (and isn’t)

A shortcut is an object in OneLake that points to another storage location (internal or external to OneLake).

Shortcuts appear as folders and behave like symbolic links: deleting a shortcut doesn’t delete the target, but moving/renaming/deleting the target can break the shortcut.

From an engineering standpoint, that means you should treat shortcuts as a namespace mapping layer—not as a durability mechanism.

Where you can create shortcuts: Lakehouse Tables vs Files

In a Lakehouse, you create shortcuts either under the top-level Tables folder or anywhere under the Files folder.

Tables has constraints: OneLake doesn’t support shortcuts in subdirectories of the Tables folder, and shortcuts in Tables are typically meant for targets that conform to the Delta table format.

Files is flexible: there are no restrictions on where you can create shortcuts in the Files hierarchy, and table discovery does not happen there.

If a shortcut in the Tables area points to Delta-format data, the lakehouse can synchronize metadata and recognize the folder as a table.

One documented gotcha: the Delta format doesn’t support table names with space characters, and OneLake won’t recognize any shortcut containing a space in the name as a Delta table.

How Spark reads from shortcuts

In notebooks and Spark jobs, shortcuts appear as folders in OneLake, and Spark can read them like any other folder.

For table-shaped data, Fabric automatically recognizes shortcuts in the Tables section of the lakehouse that have Delta/Parquet data as tables—so you can reference them directly from Spark.

Microsoft Learn also notes you can use relative file paths to read data directly from shortcuts, and Delta shortcuts in Tables can be read via Spark SQL syntax.
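A minimal sketch of both access paths, assuming the lakehouse is attached as the notebook’s default and using placeholder names: a Delta shortcut under Tables reads like any other table, while a shortcut under Files reads like a folder.

# Delta shortcut under Tables: read it like a regular lakehouse table
orders = spark.read.table("my_lakehouse.orders_shortcut")
spark.sql("SELECT COUNT(*) FROM my_lakehouse.orders_shortcut").show()

# Shortcut under Files: read it like a folder, using the relative Files path
raw = (
    spark.read
    .option("header", "true")
    .csv("Files/external_raw/")   # placeholder path to a Files shortcut
)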

Practical patterns (what I recommend in real projects)

Pattern 1: Use Tables shortcuts for shared Delta tables you want to show up consistently across Fabric engines (Spark + SQL + Direct Lake scenarios via semantic models reading from shortcuts).

Pattern 2: Use Files shortcuts when you need arbitrary formats or hierarchical layouts (CSV/JSON/images, nested partitions, etc.) and you’re fine treating it as file access.

Pattern 3: Prefer shortcuts over copying/staging when your primary goal is to eliminate edge copies and reduce latency from data duplication workflows.

Pattern 4: When you’re operationalizing Spark notebooks, make the access path explicit and stable by using the shortcut path (the place it appears) rather than hard-coding a target path that might change.

Operational gotchas and guardrails

Because moving/renaming/deleting a target path can break a shortcut, add lightweight monitoring for “broken shortcut” failures in your pipelines (and treat them like dependency failures).

For debugging, the lakehouse UI can show the ABFS path or URL for a shortcut in its Properties pane, which you can copy for inspection or troubleshooting.

Outside of Fabric, services can access OneLake through the OneLake API, which supports a subset of ADLS Gen2 and Blob storage APIs.

Summary

Shortcuts give Spark a clean way to treat OneLake like a unified namespace: read shortcuts as folders, surface Delta/Parquet data in Tables as tables, and keep your project’s logical paths stable even when physical storage locations vary.


This post was written with help from ChatGPT 5.2

When ‘Native Execution Engine’ Doesn’t Stick: Debugging Fabric Environment Deployments with fabric-cicd

If you’re treating Microsoft Fabric workspaces as source-controlled assets, you’ve probably started leaning on code-first deployment tooling (either Fabric’s built-in Git integration or community tooling layered on top).

One popular option is the open-source fabric-cicd Python library, which is designed to help implement CI/CD automations for Fabric workspaces without having to interact directly with the underlying Fabric APIs.

For most Fabric items, a ‘deploy what’s in Git’ model works well—until you hit a configuration that looks like it’s in source control, appears in deployment logs, but still doesn’t land in the target workspace.

This post walks through a real example from fabric-cicd issue #776: an Environment artifact where the “Enable native execution engine” toggle does not end up enabled after deployment, even though the configuration appears present and the PATCH call returns HTTP 200.

Why this setting matters: environments are the contract for Spark compute

A Fabric environment contains a collection of configurations, including Spark compute properties, that you can attach to notebooks and Spark jobs.

That makes environments a natural CI/CD unit: you can standardize driver/executor sizing, dynamic executor allocation, and Spark properties across many workloads.

Environments are also where Fabric exposes the Native Execution Engine (NEE) toggle under Spark compute → Acceleration.

Microsoft documents that enabling NEE at the environment level causes subsequent jobs and notebooks associated with that environment to inherit the setting.

NEE reads as enabled in source, but ends up disabled in the target

In the report, the Environment’s source-controlled Sparkcompute.yml includes enable_native_execution_engine: true along with driver/executor cores and memory, dynamic executor allocation, Spark properties, and a runtime version.

The user then deploys to a downstream workspace (PPE) using fabric-cicd and expects the deployed Environment to show the Acceleration checkbox enabled.

Instead, the target Environment shows the checkbox unchecked (false), even though the deployment logs indicate that Spark settings were updated.

A key signal in the debug log: PATCH request includes the field, response omits it

The issue includes a fabric-cicd debug snippet showing a PATCH to an environments .../sparkcompute endpoint where the request body contains enableNativeExecutionEngine set to true.

However, the response body shown in the issue includes driver/executor sizing and Spark properties but does not include enableNativeExecutionEngine.

The user further validates the discrepancy by exporting/syncing the PPE workspace back to Git: the resulting Sparkcompute.yml shows enable_native_execution_engine: false.

What to do today: treat NEE as a “verify after deploy” setting

Until the underlying behavior is fixed, assume this flag can drift across environments even when other Spark compute properties deploy correctly.

Practically, that means adding a post-deploy verification step for downstream workspaces—especially if you rely on NEE for predictable performance or cost.

Checklist: a lightweight deployment guardrail

Here’s a low-friction way to catch this class of issue early (even if you don’t have an automated API read-back step yet):

  • Ensure the source-controlled Sparkcompute.yml includes enable_native_execution_engine: true.
  • Deploy with verbose/debug logging and confirm the PATCH body contains enableNativeExecutionEngine: true.
  • After deployment, open the target Environment → Spark compute → Acceleration and verify the checkbox state.
  • Optionally: export/sync the target workspace back to Git and confirm the exported Sparkcompute.yml matches your intent.

Workarounds (choose your tradeoff)

If you’re blocked, the simplest workaround is operational: enable NEE in the target environment via the UI after deployment and treat it as a manual step until the bug is resolved.

If you need full automation, a more robust approach is to add a post-deploy validation/remediation step that checks the environment setting and re-applies it if it’s not set.
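A minimal sketch of that read-back step, assuming you already have a Fabric API bearer token and the environment’s sparkcompute endpoint from your own deployment logs (the exact path is deliberately left as a placeholder; use whatever your fabric-cicd debug output shows the PATCH going to):

import requests

# Placeholders: take both values from your deployment logs / auth flow
SPARKCOMPUTE_URL = "<the environments .../sparkcompute endpoint from your debug log>"
TOKEN = "<Fabric API bearer token>"

resp = requests.get(SPARKCOMPUTE_URL, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
settings = resp.json()

# enableNativeExecutionEngine is the field name shown in the issue's PATCH body
if settings.get("enableNativeExecutionEngine") is not True:
    raise RuntimeError(
        "NEE is not enabled in the target environment after deployment; "
        f"sparkcompute returned: {settings}"
    )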

Reporting and tracking

If you’re affected, add reproducibility details (runtime version, library version, auth mode) and any additional debug traces to issue #776 so maintainers can confirm whether the API ignores the field, expects a different contract, or requires a different endpoint/query parameter.

Even if you don’t use fabric-cicd, the pattern is broadly relevant: CI/CD is only reliable when you can round-trip configuration (write, then read-back to verify) for each control surface you’re treating as ‘source of truth.’

Closing thoughts

Native Execution Engine is positioned as a straightforward acceleration you can enable at the environment level to benefit subsequent Spark workloads.

When that toggle doesn’t deploy as expected, the pragmatic response is to verify after deploy, document the drift, and keep your CI/CD pipeline honest by validating the settings you care about—not just the HTTP status code.


This post was written with help from ChatGPT 5.2