velox – Christopher Finlan

The Production Migration Checklist for Fabric's Native Execution Engine

You have been running Spark on the JVM for years. It works. Your pipelines finish before the SLA alarm fires, your data scientists get their DataFrames, and you have learned to live with the garbage collector the way one learns to coexist with a roommate who occasionally rearranges all the furniture at 3 AM.

Then Microsoft shipped the Native Execution Engine for Fabric Spark, and the pitch is seductive: swap the JVM’s row-at-a-time processing for a vectorized C++ execution layer built on Meta’s Velox and Apache Gluten, get up to 6x faster query performance on compute-heavy workloads, change zero lines of code, pay nothing extra. Microsoft’s TPC-DS benchmarks at 1 TB scale show roughly 4x improvement over vanilla open-source Spark. Internal Fabric workloads have hit 6x.

Those are real numbers. But “flip the switch and go faster” is a marketing sentence, not an engineering plan. What follows is the checklist your team needs to move production Spark workloads onto the Native Execution Engine without discovering exciting new failure modes at 2 AM on a Tuesday.

Prerequisite Zero: Understand What You Are Opting Into

The Native Execution Engine does not replace Spark. It replaces Spark’s JVM-based physical execution operators — the actual computation — with native C++ equivalents for supported operations. Everything above the physical plan remains untouched: SQL parsing, logical optimization, cost-based rewrites, adaptive query execution, predicate pushdown, column pruning. None of that moves.

Here is the handoff in concrete terms. Spark produces its optimized physical plan as it always has. Apache Gluten intercepts that plan, identifies which operators have native C++ implementations in Velox, and swaps those nodes out. Velox executes them using columnar batches and SIMD instructions, processing 8, 16, or 32 values per CPU instruction instead of iterating row by row through JVM objects.

For operators Velox does not yet support, the engine falls back to standard Spark execution. The transition at the native/JVM boundary requires columnar-to-row and row-to-columnar conversions. These conversions cost real time. A workload that triggers frequent fallbacks can run slower with the engine enabled than without it.

That last sentence matters more than the benchmark numbers. The Native Execution Engine is a selective replacement of physical operators, not a uniform accelerator. Your performance outcome depends on how much of your workload stays in native territory.
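To make that trade-off concrete, here is a toy cost model. It is a back-of-the-envelope sketch, not how Fabric or Gluten estimates anything: the function name, its parameters, and the numbers fed into it are all made up for illustration.

```python
def estimated_runtime(jvm_seconds, native_fraction, native_speedup,
                      boundary_crossings, seconds_per_crossing):
    """Toy estimate of job runtime with the native engine enabled.

    Every input is a hypothetical figure you would have to measure yourself:
      jvm_seconds          - runtime with the engine disabled
      native_fraction      - share of that runtime in operators that run natively (0..1)
      native_speedup       - speedup of the native share, e.g. 4.0
      boundary_crossings   - number of native/JVM transitions in the plan
      seconds_per_crossing - conversion cost paid at each transition
    """
    if not 0.0 <= native_fraction <= 1.0:
        raise ValueError("native_fraction must be between 0 and 1")
    native_part = jvm_seconds * native_fraction / native_speedup
    jvm_part = jvm_seconds * (1.0 - native_fraction)
    conversion_part = boundary_crossings * seconds_per_crossing
    return native_part + jvm_part + conversion_part


# A 600 s job that stays 90% native with only two boundary crossings.
mostly_native = estimated_runtime(600, 0.9, 4.0, 2, 5.0)

# The same job when only 20% runs natively and the plan crosses the boundary 40 times.
fragmented = estimated_runtime(600, 0.2, 4.0, 40, 5.0)
```

With these invented inputs the mostly native plan lands well under the 600-second baseline, while the fragmented plan ends up slower than leaving the engine off. The absolute numbers mean nothing; the shape of the curve is the point.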

Step 1: Confirm You Are on Runtime 1.3

The engine requires Fabric Runtime 1.3 (Apache Spark 3.5, Delta Lake 3.2). Runtime 1.2 support has been discontinued — and here is the dangerous part — silently. If you are still on 1.2, native acceleration is disabled without warning. You will not get an error. You will get no speedup. You will blame the engine rather than your runtime version. Check this first.

Action items:
– Open each Fabric workspace running production Spark workloads
– Navigate to Workspace Settings → Data Engineering/Science → Spark Settings
– Confirm Runtime 1.3 is selected
– If you are on Runtime 1.2, plan the runtime upgrade as a separate migration with its own validation cycle. Spark 3.4 to 3.5 brings behavioral changes unrelated to the native engine, and you do not want to debug two migrations at once
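If you would rather have the notebook fail loudly than trust a settings page, a small guard in the first code cell can help. This is a sketch under one assumption: that checking for Spark 3.5 is a good enough proxy for Runtime 1.3, which is the pairing described above. The helper names are hypothetical.

```python
def spark_major_minor(version_string):
    """Return (major, minor) parsed from a Spark version string such as '3.5.1'."""
    parts = version_string.split(".")
    return int(parts[0]), int(parts[1])


def on_native_capable_runtime(version_string, required=(3, 5)):
    """True when the Spark version matches the one Runtime 1.3 ships with."""
    return spark_major_minor(version_string) == required
```

In a notebook you would call it as `on_native_capable_runtime(spark.version)` and raise if it returns False. It proves the Spark version, not the Fabric runtime label, so treat it as a tripwire rather than a substitute for checking the workspace settings.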

Step 2: Audit Your Workloads

Not every job benefits equally. The engine does its best work on compute-intensive analytical queries — aggregations, joins, filters, projections, complex expressions — over Parquet and Delta data. It adds less to I/O-bound workloads or jobs dominated by Python UDFs that run outside the Spark execution engine entirely.

Build a four-tier inventory:

Tier 1 — High-value candidates: Long-running batch ETL with heavy aggregations and joins over Delta tables. These are your biggest CU consumers and your biggest potential beneficiaries. Think: the nightly pipeline that computes vendor aggregates across three years of transaction data, currently consuming 45 minutes of a large cluster.
Tier 2 — Likely beneficiaries: Interactive notebooks running analytical queries. Data science feature engineering pipelines that stack transformations before model training.
Tier 3 — Uncertain: Workloads using exotic operators, deeply nested struct types, or heavy UDF logic. These need individual testing because you cannot predict fallback behavior from the code alone.
Tier 4 — Skip for now: Streaming workloads, jobs dominated by external API calls, or workloads where Python UDF processing accounts for most of the wall-clock time.

Migrate Tier 1 first. You need evidence that the engine delivers measurable wins on your actual workloads before you spend political capital rolling it out everywhere.
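The inventory itself can live in a spreadsheet, but encoding the rules keeps the tiering consistent across teams. The sketch below is one possible reading of the four tiers; the field names are hypothetical, and the 0.5 cutoffs are my interpretation of "most of the wall-clock time".

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All fields are illustrative attributes you would fill in by hand.
    name: str
    is_streaming: bool = False
    python_udf_time_share: float = 0.0      # fraction of wall-clock time, 0..1
    external_call_time_share: float = 0.0   # fraction of wall-clock time, 0..1
    uses_exotic_ops_or_nested_structs: bool = False
    scheduled_batch: bool = False
    heavy_joins_or_aggregations: bool = False


def assign_tier(w):
    """Map a workload to tiers 1-4, checking the disqualifiers first."""
    if (w.is_streaming
            or w.python_udf_time_share > 0.5
            or w.external_call_time_share > 0.5):
        return 4  # skip for now
    if w.uses_exotic_ops_or_nested_structs:
        return 3  # uncertain, needs individual testing
    if w.scheduled_batch and w.heavy_joins_or_aggregations:
        return 1  # high-value candidate
    return 2      # likely beneficiary
```

Checking Tier 4 and Tier 3 first is deliberate: a nightly batch job with heavy joins still belongs in the "skip" bucket if a Python UDF eats most of its runtime.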

Step 3: Create a Non-Production Test Environment

Do not enable the engine on production and hope. Create a dedicated Fabric environment:

– In the Fabric portal, create a new Environment item
– Navigate to the Acceleration tab
– Check Enable native execution engine
– Save and Publish

Attach this environment to a non-production workspace. Run your Tier 1 workloads against it using production-scale data. This matters: performance characteristics at 10 GB do not predict behavior at 10 TB, because operator fallback patterns depend on data distributions, not just query structure.

For quick per-notebook testing without a full environment, drop this in your first cell:

%%configure
{
    "conf": {
        "spark.native.enabled": "true"
    }
}

Because %%configure is applied when the session starts, keep it in the first cell. No environment change or publish is needed, which makes A/B comparisons trivial.
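For the A/B comparison itself, a small timing harness keeps the two runs honest. This is a minimal sketch: `time_ab` and its arguments are hypothetical names, and it assumes you flip the engine with the same `spark.native.enabled` setting shown in the rollback step.

```python
import statistics
import time


def time_ab(run_workload, set_native, repeats=3):
    """Time run_workload() with the engine off, then on; return median seconds.

    run_workload: zero-argument callable that runs the job end to end.
    set_native:   callable taking a bool, for example
                  lambda on: spark.conf.set("spark.native.enabled", str(on).lower())
    """
    medians = {}
    for label, enabled in (("jvm", False), ("native", True)):
        set_native(enabled)
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_workload()
            samples.append(time.perf_counter() - start)
        medians[label] = statistics.median(samples)
    return medians
```

Two cautions. `run_workload` has to trigger a Spark action such as a table write, otherwise you are timing lazy plan construction. And the JVM pass runs first, so any caching it warms up flatters the native pass; swap the order once to check. The harness leaves the engine enabled when it returns.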

Step 4: Measure Baselines

You cannot prove improvement without a baseline. For each Tier 1 workload, capture:

– Wall-clock duration from the Spark UI (total job time, not stage time, which ignores scheduling and shuffle overhead)
– CU consumption from Fabric monitoring (this is what your budget cares about)
– Spark Advisor warnings in the current state, so you can distinguish new warnings from pre-existing noise after enabling native execution
– Row counts and checksums on output tables, because correctness verification requires a pre-migration snapshot

Store these in a Delta table. You will reference them for weeks.
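One way to keep those measurements uniform is a small record type that every baseline run fills in. The sketch below is illustrative only: the class, its field names, and the unit for CU consumption are assumptions, not a Fabric schema.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class BaselineRecord:
    # Hypothetical schema; rename the fields to match your own conventions.
    workload: str
    native_enabled: bool
    wall_clock_seconds: float
    cu_consumed: float          # in whatever unit your Fabric monitoring view reports
    advisor_warnings: int
    output_row_count: int
    output_checksum: str
    captured_at: str = ""

    def __post_init__(self):
        # Stamp the record with a UTC timestamp unless one was supplied.
        if not self.captured_at:
            self.captured_at = datetime.now(timezone.utc).isoformat()


record = BaselineRecord(
    workload="nightly_vendor_aggregates",
    native_enabled=False,
    wall_clock_seconds=2700.0,
    cu_consumed=0.0,            # placeholder value
    advisor_warnings=2,
    output_row_count=1_250_000,
    output_checksum="placeholder",
)
row = asdict(record)
```

From there, something along the lines of `spark.createDataFrame([Row(**row)])` followed by an append to a Delta table of your choosing would persist it. The table name is yours to pick; nothing in Fabric expects a particular one.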

Step 5: Run Native and Watch for Fallbacks

Enable the engine on your test environment and run each Tier 1 workload. Then check two things.

Performance delta: Compare wall-clock time and CU consumption against your baselines. On a genuinely compute-heavy workload, you should see at least 1.5x improvement. If you do not, something is triggering fallbacks and you are paying the columnar-to-row conversion tax without getting the native execution benefit.

Fallback alerts: The Spark Advisor now reports real-time warnings during notebook execution when operators fall back from native to JVM execution. Each alert names the specific operator that could not run natively.

The most common fallback trigger, and the most easily fixed: .show(). This call invokes collectLimit and toprettystring, neither of which has a native implementation. Replace .show() with .collect() or .toPandas() in production code. In a notebook cell you run manually for debugging, it does not matter — but inside a scheduled pipeline, every fallback is a boundary crossing.

Other triggers to watch: unsupported expression types, complex nested struct operations, and certain window function variants. For each one, ask three questions:

Can I rewrite the query to avoid it? Sometimes this is a one-line change. Sometimes it means restructuring a transformation.
Is the fallback on a critical path? A fallback in a logging cell is noise. A fallback inside your core join-and-aggregate chain is a problem.
Is the net performance still positive? If the workload runs 3x faster overall despite two fallback warnings on minor operations, accept the win and move on.
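Those three questions can be folded into a simple verdict per test run. The following is a sketch of that decision rule, using the 1.5x threshold from above; the function name, the verdict strings, and the idea of passing in fallback operator names by hand are my own assumptions.

```python
def evaluate_native_run(baseline_seconds, native_seconds, fallback_operators,
                        min_speedup=1.5):
    """Compare a native run against its baseline and suggest a next step."""
    if baseline_seconds <= 0 or native_seconds <= 0:
        raise ValueError("durations must be positive")
    speedup = baseline_seconds / native_seconds
    if speedup >= min_speedup:
        verdict = "accept"                  # net win, even if minor fallbacks remain
    elif speedup > 1.0:
        verdict = "investigate fallbacks"   # faster, but below the expected gain
    else:
        verdict = "regression"              # paying conversion cost with no benefit
    return {
        "speedup": round(speedup, 2),
        "verdict": verdict,
        "fallbacks": sorted(set(fallback_operators)),
    }
```

Calling `evaluate_native_run(2700, 900, ["CollectLimit"])` on a job that dropped from 45 to 15 minutes yields an "accept" with the fallback still listed, which matches the advice to take the win and move on.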

Step 6: Validate Data Correctness

Faster means nothing if the answers change. For each migrated workload:

– Compare output row counts between native and non-native runs on identical input data
– Run hash comparisons on key output columns
– For financial or compliance-sensitive pipelines, do a full row-level diff on a representative partition

The Native Execution Engine preserves Spark semantics, but floating-point arithmetic at boundary conditions, null handling in edge cases, and row ordering in non-deterministic operations all deserve explicit verification on your actual data. Do not skip this step because the TPC-DS numbers looked good. TPC-DS does not have your data shapes.
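For the row-level diff on a representative partition, an order-insensitive comparison with a float tolerance covers the three risks just named. This is a minimal sketch that works on rows already collected to the driver (for example from `df.collect()`), so it only suits a sample small enough to fit in memory. The function names and the nine-digit rounding are assumptions.

```python
import math
from collections import Counter


def canonical_row(row, float_digits=9):
    """Normalize a row so float noise and NaN do not cause false mismatches."""
    out = []
    for value in row:
        if isinstance(value, float):
            # NaN never equals itself, so map it to a sentinel string.
            out.append("NaN" if math.isnan(value) else round(value, float_digits))
        else:
            out.append(value)   # None, ints, strings pass through unchanged
    return tuple(out)


def diff_outputs(jvm_rows, native_rows, float_digits=9):
    """Multiset diff that ignores row order; returns rows unique to each side."""
    jvm = Counter(canonical_row(r, float_digits) for r in jvm_rows)
    native = Counter(canonical_row(r, float_digits) for r in native_rows)
    return {"only_in_jvm": jvm - native, "only_in_native": native - jvm}
```

Rounding is a blunt tolerance: two values that straddle a rounding boundary will still show up as a mismatch, so treat a handful of float-only differences as a prompt to look closer rather than as proof of a bug. Rows with array or struct columns would need to be converted to hashable values first.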

Step 7: Plan Your Rollback

The best operational property of the Native Execution Engine: it can be disabled per cell, per notebook, per environment, instantly. No restarts. No redeployments.

In PySpark:

spark.conf.set('spark.native.enabled', 'false')

In Spark SQL:

SET spark.native.enabled=FALSE;

Your rollback plan is one line of configuration. But that line only helps if your on-call engineers know it exists. Document it. Add it to your runbook. Add it to the incident response template. The worst production regression is one where the fix takes ten seconds but nobody knows about it for two hours.
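When the rollback should cover only one problematic block rather than the rest of the session, a context manager that restores the previous value is a tidy pattern. This is a sketch, not part of Fabric: `native_engine` is a hypothetical helper, and it assumes only that the object you pass behaves like `spark.conf` with `get`, `set`, and `unset`.

```python
from contextlib import contextmanager

NATIVE_KEY = "spark.native.enabled"


@contextmanager
def native_engine(conf, enabled):
    """Force the engine on or off inside a with-block, then restore the old value.

    conf is any object exposing get(key, default), set(key, value) and
    unset(key); in a notebook that would be spark.conf.
    """
    previous = conf.get(NATIVE_KEY, None)
    conf.set(NATIVE_KEY, "true" if enabled else "false")
    try:
        yield
    finally:
        # Restore even if the wrapped code raised.
        if previous is None:
            conf.unset(NATIVE_KEY)
        else:
            conf.set(NATIVE_KEY, previous)
```

Usage would look like `with native_engine(spark.conf, enabled=False): run_the_misbehaving_query()`. The try/finally matters: if the query fails, the session still returns to whatever setting it had before.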

Step 8: Roll Out Incrementally

With validation complete, enable the engine in production using one of three strategies, ordered from most cautious to broadest:

Option A, per-job enablement: Add spark.native.enabled=true to individual Spark Job Definitions or notebook configure blocks. You control exactly which workloads get native execution.

Option B, environment-level: Navigate to your production Environment → Acceleration tab → enable. All notebooks and Spark Job Definitions using this environment inherit the setting.

Option C, workspace default: Set your native-enabled environment as the workspace default via Workspace Settings → Data Engineering/Science → Environment. Everything in the workspace picks it up.

Start with Option A on your validated Tier 1 workloads. After a week of stable production runs, graduate to Option B. Option C is for teams that have fully validated their workspace and want blanket coverage.

Step 9: Monitor the First Week

Post-migration monitoring matters because production data is not test data. In the first week:

– Watch CU consumption trends in Fabric monitoring. Compute-heavy workloads should show measurable drops.
– Check the Spark Advisor for fallback warnings that did not appear during testing. Different data distributions or code paths in production can trigger different operators.
– Set alerts on job duration. A sudden increase means a new fallback or regression appeared.
– Pay attention to any jobs that were borderline in testing. Production-scale data volume can push a workload from “mostly native” to “mostly fallback” if it exercises operators that were uncommon in test data.
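The duration alert does not need anything elaborate. A comparison against the median of recent runs is enough to catch a job that suddenly fell off the native path. The sketch below assumes you already collect run durations somewhere; the function name and the 1.25 tolerance are arbitrary choices for illustration.

```python
import statistics


def duration_regressed(history_seconds, latest_seconds, tolerance=1.25):
    """True if the latest run is slower than tolerance x the median of prior runs."""
    if not history_seconds:
        return False  # nothing to compare against yet
    return latest_seconds > tolerance * statistics.median(history_seconds)
```

The median is used instead of the mean so that one slow run during a noisy night does not drag the threshold upward and mask the next regression.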

Step 10: Optimize for Maximum Native Coverage

Once stable, push further:

– Replace all .show() calls with .collect() or .toPandas() in scheduled notebook workflows
– Refactor deeply nested struct operations into flat columnar operations where the query logic allows it
– Consult the Apache Gluten documentation for the current supported operator list and avoid unsupported expressions in hot paths
– Keep data in Parquet or Delta format. The engine processes these natively, and other formats require conversion that erases the acceleration
– For write-heavy workloads, use the native Delta write acceleration that shipped with the GA release, which extends native execution into the output path rather than just the read and transform stages
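Hunting down stray .show() calls by eye gets old fast. A few lines of Python can scan exported notebook source for them. This is a crude sketch: `find_show_calls` is a hypothetical helper, and it relies on a regular expression rather than real parsing, so a `#` inside a string literal will confuse its comment handling.

```python
import re

SHOW_CALL = re.compile(r"\.show\s*\(")


def find_show_calls(source):
    """Return (line_number, stripped_line) for each line that calls .show(...)."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]   # crude: drop anything after a '#'
        if SHOW_CALL.search(code):
            hits.append((number, line.strip()))
    return hits


notebook_source = '''df = spark.read.table("sales")
df.show()
# df.show(5)
df.groupBy("k").count().show(20, False)
'''
hits = find_show_calls(notebook_source)
```

On the sample above it flags lines 2 and 4 and skips the commented-out call on line 3. Running it across the notebooks attached to scheduled pipelines gives you a worklist for the first bullet.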

What Does Not Change

Several things remain identical and need no migration planning:

Spark APIs: Your PySpark, Scala, and SQL code is unchanged. The engine operates below the API surface.
Delta Lake semantics: ACID transactions, time travel, schema enforcement — all handled by the same Delta Lake 3.2 layer on Runtime 1.3.
Cost model: No additional CU charges. Your jobs finish faster, so you consume fewer CUs for the same work. The pricing advantage is indirect but real.
Fault tolerance: Spark still manages task retries, stage recovery, and speculative execution. The native engine handles computation; Spark handles resilience.

The Bottom Line

The Native Execution Engine is GA. It runs on the standard Fabric runtime. The performance gains are backed by Microsoft’s benchmarks: roughly 4x over open-source Spark on TPC-DS at 1 TB, and up to 6x on internal Fabric workloads. It costs nothing to enable and one line of configuration to revert.

But there is a gap between “we turned it on and things got faster” and “we know exactly which workloads got faster, by how much, what fell back, and what to do when something breaks.” The checklist above bridges that gap.

Runtime 1.3. Audit. Test. Baselines. Fallbacks. Correctness. Rollback. Incremental rollout. Monitor. Optimize.

Ten steps. Zero heroics. Measurably faster Spark.

This post was written with help from anthropic/claude-opus-4-6

Optimizing Spark Performance with the Native Execution Engine (NEE) in Microsoft Fabric

Spark tuning often starts with the usual suspects (shuffle volume, skew, join strategy, caching)… but sometimes the biggest win is simply executing the same logical plan on a faster engine.

Microsoft Fabric’s Native Execution Engine (NEE) does exactly that: it keeps Spark’s APIs and control plane, but runs a large portion of Spark SQL / DataFrame execution on a vectorized C++ engine.

What NEE is (and why it’s fast)

NEE is a vectorized native engine that integrates into Fabric Spark and can accelerate many SQL/DataFrame operators without you rewriting your code.

You still write Spark SQL / DataFrames.
Spark still handles distributed execution and scheduling.
For supported operators, compute is offloaded to a native engine (reducing JVM overhead and using columnar/vectorized execution).

Fabric documentation calls out NEE as being based on Apache Gluten (the Spark-to-native glue layer) and Velox (the native execution library).

When NEE tends to help the most

NEE shines when your workload is:

SQL-heavy (joins, aggregates, projections, filters)
CPU-bound (compute dominates I/O)
Primarily on Parquet / Delta

You’ll see less benefit, or outright fallback to the JVM, when you rely on features NEE doesn’t support yet.

How to enable NEE (3 practical options)

1) Environment-level toggle (recommended for teams)

In your Fabric Environment settings, go to Acceleration and enable the native execution engine, then Save + Publish.

Benefit: notebooks and Spark Job Definitions that use that environment inherit the setting automatically.

2) Enable for a single notebook / job via Spark config

In a notebook cell:

%%configure
{
    "conf": {
        "spark.native.enabled": "true"
    }
}

For Spark Job Definitions, add the same Spark property.

3) Disable/enable per-query when you hit unsupported features

If a specific query uses an unsupported operator/expression and you want to force JVM Spark for that query:

SET spark.native.enabled=FALSE;
-- run the query
SET spark.native.enabled=TRUE;

How to confirm NEE is actually being used

Two low-friction checks:

Spark UI / History Server: look for plan nodes ending with Transformer or nodes like *NativeFileScan / VeloxColumnarToRowExec.
df.explain(): the same Transformer / NativeFileScan / Velox… hints should appear in the plan.

Fabric also exposes a dedicated view (“Gluten SQL / DataFrame”) to help spot which queries ran on the native engine vs. fell back.
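If you want that check in code rather than by eye, the plan text can be tallied with a short heuristic. The sketch below only looks for the markers mentioned above (a Transformer suffix, NativeFileScan, and columnar/row conversion nodes). The helper names are hypothetical, the RowToColumnar marker is my assumption about how the reverse conversion is named, and the sample plan is hand-written for illustration, not captured from a real run.

```python
import re
from collections import Counter

# First identifier on a plan line, after tree-drawing characters and an
# optional stage prefix such as *(1) or ^(2).
NODE = re.compile(r"^[\s:|+\-]*(?:[*^]\(\d+\)\s*)?([A-Za-z][A-Za-z0-9_]*)")


def classify_node(name):
    if "ColumnarToRow" in name or "RowToColumnar" in name:
        return "boundary"   # a conversion between native and JVM execution
    if name.endswith("Transformer") or "NativeFileScan" in name:
        return "native"
    return "other"          # regular Spark operators and wrapper nodes


def summarize_plan(plan_text):
    """Count plan nodes by category; lines without a leading identifier are skipped."""
    counts = Counter()
    for line in plan_text.splitlines():
        match = NODE.match(line)
        if match:
            counts[classify_node(match.group(1))] += 1
    return dict(counts)


# Hand-written example plan, for illustration only.
SAMPLE_PLAN = """== Physical Plan ==
VeloxColumnarToRowExec
+- ^(2) HashAggregateTransformer(keys=[vendor_id], functions=[sum(amount)])
   +- ^(2) FilterExecTransformer (amount > 0)
      +- ^(2) NativeFileScan parquet [vendor_id,amount]
"""
summary = summarize_plan(SAMPLE_PLAN)
```

To feed it a real plan, one approach is to capture the printed output of `df.explain()` with `contextlib.redirect_stdout` into an `io.StringIO`. A plan dominated by "native" with one or two "boundary" nodes is healthy; a plan that alternates between "other" and "boundary" is the fragmented case worth rewriting.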

Fallback is a feature (but you should know the common triggers)

NEE includes an automatic fallback mechanism: if the plan contains unsupported features, Spark will run that portion on the JVM engine.

A few notable limitations called out in Fabric documentation:

UDFs aren’t supported (fallback)
Structured streaming isn’t supported (fallback)
File formats like CSV/JSON/XML aren’t accelerated
ANSI mode isn’t supported

There are also some behavioral differences worth remembering (rounding/casting edge cases) if you have strict numeric expectations.

A pragmatic “NEE-first” optimization workflow

1) Turn NEE on for the environment (or your job) and rerun the workload.
2) If it’s still slow, open the plan and answer: is the slow part running on the native engine, or did it fall back?
3) If it fell back, make the smallest possible change to keep the query on the native path (e.g., avoid UDFs; prefer built-in expressions; standardize on Parquet/Delta).
4) Once the plan stays mostly native, go back to classic Spark tuning: reduce shuffle volume, fix skew, choose sane partitioning, and confirm broadcast joins.

This post was written with help from ChatGPT 5.2
