Spark – Page 4 – Christopher Finlan

Fabric Spark Shuffle Tuning: AQE + Partitions for Faster Joins

Shuffles are where Spark jobs go to get expensive: a wide join or aggregation forces data to move across the network, materialize shuffle files, and often spill when memory pressure spikes.

In Microsoft Fabric Spark workloads, the fastest optimization is usually the boring one: avoid the shuffle when you can, and when you can’t, make it smaller and better balanced.

This post lays out a practical, repeatable approach you can apply in Fabric notebooks and Spark job definitions.

1) Start with the simplest win: avoid the shuffle

If one side of your join is genuinely small (think lookup/dimension tables), use a broadcast join so Spark ships the small table to executors and avoids a full shuffle.

In Fabric’s Spark best practices, Microsoft explicitly calls out broadcast joins for small lookup tables as a way to avoid shuffles entirely.

Example (PySpark):

from pyspark.sql.functions import broadcast

fact = spark.read.table("fact_sales")
dim  = spark.read.table("dim_product")

# If dim_product is small enough, broadcast it
joined = fact.join(broadcast(dim), on="product_id", how="left")

If you can’t broadcast safely, move to the next lever.

2) Make the shuffle less painful: tune shuffle parallelism

Spark controls the number of shuffle partitions for joins and aggregations with spark.sql.shuffle.partitions (default: 200 in Spark SQL).

Too few partitions → huge partitions → long tasks, spills, and stragglers.
Too many partitions → tiny tasks → scheduling overhead and excess shuffle metadata that outweigh the useful work each task does.

Example (session-level setting):

spark.conf.set("spark.sql.shuffle.partitions", "400")

A decent heuristic is to start with something proportional to total executor cores and then iterate using the Spark UI (watch stage task durations, shuffle read/write sizes, and spill metrics).
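Here is a minimal sketch of that heuristic in plain Python. The function name and the tasks_per_core multiplier are my own assumptions for illustration, not a Spark or Fabric default; treat the result as a first guess to refine in the Spark UI.

```python
def baseline_shuffle_partitions(num_executors: int,
                                cores_per_executor: int,
                                tasks_per_core: int = 2,
                                minimum: int = 1) -> int:
    """Return a starting value for spark.sql.shuffle.partitions.

    tasks_per_core is an assumed multiplier (hypothetical default of 2),
    chosen so each core gets a couple of waves of tasks.
    """
    if num_executors < 1 or cores_per_executor < 1 or tasks_per_core < 1:
        raise ValueError("executor, core, and multiplier counts must be >= 1")
    total_cores = num_executors * cores_per_executor
    return max(minimum, total_cores * tasks_per_core)


# Example sizing: 8 executors with 8 cores each
partitions = baseline_shuffle_partitions(8, 8)
print(partitions)

# In a notebook you would then apply it with:
# spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```

Change the multiplier only after you have looked at task durations and spill for the stage in question.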

3) Let Spark fix itself (when it can): enable AQE

Adaptive Query Execution (AQE) uses runtime statistics to optimize a query as it runs.

Fabric’s Spark best practices recommend enabling AQE to dynamically optimize shuffle partitions and handle skewed data automatically.

AQE is particularly helpful when:

Your input data distribution changes day-to-day
A static spark.sql.shuffle.partitions value is right for some workloads but wrong for others
You hit skew where a small number of partitions do most of the work

Example:

spark.conf.set("spark.sql.adaptive.enabled", "true")
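If you want to be explicit about the two AQE behaviors mentioned above (partition coalescing and skew-join handling), a small helper like this sets them and reads them back. The helper name is hypothetical; the configuration keys are standard Spark SQL settings, and in recent Spark 3.x releases they are typically on by default already, so this is mostly about making intent visible.

```python
# Standard Spark SQL configuration keys for AQE and two of its features.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_aqe_settings(spark, settings=AQE_SETTINGS):
    """Set each key on the session, then return the values read back.

    `spark` is expected to be an active SparkSession (as in a Fabric notebook).
    """
    for key, value in settings.items():
        spark.conf.set(key, value)
    return {key: spark.conf.get(key) for key in settings}


# In a notebook:
# print(apply_aqe_settings(spark))
```

To check that AQE is actually in play for a given query, look at the output of df.explain(): with AQE on, the physical plan is usually wrapped in an AdaptiveSparkPlan node.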

4) Diagnose like you mean it: what to look for in Spark UI

When a job is slow, treat it like a shuffle problem until proven otherwise.

Look for:

Stages where a handful of tasks take dramatically longer than the median (classic skew)
Large shuffle read/write sizes concentrated in a small number of partitions
Spill (memory → disk) spikes during joins/aggregations
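If you copy task durations out of a stage's task table, the first signal can be checked with a few lines of Python. This is a rough sketch: the function name and the 5x threshold are assumptions I picked for illustration, not a Spark convention.

```python
from statistics import median


def find_stragglers(task_durations_s, ratio=5.0):
    """Return (task_index, duration) pairs slower than ratio x the stage median."""
    if not task_durations_s:
        return []
    mid = median(task_durations_s)
    if mid <= 0:
        return []
    return [(i, d) for i, d in enumerate(task_durations_s) if d > ratio * mid]


# Hypothetical durations (seconds) for one shuffle stage
durations = [12, 11, 13, 12, 240, 10, 14, 198]
print(find_stragglers(durations))
```

A couple of tasks far above the median while the rest cluster tightly is the pattern to chase; a uniformly slow stage points at sizing rather than skew.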

When you see skew, your options are usually:

Broadcast (if feasible)
Repartition on a better key
Salt hot keys (advanced)
Enable AQE and confirm it’s actually taking effect
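Salting is the least obvious option in that list, so here is a plain-Python simulation of the idea rather than a Spark job. It assumes a simple hash partitioner (crc32 modulo the partition count) and a deterministic salt so the example is repeatable; all names are hypothetical. The fact side appends a salt to each key, and the small side is replicated once per salt value so every salted key still finds its match.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Assumed partitioner: stable hash of the key modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions):
    """Count how many rows land in each partition."""
    return Counter(partition_of(k, num_partitions) for k in keys)


def salt_fact_keys(keys, num_salts):
    """Append a salt to every fact-side key.

    Row index modulo num_salts keeps this repeatable; a real job would
    usually draw the salt at random.
    """
    return [f"{k}#{i % num_salts}" for i, k in enumerate(keys)]


def explode_dim_keys(keys, num_salts):
    """Replicate each small-side key once per salt so salted keys still join."""
    return [f"{k}#{s}" for k in keys for s in range(num_salts)]


# One hot key with 1000 rows plus 100 ordinary keys
fact_keys = ["hot"] * 1000 + [f"k{i}" for i in range(100)]

unsalted = partition_loads(fact_keys, 8)
salted = partition_loads(salt_fact_keys(fact_keys, 8), 8)
print("largest partition without salt:", max(unsalted.values()))
print("largest partition with salt:", max(salted.values()))
```

The tradeoff is visible in explode_dim_keys: the small side grows by a factor of num_salts, which is why salting is worth it only for genuinely hot keys.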

A minimal checklist for Fabric Spark teams

Use DataFrame APIs rather than RDD code so the Catalyst optimizer can plan the query.
Broadcast small lookup tables to avoid shuffles.
Set a sane baseline for spark.sql.shuffle.partitions.
Enable AQE and validate in the query plan / UI.
Iterate with the Spark UI: measure, change one thing, re-measure.

This post was written with help from ChatGPT 5.2

OneLake Shortcuts + Spark: Practical Patterns for a Single Virtual Lakehouse

If you’ve adopted Microsoft Fabric, there’s a good chance you’re trying to reduce the number of ‘copies’ of data that exist just so different teams and engines can access it.

OneLake shortcuts are one of the core primitives Fabric provides to unify data across domains, clouds, and accounts by making OneLake a single virtual data lake namespace.

For Spark users specifically, the big win is that shortcuts appear as folders in OneLake—so Spark can read them like any other folder—and Delta-format shortcuts in the Lakehouse Tables area can be surfaced as tables.

What a OneLake shortcut is (and isn’t)

A shortcut is an object in OneLake that points to another storage location (internal or external to OneLake).

Shortcuts appear as folders and behave like symbolic links: deleting a shortcut doesn’t delete the target, but moving/renaming/deleting the target can break the shortcut.

From an engineering standpoint, that means you should treat shortcuts as a namespace mapping layer—not as a durability mechanism.

Where you can create shortcuts: Lakehouse Tables vs Files

In a Lakehouse, you create shortcuts either under the top-level Tables folder or anywhere under the Files folder.

Tables has constraints: OneLake doesn’t support shortcuts in subdirectories of the Tables folder, and shortcuts in Tables are typically meant for targets that conform to the Delta table format.

Files is flexible: there are no restrictions on where you can create shortcuts in the Files hierarchy, and table discovery does not happen there.

If a shortcut in the Tables area points to Delta-format data, the lakehouse can synchronize metadata and recognize the folder as a table.

One documented gotcha: the Delta format doesn’t support table names with space characters, and OneLake won’t recognize any shortcut containing a space in the name as a Delta table.
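A cheap guardrail for that gotcha is to check shortcut names before you create them. This sketch only tests for the space character described above; the function and the sample names are hypothetical.

```python
def shortcuts_with_spaces(names):
    """Return the shortcut names that contain a space character."""
    return [name for name in names if " " in name]


# Hypothetical shortcut names planned for the Tables area
planned = ["dim_product", "fact sales", "orders_2024"]
print(shortcuts_with_spaces(planned))
```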

How Spark reads from shortcuts

In notebooks and Spark jobs, shortcuts appear as folders in OneLake, and Spark can read them like any other folder.

For table-shaped data, Fabric automatically recognizes shortcuts in the Tables section of the lakehouse that have Delta/Parquet data as tables—so you can reference them directly from Spark.

Microsoft Learn also notes you can use relative file paths to read data directly from shortcuts, and Delta shortcuts in Tables can be read via Spark SQL syntax.
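As a rough sketch of those two access styles, the helpers below wrap a relative-path read and a Spark SQL read. They assume the notebook has a default lakehouse attached so relative paths resolve against it; the helper names, the shortcut names in the usage comments, and the Parquet default are all placeholders of mine.

```python
def read_files_shortcut(spark, relative_path, fmt="parquet"):
    """Read a shortcut under Files through its relative path."""
    return spark.read.format(fmt).load(relative_path)


def read_table_shortcut(spark, table_name):
    """Query a Delta shortcut that the lakehouse has surfaced as a table."""
    return spark.sql(f"SELECT * FROM {table_name}")


# In a notebook (names are hypothetical):
# events = read_files_shortcut(spark, "Files/raw_events_shortcut")
# sales = read_table_shortcut(spark, "sales_shortcut")
```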

Practical patterns (what I recommend in real projects)

Pattern 1: Use Tables shortcuts for shared Delta tables that should show up consistently across Fabric engines: Spark, SQL, and Direct Lake semantic models that read from shortcuts.

Pattern 2: Use Files shortcuts when you need arbitrary formats or hierarchical layouts (CSV/JSON/images, nested partitions, etc.) and you’re fine treating it as file access.

Pattern 3: Prefer shortcuts over copying/staging when your primary goal is to eliminate edge copies and the latency that data duplication workflows introduce.

Pattern 4: When you’re operationalizing Spark notebooks, make the access path explicit and stable by using the shortcut path (the place it appears) rather than hard-coding a target path that might change.

Operational gotchas and guardrails

Because moving/renaming/deleting a target path can break a shortcut, add lightweight monitoring for “broken shortcut” failures in your pipelines (and treat them like dependency failures).

For debugging, the lakehouse UI can show the ABFS path or URL for a shortcut in its Properties pane, which you can copy for inspection or troubleshooting.

Outside of Fabric, services can access OneLake through the OneLake API, which supports a subset of ADLS Gen2 and Blob storage APIs.

Summary

Shortcuts give Spark a clean way to treat OneLake like a unified namespace: read shortcuts as folders, surface Delta/Parquet data in Tables as tables, and keep your project’s logical paths stable even when physical storage locations vary.

References

Unify data sources with OneLake shortcuts (Microsoft Learn)
Access OneLake shortcuts in an Apache Spark notebook (Microsoft Learn)
OneLake access and APIs (Microsoft Learn)

This post was written with help from ChatGPT 5.2

When ‘Native Execution Engine’ Doesn’t Stick: Debugging Fabric Environment Deployments with fabric-cicd

If you’re treating Microsoft Fabric workspaces as source-controlled assets, you’ve probably started leaning on code-first deployment tooling (either Fabric’s built-in Git integration or community tooling layered on top).

One popular option is the open-source fabric-cicd Python library, which is designed to help implement CI/CD automations for Fabric workspaces without having to interact directly with the underlying Fabric APIs.

For most Fabric items, a ‘deploy what’s in Git’ model works well—until you hit a configuration that looks like it’s in source control, appears in deployment logs, but still doesn’t land in the target workspace.

This post walks through a real example from fabric-cicd issue #776: an Environment artifact where the “Enable native execution engine” toggle does not end up enabled after deployment, even though the configuration appears present and the PATCH call returns HTTP 200.

Why this setting matters: environments are the contract for Spark compute

A Fabric environment contains a collection of configurations, including Spark compute properties, that you can attach to notebooks and Spark jobs.

That makes environments a natural CI/CD unit: you can standardize driver/executor sizing, dynamic executor allocation, and Spark properties across many workloads.

Environments are also where Fabric exposes the Native Execution Engine (NEE) toggle under Spark compute → Acceleration.

Microsoft documents that enabling NEE at the environment level causes subsequent jobs and notebooks associated with that environment to inherit the setting.

NEE reads as enabled in source, but ends up disabled in the target

In the report, the Environment’s source-controlled Sparkcompute.yml includes enable_native_execution_engine: true along with driver/executor cores and memory, dynamic executor allocation, Spark properties, and a runtime version.

The user then deploys to a downstream workspace (PPE) using fabric-cicd and expects the deployed Environment to show the Acceleration checkbox enabled.

Instead, the target Environment shows the checkbox unchecked (false), even though the deployment logs indicate that Spark settings were updated.

A key signal in the debug log: PATCH request includes the field, response omits it

The issue includes a fabric-cicd debug snippet showing a PATCH to an environments .../sparkcompute endpoint where the request body contains enableNativeExecutionEngine set to true.

However, the response body shown in the issue includes driver/executor sizing and Spark properties but does not include enableNativeExecutionEngine.

The user further validates the discrepancy by exporting/syncing the PPE workspace back to Git: the resulting Sparkcompute.yml shows enable_native_execution_engine: false.
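The request-versus-response comparison is easy to automate once you have both bodies as dictionaries. In this sketch only enableNativeExecutionEngine comes from the issue; the function name, the other field names, and the values are illustrative placeholders, not the documented API contract.

```python
def fields_not_echoed(request_body: dict, response_body: dict) -> list:
    """Return top-level request fields that are missing from the response."""
    return sorted(key for key in request_body if key not in response_body)


# Illustrative bodies; apart from enableNativeExecutionEngine, the field
# names and values here are placeholders.
request_body = {
    "enableNativeExecutionEngine": True,
    "driverCores": 8,
    "executorCores": 8,
}
response_body = {
    "driverCores": 8,
    "executorCores": 8,
}
print(fields_not_echoed(request_body, response_body))
```

A field that is sent but never echoed back is not proof that it was ignored, but it is a good reason to read the setting back before trusting the HTTP 200.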

What to do today: treat NEE as a “verify after deploy” setting

Until the underlying behavior is fixed, assume this flag can drift across environments even when other Spark compute properties deploy correctly.

Practically, that means adding a post-deploy verification step for downstream workspaces—especially if you rely on NEE for predictable performance or cost.

Checklist: a lightweight deployment guardrail

Here’s a low-friction way to catch this class of issue early (even if you don’t have an automated API read-back step yet):

Ensure the source-controlled Sparkcompute.yml includes enable_native_execution_engine: true.
Deploy with verbose/debug logging and confirm the PATCH body contains enableNativeExecutionEngine: true.
After deployment, open the target Environment → Spark compute → Acceleration and verify the checkbox state.
Optionally: export/sync the target workspace back to Git and confirm the exported Sparkcompute.yml matches your intent.

Workarounds (choose your tradeoff)

If you’re blocked, the simplest workaround is operational: enable NEE in the target environment via the UI after deployment and treat it as a manual step until the bug is resolved.

If you need full automation, a more robust approach is to add a post-deploy validation/remediation step that checks the environment setting and re-applies it if it’s not set.
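One way to structure that step is a small read-back loop. This is a sketch under stated assumptions: read_setting and apply_setting are callables you would supply yourself (for example, thin wrappers around whatever API or tooling you already use), and nothing here reflects a fabric-cicd feature.

```python
def verify_and_remediate(read_setting, apply_setting, expected=True, max_attempts=2):
    """Read a flag back and re-apply it while it differs from the expected value.

    read_setting:  hypothetical callable returning the flag's current value.
    apply_setting: hypothetical callable that tries to set the flag.
    Returns True if the flag matches `expected` at the end, otherwise False.
    """
    for _ in range(max_attempts):
        if read_setting() == expected:
            return True
        apply_setting(expected)
    return read_setting() == expected


# Example with an in-memory stand-in for the target environment
state = {"enableNativeExecutionEngine": False}

ok = verify_and_remediate(
    read_setting=lambda: state["enableNativeExecutionEngine"],
    apply_setting=lambda value: state.update(enableNativeExecutionEngine=value),
)
print(ok)
```

If the write path is the thing silently dropping the flag, re-applying through the same path will not help; the function then returns False, which is your cue to fail the pipeline loudly instead of shipping drift.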

Reporting and tracking

If you’re affected, add reproducibility details (runtime version, library version, auth mode) and any additional debug traces to issue #776 so maintainers can confirm whether the API ignores the field, expects a different contract, or requires a different endpoint/query parameter.

Even if you don’t use fabric-cicd, the pattern is broadly relevant: CI/CD is only reliable when you can round-trip configuration (write, then read-back to verify) for each control surface you’re treating as ‘source of truth.’

Closing thoughts

Native Execution Engine is positioned as a straightforward acceleration you can enable at the environment level to benefit subsequent Spark workloads.

When that toggle doesn’t deploy as expected, the pragmatic response is to verify after deploy, document the drift, and keep your CI/CD pipeline honest by validating the settings you care about—not just the HTTP status code.

References

microsoft/fabric-cicd (GitHub)
fabric-cicd Issue #776
Compute Management in Fabric Environments (Microsoft Learn)
Native execution engine for Fabric Data Engineering (Microsoft Learn)
Native Execution Engine now generally available (Microsoft Fabric Blog)

This post was written with help from ChatGPT 5.2
