AQE – Christopher Finlan

Spark performance work is mostly execution work: understanding where the DAG splits into stages, where shuffles happen, and why a handful of tasks can dominate runtime.

This post is a quick, practical refresher on the Spark execution model — with Fabric-specific pointers on where to observe jobs, stages, and tasks.

1) The execution hierarchy: Application → Job → Stage → Task

In Spark, your code runs as a Spark application. When you run an action (for example, count(), collect(), or writing a table), Spark submits a job. Each job is broken into stages, and each stage runs a set of tasks (often one task per partition).

A useful mental model:

Tasks are the unit of parallel work.
Stages group tasks that can run without waiting on a shuffle: within a stage, each task processes its own partition independently.
Stage boundaries often show up where a shuffle is required (wide dependencies like joins and aggregations).
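
To make the hierarchy concrete, here is a toy model in plain Python (not Spark itself) that counts stages and tasks for a linear chain of operations. The `WIDE_OPS` set and the function name are illustrative assumptions; real Spark derives stage boundaries from the physical plan, may skip stages, and can use a different partition count in each stage.

```python
# Toy model only: Spark's real scheduler works from the physical plan.
WIDE_OPS = {"groupBy", "distinct", "repartition"}  # illustrative, not exhaustive


def count_stages_and_tasks(ops, input_partitions, shuffle_partitions=200):
    """Return (stages, tasks) for a linear chain of operations.

    Every wide operation closes the current stage and opens a new one
    that reads shuffled data; each stage runs one task per partition.
    """
    stages = 1
    tasks = input_partitions  # first stage: one task per input partition
    for op in ops:
        if op in WIDE_OPS:
            stages += 1
            tasks += shuffle_partitions  # the post-shuffle stage
    return stages, tasks


stages, tasks = count_stages_and_tasks(
    ["filter", "groupBy", "select"], input_partitions=8
)
print(stages, tasks)
```

The narrow operations (filter, select) add no stages; only the groupBy introduces a second stage, and that stage's task count comes from the shuffle partition setting rather than from the input.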

2) Lazy evaluation: why “nothing happens” until an action

Most DataFrame / Spark SQL transformations are lazy. Spark builds a plan and only executes when an action forces it.

Example (PySpark):

from pyspark.sql.functions import col

df = spark.read.table("fact_sales")

# Transformations (lazy)
filtered = df.filter(col("sale_date") >= "2026-01-01")

# Action (executes)
print(filtered.count())

This matters in Fabric notebooks because a single cell can trigger multiple jobs (for example, one job to materialize a cache and another to write output).
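
If the lazy model feels abstract, a tiny plain-Python stand-in can help. The `LazyRows` class below is a hypothetical toy, not a Spark API: it only records filter steps and applies them when an action (`count`) is called, which is the same shape as a DataFrame plan.

```python
class LazyRows:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Transformation: extend the plan, touch no data.
        return LazyRows(self._rows, self._steps + (predicate,))

    def count(self):
        # Action: apply every recorded step now.
        return sum(1 for row in self._rows if all(p(row) for p in self._steps))


sales = LazyRows([{"sale_date": "2025-12-31"}, {"sale_date": "2026-01-02"}])
recent = sales.filter(lambda r: r["sale_date"] >= "2026-01-01")  # nothing runs yet
print(recent.count())  # the filter is evaluated here
```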

3) Shuffles: the moment your DAG turns expensive

A shuffle is when data must be redistributed across executors (typically by key). Shuffles introduce:

network transfer
disk I/O (shuffle files)
spill risk (memory pressure)
skew/stragglers (a few hot partitions dominate)

If you’re diagnosing a slow pipeline, assume a shuffle is the culprit until proven otherwise.
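
A rough sketch of why redistribution by key produces skew: every row with the same key lands in the same partition, so one hot key makes one partition (and one task) much larger than the rest. This is plain Python with a CRC32 hash chosen only for reproducibility; Spark uses its own hash function, and the customer keys are made-up data.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    # Stable hash so the example is reproducible; Spark's hash differs.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rows_per_partition(keys, num_partitions):
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical data: one hot customer plus 100 ordinary customers.
keys = ["customer_hot"] * 900 + [f"customer_{i}" for i in range(100)]
sizes = rows_per_partition(keys, num_partitions=8)
print(sizes)  # one partition holds at least the 900 hot rows
```

Adding more partitions does not help here: the 900 hot rows still hash to a single partition.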

4) What to check in Fabric: jobs, stages, tasks

Fabric gives you multiple ways to see execution progress:

Notebook contextual monitoring: a progress indicator for notebook cells, with stage/task progress.
Spark monitoring / detail monitoring: drill into a Spark application and see jobs, stages, tasks, and duration breakdowns.

When looking at a slow run, focus on:

stages with large shuffle read/write
long-tail tasks (stragglers)
spill metrics (memory → disk)
skew indicators (a few tasks far slower than the median)
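
One way to turn the "few tasks far slower than the median" check into a number is to compare the slowest task with the median task for a stage. The helper below is a sketch in plain Python; the function name and the 3x threshold are assumptions for illustration, and the durations are invented rather than taken from a real run.

```python
from statistics import median


def straggler_report(task_seconds, ratio_threshold=3.0):
    """Summarize task durations for one stage (expects a non-empty list)."""
    med = median(task_seconds)
    slowest = max(task_seconds)
    ratio = slowest / med if med > 0 else float("inf")
    return {
        "median_s": med,
        "max_s": slowest,
        "max_over_median": ratio,
        "skew_suspected": ratio >= ratio_threshold,  # threshold is a guess; tune it
    }


# Invented durations (seconds), as you might copy them from a stage's task list.
durations = [12, 11, 13, 12, 14, 95]
report = straggler_report(durations)
print(report)
```

A high ratio points at skew or a straggler; a low ratio with a long stage suggests every task is uniformly slow, which is a different problem.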

5) A repeatable debugging workflow (that scales)

Start with the plan
- df.explain(True) for DataFrame/Spark SQL
- Look for Exchange operators (shuffle) and join strategies (broadcast vs shuffle join)
Run once, then open monitoring
- Identify the longest stage(s)
- Confirm whether it’s CPU-bound, shuffle-bound, or spill-bound
Apply the common fixes in order
- Avoid the shuffle (broadcast small dims)
- Reduce shuffle volume (filter early, select only needed columns)
- Fix partitioning (repartition by join keys; avoid extreme partition counts)
- Turn on AQE (spark.sql.adaptive.enabled=true) to let Spark coalesce shuffle partitions and mitigate skew
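
The "start with the plan" step can be partly mechanized by scanning the text of a physical plan for the operators mentioned above. This is a minimal sketch: `summarize_plan` is a hypothetical helper, and `sample_plan` is a hand-written approximation of what a plan looks like, not captured output from a real job. The operator names it searches for (Exchange, BroadcastHashJoin, SortMergeJoin, AdaptiveSparkPlan) are ones Spark uses in physical plans.

```python
import re


def summarize_plan(plan_text):
    """Count shuffle and join operators in physical plan text."""
    return {
        # \b keeps BroadcastExchange and ReusedExchange out of the shuffle count.
        "shuffle_exchanges": len(re.findall(r"\bExchange\b", plan_text)),
        "broadcast_joins": len(re.findall(r"\bBroadcastHashJoin\b", plan_text)),
        "sort_merge_joins": len(re.findall(r"\bSortMergeJoin\b", plan_text)),
        "adaptive": "AdaptiveSparkPlan" in plan_text,
    }


# Hand-written illustration of a plan's shape, not real output.
sample_plan = """
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [product_id#1], [product_id#7], Inner
   :- Sort [product_id#1 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(product_id#1, 200)
   :     +- FileScan parquet fact_sales
   +- Sort [product_id#7 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(product_id#7, 200)
         +- FileScan parquet dim_product
"""

print(summarize_plan(sample_plan))
```

Two shuffle exchanges feeding a sort-merge join is the pattern to question first: if one side is small, it is a candidate for a broadcast.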

Quick checklist

Do I know which stage is dominating runtime?
Is there an Exchange / shuffle boundary causing it?
Are a few tasks straggling (skew), or are all tasks uniformly slow?
Am I broadcasting what should be broadcast?
Is AQE enabled, and is it actually taking effect?

This post was written with help from ChatGPT 5.2

Shuffles are where Spark jobs go to get expensive: a wide join or aggregation forces data to move across the network, materialize shuffle files, and often spill when memory pressure spikes.

In Microsoft Fabric Spark workloads, the fastest optimization is usually the boring one: avoid the shuffle when you can, and when you can’t, make it smaller and better balanced.

This post lays out a practical, repeatable approach you can apply in Fabric notebooks and Spark job definitions.

1) Start with the simplest win: avoid the shuffle

If one side of your join is genuinely small (think lookup/dimension tables), use a broadcast join so Spark ships the small table to executors and avoids a full shuffle.

In Fabric’s Spark best practices, Microsoft explicitly calls out broadcast joins for small lookup tables as a way to avoid shuffles entirely.

Example (PySpark):

from pyspark.sql.functions import broadcast

fact = spark.read.table("fact_sales")
dim  = spark.read.table("dim_product")

# If dim_product is small enough, broadcast it
joined = fact.join(broadcast(dim), on="product_id", how="left")
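
To see why this avoids the shuffle, here is a plain-Python sketch of the idea behind a broadcast hash join. The function and sample rows are hypothetical; Spark's real implementation differs, and for a left join it fills unmatched columns with nulls rather than omitting them as this toy does.

```python
def broadcast_hash_join(fact_partitions, dim_rows, key):
    """Left-join each fact partition against a small, fully copied lookup."""
    # Build the lookup once; in Spark, a copy goes to every executor.
    lookup = {row[key]: row for row in dim_rows}
    joined_partitions = []
    for partition in fact_partitions:
        # Each partition is joined where it already lives: no repartitioning by key.
        joined = [{**row, **lookup.get(row[key], {})} for row in partition]
        joined_partitions.append(joined)
    return joined_partitions


# Made-up rows: two fact partitions and a two-row dimension.
fact_partitions = [
    [{"product_id": 1, "qty": 2}, {"product_id": 2, "qty": 1}],
    [{"product_id": 1, "qty": 5}, {"product_id": 3, "qty": 4}],
]
dim_rows = [
    {"product_id": 1, "name": "widget"},
    {"product_id": 2, "name": "gadget"},
]
result = broadcast_hash_join(fact_partitions, dim_rows, key="product_id")
print(result)
```

The fact rows never move between partitions; only the small table is copied. That trade only pays off while the small side comfortably fits in executor memory.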

If you can’t broadcast safely, move to the next lever.

2) Make the shuffle less painful: tune shuffle parallelism

Spark controls the number of shuffle partitions for joins and aggregations with spark.sql.shuffle.partitions (default: 200 in Spark SQL).

Too few partitions → huge partitions → long tasks, spills, and stragglers.
Too many partitions → tiny tasks → scheduling overhead and excess shuffle metadata.

Example (session-level setting):

spark.conf.set("spark.sql.shuffle.partitions", "400")

A decent heuristic is to start with something proportional to total executor cores and then iterate using the Spark UI (watch stage task durations, shuffle read/write sizes, and spill metrics).
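
The heuristic can be written down as a small starting-point calculator. Everything here is an assumption to iterate on, not a recommendation from Spark or Fabric: the function name, the default of 2 tasks per core, and the 128 MiB target partition size are placeholders.

```python
import math


def suggest_shuffle_partitions(
    total_executor_cores,
    shuffle_bytes=None,
    tasks_per_core=2,                          # assumed multiplier
    target_partition_bytes=128 * 1024 * 1024,  # assumed target: 128 MiB
):
    """Return a starting value for spark.sql.shuffle.partitions."""
    by_cores = total_executor_cores * tasks_per_core
    if shuffle_bytes is None:
        return by_cores
    # Keep partitions near the target size when the shuffle volume is known.
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(by_cores, by_size)


print(suggest_shuffle_partitions(32))
print(suggest_shuffle_partitions(32, shuffle_bytes=50 * 1024**3))
```

Treat the result as a first guess, pass it to spark.conf.set("spark.sql.shuffle.partitions", ...) as a string, and then adjust based on the task durations and spill you observe.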

3) Let Spark fix itself (when it can): enable AQE

Adaptive Query Execution (AQE) uses runtime statistics to optimize a query as it runs.

Fabric’s Spark best practices recommend enabling AQE to dynamically optimize shuffle partitions and handle skewed data automatically.

AQE is particularly helpful when:

Your input data distribution changes day-to-day
A static spark.sql.shuffle.partitions value is right for some workloads but wrong for others
You hit skew where a small number of partitions do most of the work

Example:

spark.conf.set("spark.sql.adaptive.enabled", "true")
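
AQE's partition coalescing and skew-join handling have their own switches alongside the main one. The sketch below sets all three and reads them back; the setting names are standard Spark SQL configuration keys, while `AQE_SETTINGS` and `apply_aqe_settings` are hypothetical names introduced here. It assumes a session object that exposes `conf.set` and `conf.get`, as the `spark` session in a notebook does.

```python
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_aqe_settings(spark_session, settings=AQE_SETTINGS):
    """Set each AQE option, then return what the session reports back."""
    for key, value in settings.items():
        spark_session.conf.set(key, value)
    return {key: spark_session.conf.get(key) for key in settings}
```

In a notebook you would call apply_aqe_settings(spark) and inspect the returned values. In open-source Apache Spark 3.2 and later, spark.sql.adaptive.enabled is already on by default, so it is worth checking what your runtime reports before assuming a change is needed.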

4) Diagnose like you mean it: what to look for in Spark UI

When a job is slow, treat it like a shuffle problem until proven otherwise.

Look for:

Stages where a handful of tasks take dramatically longer than the median (classic skew)
Large shuffle read/write sizes concentrated in a small number of partitions
Spill (memory → disk) spikes during joins/aggregations

When you see skew, your options are usually:

Broadcast (if feasible)
Repartition on a better key
Salt hot keys (advanced)
Enable AQE and confirm it’s actually taking effect
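
Salting is the least obvious of these options, so here is the key-level idea in plain Python. The function names and the sample keys are hypothetical, and the salt is assigned round-robin by row position to keep the example deterministic; in Spark you would typically derive it from a random or hashed column instead.

```python
from collections import Counter


def salt_fact_keys(fact_keys, hot_keys, num_salts):
    """Spread each hot key across num_salts sub-keys on the large side."""
    salted = []
    for i, key in enumerate(fact_keys):
        salt = i % num_salts if key in hot_keys else 0
        salted.append((key, salt))
    return salted


def replicate_dim_keys(dim_keys, hot_keys, num_salts):
    """Copy each hot key once per salt on the small side so joins still match."""
    replicated = []
    for key in dim_keys:
        salts = range(num_salts) if key in hot_keys else range(1)
        replicated.extend((key, s) for s in salts)
    return replicated


fact_keys = ["A"] * 8 + ["B", "C"]  # "A" is the hot key
salted = salt_fact_keys(fact_keys, hot_keys={"A"}, num_salts=4)
dim = replicate_dim_keys(["A", "B", "C"], hot_keys={"A"}, num_salts=4)

largest_group = max(Counter(salted).values())
print(largest_group, len(dim))
```

The largest group of identical join keys shrinks from 8 rows to 2, at the cost of growing the small side from 3 keys to 6. That replication cost is why salting is usually reserved for a few known hot keys.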

A minimal checklist for Fabric Spark teams

Use DataFrame APIs (keep Catalyst in play).
Broadcast small lookup tables to avoid shuffles.
Set a sane baseline for spark.sql.shuffle.partitions.
Enable AQE and validate in the query plan / UI.
Iterate with the Spark UI: measure, change one thing, re-measure.

This post was written with help from ChatGPT 5.2

Tag: AQE

Understanding Spark Execution in Microsoft Fabric

Fabric Spark Shuffle Tuning: AQE + partitions for Faster Joins
