Monitoring Spark Jobs in Real Time in Microsoft Fabric

If Spark performance work is surgery, monitoring is your live telemetry.

Microsoft Fabric gives you multiple monitoring entry points for Spark workloads: the Monitoring hub for cross-item visibility, an item's Recent runs list for focused context, and the Spark application detail pages for deep investigation. This post is a practical playbook for using them together.

Why this matters

When a notebook or Spark job definition slows down, “run it again” is the most expensive way to debug. Real-time monitoring helps you:

  • spot bottlenecks while jobs are still running
  • isolate failures quickly
  • compare behavior across submitters and workspaces

1) Start at the Monitoring hub for cross-workspace triage

Use the Monitoring hub in the Fabric navigation pane as your control tower.

  1. Filter by item type (Notebook, Spark job definition, Pipeline)
  2. Narrow by start time and workspace
  3. Sort by duration or status to surface outliers

For broad triage, this is faster than jumping directly into individual notebooks.

2) Pivot to Spark application details for root-cause analysis

Once you identify a problematic run, open the Spark application detail page and work through tabs in order:

  • Jobs: status, stages, tasks, duration, and processed/read/written data
  • Resources: executor allocation and utilization in near real time
  • Logs: inspect Livy, Prelaunch, and Driver logs; download when needed
  • Item snapshots: confirm exactly what code/parameters/settings were used at execution time

This sequence prevents false fixes where you tune the wrong layer.

3) Use notebook contextual monitoring while developing

For iterative tuning, notebook contextual monitoring keeps authoring, execution, and debugging in one place.

  1. Run a target cell/workload
  2. Watch job/stage/task progress and executor behavior
  3. Jump to Spark UI or detail monitoring for deeper traces
  4. Adjust code or config and rerun

4) A lightweight real-time runbook

  • Confirm scope in the Monitoring hub (single run or systemic pattern)
  • Open application details for the failing/slower run
  • Check Jobs for stage/task imbalance and long-running segments
  • Check Resources for executor pressure
  • Check Logs for explicit failure signals
  • Verify snapshots so you debug the exact submitted artifact

Common mistakes to avoid

  • Debugging from memory instead of snapshots
  • Looking only at notebook cell output and skipping Logs/Resources
  • Treating one anomalous run as a global trend without Monitoring hub filtering

References

This post was written with help from ChatGPT 5.3

Understanding Spark Execution in Microsoft Fabric

Spark performance work is mostly execution work: understanding where the DAG splits into stages, where shuffles happen, and why a handful of tasks can dominate runtime.

This post is a quick, practical refresher on the Spark execution model — with Fabric-specific pointers on where to observe jobs, stages, and tasks.

1) The execution hierarchy: Application → Job → Stage → Task

In Spark, your code runs as a Spark application. When you run an action (for example, count(), collect(), or writing a table), Spark submits a job. Each job is broken into stages, and each stage runs a set of tasks (often one task per partition).

A useful mental model:

  • Tasks are the unit of parallel work.
  • Stages group tasks that can run together without needing data from another stage.
  • Stage boundaries often show up where a shuffle is required (wide dependencies like joins and aggregations).

2) Lazy evaluation: why “nothing happens” until an action

Most DataFrame / Spark SQL transformations are lazy. Spark builds a plan and only executes when an action forces it.

Example (PySpark):

from pyspark.sql.functions import col

df = spark.read.table("fact_sales")
# Transformations (lazy)
filtered = df.filter(col("sale_date") >= "2026-01-01")

# Action (executes)
print(filtered.count())


This matters in Fabric notebooks because a single cell can trigger multiple jobs (for example, one job to materialize a cache and another to write output).
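To see why laziness matters, here is a toy, Spark-free sketch of the same idea using Python generators: "transformations" only build a pipeline, and nothing executes until an "action" consumes it. All names here are illustrative stand-ins, not Spark APIs.

```python
def read_rows():
    # Stand-in for spark.read.table(...): yields rows lazily.
    for day, amount in [("2025-12-30", 10), ("2026-01-02", 25), ("2026-01-05", 40)]:
        yield {"sale_date": day, "amount": amount}

def filter_rows(rows, predicate):
    # Stand-in for df.filter(...): still lazy, no row has been read yet.
    return (r for r in rows if predicate(r))

pipeline = filter_rows(read_rows(), lambda r: r["sale_date"] >= "2026-01-01")

# Only now, at the "action", does any work actually happen.
result = sum(1 for _ in pipeline)
print(result)  # 2 rows pass the filter
```

The same shape explains the Fabric behavior above: building `pipeline` is free, and each "action" you run against it triggers its own execution.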

3) Shuffles: the moment your DAG turns expensive

A shuffle is when data must be redistributed across executors (typically by key). Shuffles introduce:

  • network transfer
  • disk I/O (shuffle files)
  • spill risk (memory pressure)
  • skew/stragglers (a few hot partitions dominate)

If you’re diagnosing a slow pipeline, assume a shuffle is the culprit until proven otherwise.
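A toy model (no Spark involved) makes the mechanics concrete: a shuffle reassigns every row to a target partition based on a hash of its key, which is why equal keys end up together and why data has to move between executors. The function below is purely illustrative; `hash()` stands in for Spark's partitioner.

```python
from collections import defaultdict

def shuffle_by_key(partitions, num_target_partitions):
    """Redistribute (key, value) rows so rows with equal keys land together."""
    target = defaultdict(list)
    for part in partitions:
        for key, value in part:
            # hash() stands in for Spark's hash partitioner.
            target[hash(key) % num_target_partitions].append((key, value))
    return [target[i] for i in range(num_target_partitions)]

before = [
    [("a", 1), ("b", 2)],   # partition 0, on one executor
    [("a", 3), ("c", 4)],   # partition 1, on another executor
]
after = shuffle_by_key(before, 2)
# All rows with key "a" are now in the same target partition, so a
# groupBy or join on that key can proceed locally afterward.
```

Every row crossing a partition boundary in this sketch corresponds to network transfer and shuffle-file I/O in a real cluster, and a hot key maps directly to the skewed, dominating partition described above.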

4) What to check in Fabric: jobs, stages, tasks

Fabric gives you multiple ways to see execution progress:

  • Notebook contextual monitoring: a progress indicator for notebook cells, with stage/task progress.
  • Spark monitoring / detail monitoring: drill into a Spark application and see jobs, stages, tasks, and duration breakdowns.

When looking at a slow run, focus on:

  • stages with large shuffle read/write
  • long-tail tasks (stragglers)
  • spill metrics (memory → disk)
  • skew indicators (a few tasks far slower than the median)
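The last two checks can be done by eye, but a small, hypothetical helper makes the rule of thumb explicit: given task durations for one stage (for example, copied from the task table in detail monitoring), flag tasks that run far longer than the median. The function name and the 3x factor are illustrative choices, not anything Fabric provides.

```python
import statistics

def find_stragglers(task_durations_sec, factor=3.0):
    """Return (median, stragglers) where stragglers run > factor * median."""
    median = statistics.median(task_durations_sec)
    stragglers = [d for d in task_durations_sec if d > factor * median]
    return median, stragglers

# 8 tasks: mostly uniform, but two hot partitions dominate the stage.
durations = [12, 11, 13, 12, 14, 12, 95, 110]
median, stragglers = find_stragglers(durations)
# A few long-tail tasks over a normal median is the classic skew signature;
# uniformly slow tasks instead suggest CPU pressure or under-provisioning.
```

Here the median is 12.5 seconds and two tasks blow past the 3x threshold, which is exactly the "few tasks far slower than the median" pattern to look for.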

5) A repeatable debugging workflow (that scales)

  1. Start with the plan: df.explain(True) for DataFrame/Spark SQL
    • Look for Exchange operators (shuffle) and join strategies (broadcast vs shuffle join)
  2. Run once, then open monitoring: identify the longest stage(s)
    • Confirm whether it’s CPU-bound, shuffle-bound, or spill-bound
  3. Apply the common fixes in order:
    • Avoid the shuffle (broadcast small dimension tables)
    • Reduce shuffle volume (filter early, select only needed columns)
    • Fix partitioning (repartition by join keys; avoid extreme partition counts)
    • Turn on AQE (spark.sql.adaptive.enabled=true) to let Spark coalesce shuffle partitions and mitigate skew
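For the AQE step, the relevant settings are standard Spark SQL properties; a minimal session-level fragment might look like the following (in a notebook you can apply each with spark.conf.set("<property>", "true")):

```
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.coalescePartitions.enabled  true
spark.sql.adaptive.skewJoin.enabled            true
```

After enabling these, rerun and confirm in monitoring that shuffle partition counts actually changed; AQE that is "on" but has no effect (check item 5 in the checklist below) usually means the plan never reached an adaptive boundary.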

Quick checklist

  • Do I know which stage is dominating runtime?
  • Is there an Exchange / shuffle boundary causing it?
  • Are a few tasks straggling (skew), or are all tasks uniformly slow?
  • Am I broadcasting what should be broadcast?
  • Is AQE enabled, and is it actually taking effect?

References

This post was written with help from ChatGPT 5.2