Monitoring Spark Jobs in Real Time in Microsoft Fabric

If Spark performance work is surgery, monitoring is your live telemetry.

Microsoft Fabric gives you multiple monitoring entry points for Spark workloads: the Monitoring hub for cross-item visibility, an item's Recent runs list for focused context, and the Spark application detail pages for deep investigation. This post is a practical playbook for using them together.

Why this matters

When a notebook or Spark job definition slows down, “run it again” is the most expensive way to debug. Real-time monitoring helps you:

  • spot bottlenecks while jobs are still running
  • isolate failures quickly
  • compare behavior across submitters and workspaces

1) Start at the Monitoring hub for cross-workspace triage

Use the Monitoring hub in the Fabric navigation pane as your control tower.

  1. Filter by item type (Notebook, Spark job definition, Pipeline)
  2. Narrow by start time and workspace
  3. Sort by duration or status to surface outliers

For broad triage, this is faster than jumping directly into individual notebooks.

2) Pivot to Spark application details for root-cause analysis

Once you identify a problematic run, open the Spark application detail page and work through tabs in order:

  • Jobs: status, stages, tasks, duration, and processed/read/written data
  • Resources: executor allocation and utilization in near real time
  • Logs: inspect Livy, Prelaunch, and Driver logs; download when needed
  • Item snapshots: confirm exactly what code/parameters/settings were used at execution time

This sequence prevents false fixes where you tune the wrong layer.

3) Use notebook contextual monitoring while developing

For iterative tuning, notebook contextual monitoring keeps authoring, execution, and debugging in one place.

  1. Run a target cell/workload
  2. Watch job/stage/task progress and executor behavior
  3. Jump to Spark UI or detail monitoring for deeper traces
  4. Adjust code or config and rerun

4) A lightweight real-time runbook

  • Confirm scope in the Monitoring hub (single run or systemic pattern)
  • Open application details for the failing/slower run
  • Check Jobs for stage/task imbalance and long-running segments
  • Check Resources for executor pressure
  • Check Logs for explicit failure signals
  • Verify snapshots so you debug the exact submitted artifact

Common mistakes to avoid

  • Debugging from memory instead of snapshots
  • Looking only at notebook cell output and skipping Logs/Resources
  • Treating one anomalous run as a global trend without Monitoring hub filtering

References

This post was written with help from ChatGPT 5.3

Understanding Spark Execution in Microsoft Fabric

Spark performance work is mostly execution work: understanding where the DAG splits into stages, where shuffles happen, and why a handful of tasks can dominate runtime.

This post is a quick, practical refresher on the Spark execution model — with Fabric-specific pointers on where to observe jobs, stages, and tasks.

1) The execution hierarchy: Application → Job → Stage → Task

In Spark, your code runs as a Spark application. When you run an action (for example, count(), collect(), or writing a table), Spark submits a job. Each job is broken into stages, and each stage runs a set of tasks (often one task per partition).

A useful mental model:

  • Tasks are the unit of parallel work.
  • Stages group tasks that can run together without needing data from another stage.
  • Stage boundaries often show up where a shuffle is required (wide dependencies like joins and aggregations).

2) Lazy evaluation: why “nothing happens” until an action

Most DataFrame / Spark SQL transformations are lazy. Spark builds a plan and only executes when an action forces it.

Example (PySpark):

from pyspark.sql.functions import col

df = spark.read.table("fact_sales")
# Transformations (lazy)
filtered = df.filter(col("sale_date") >= "2026-01-01")

# Action (executes)
print(filtered.count())


This matters in Fabric notebooks because a single cell can trigger multiple jobs (for example, one job to materialize a cache and another to write output).
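To see why laziness matters, here is a toy, Spark-free sketch of the same idea using Python generators: "transformations" only build a pipeline, and nothing executes until an "action" consumes it. All names here are illustrative stand-ins, not Spark APIs.

```python
def read_rows():
    # Stand-in for spark.read.table(...): yields rows lazily.
    for day, amount in [("2025-12-30", 10), ("2026-01-02", 25), ("2026-01-05", 40)]:
        yield {"sale_date": day, "amount": amount}

def filter_rows(rows, predicate):
    # Stand-in for df.filter(...): still lazy, no row has been read yet.
    return (r for r in rows if predicate(r))

pipeline = filter_rows(read_rows(), lambda r: r["sale_date"] >= "2026-01-01")

# Only now, at the "action", does any work actually happen.
result = sum(1 for _ in pipeline)
print(result)  # 2 rows pass the filter
```

The same shape explains the Fabric behavior above: building `pipeline` is free, and each "action" you run against it triggers its own execution.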

3) Shuffles: the moment your DAG turns expensive

A shuffle is when data must be redistributed across executors (typically by key). Shuffles introduce:

  • network transfer
  • disk I/O (shuffle files)
  • spill risk (memory pressure)
  • skew/stragglers (a few hot partitions dominate)

If you’re diagnosing a slow pipeline, assume a shuffle is the culprit until proven otherwise.
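A toy model (no Spark involved) makes the mechanics concrete: a shuffle reassigns every row to a target partition based on a hash of its key, which is why equal keys end up together and why data has to move between executors. The function below is purely illustrative; `hash()` stands in for Spark's partitioner.

```python
from collections import defaultdict

def shuffle_by_key(partitions, num_target_partitions):
    """Redistribute (key, value) rows so rows with equal keys land together."""
    target = defaultdict(list)
    for part in partitions:
        for key, value in part:
            # hash() stands in for Spark's hash partitioner.
            target[hash(key) % num_target_partitions].append((key, value))
    return [target[i] for i in range(num_target_partitions)]

before = [
    [("a", 1), ("b", 2)],   # partition 0, on one executor
    [("a", 3), ("c", 4)],   # partition 1, on another executor
]
after = shuffle_by_key(before, 2)
# All rows with key "a" are now in the same target partition, so a
# groupBy or join on that key can proceed locally afterward.
```

Every row crossing a partition boundary in this sketch corresponds to network transfer and shuffle-file I/O in a real cluster, and a hot key maps directly to the skewed, dominating partition described above.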

4) What to check in Fabric: jobs, stages, tasks

Fabric gives you multiple ways to see execution progress:

  • Notebook contextual monitoring: a progress indicator for notebook cells, with stage/task progress.
  • Spark monitoring / detail monitoring: drill into a Spark application and see jobs, stages, tasks, and duration breakdowns.

When looking at a slow run, focus on:

  • stages with large shuffle read/write
  • long-tail tasks (stragglers)
  • spill metrics (memory → disk)
  • skew indicators (a few tasks far slower than the median)
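The last two checks can be done by eye, but a small, hypothetical helper makes the rule of thumb explicit: given task durations for one stage (for example, copied from the task table in detail monitoring), flag tasks that run far longer than the median. The function name and the 3x factor are illustrative choices, not anything Fabric provides.

```python
import statistics

def find_stragglers(task_durations_sec, factor=3.0):
    """Return (median, stragglers) where stragglers run > factor * median."""
    median = statistics.median(task_durations_sec)
    stragglers = [d for d in task_durations_sec if d > factor * median]
    return median, stragglers

# 8 tasks: mostly uniform, but two hot partitions dominate the stage.
durations = [12, 11, 13, 12, 14, 12, 95, 110]
median, stragglers = find_stragglers(durations)
# A few long-tail tasks over a normal median is the classic skew signature;
# uniformly slow tasks instead suggest CPU pressure or under-provisioning.
```

Here the median is 12.5 seconds and two tasks blow past the 3x threshold, which is exactly the "few tasks far slower than the median" pattern to look for.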

5) A repeatable debugging workflow (that scales)

  1. Start with the plan: df.explain(True) for DataFrame/Spark SQL
    • Look for Exchange operators (shuffle) and join strategies (broadcast vs shuffle join)
  2. Run once, then open monitoring: identify the longest stage(s)
    • Confirm whether it’s CPU-bound, shuffle-bound, or spill-bound
  3. Apply the common fixes in order:
    • Avoid the shuffle (broadcast small dimension tables)
    • Reduce shuffle volume (filter early, select only needed columns)
    • Fix partitioning (repartition by join keys; avoid extreme partition counts)
    • Turn on AQE (spark.sql.adaptive.enabled=true) to let Spark coalesce shuffle partitions and mitigate skew
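For the AQE step, the relevant settings are standard Spark SQL properties; a minimal session-level fragment might look like the following (in a notebook you can apply each with spark.conf.set("<property>", "true")):

```
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.coalescePartitions.enabled  true
spark.sql.adaptive.skewJoin.enabled            true
```

After enabling these, rerun and confirm in monitoring that shuffle partition counts actually changed; AQE that is "on" but has no effect (check item 5 in the checklist below) usually means the plan never reached an adaptive boundary.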

Quick checklist

  • Do I know which stage is dominating runtime?
  • Is there an Exchange / shuffle boundary causing it?
  • Are a few tasks straggling (skew), or are all tasks uniformly slow?
  • Am I broadcasting what should be broadcast?
  • Is AQE enabled, and is it actually taking effect?

References

This post was written with help from ChatGPT 5.2