
Microsoft Fabric makes it incredibly easy to spin up Spark workloads: notebooks, Lakehouse pipelines, dataflows, SQL + Spark hybrid architectures—the whole buffet.
What’s still hard?
Knowing why a given Spark job is slow, expensive, or flaky.
- A Lakehouse pipeline starts timing out.
- A notebook that used to finish in 5 minutes is now taking 25.
- Costs spike because one model training job is shuffling half the lake.
You open the Spark UI, click around a few stages, stare at shuffle graphs, and say the traditional words of Spark debugging:
“Huh.”
This is where an AI assistant should exist.
In this post, we’ll walk through how to build exactly that for Fabric Spark: a Job Doctor that:
- Reads Spark telemetry from your Fabric environment
- Detects issues like skew, large shuffles, spill, and bad configuration
- Uses a large language model (LLM) to explain what went wrong
- Produces copy-pasteable fixes in Fabric notebooks / pipelines
- Runs inside Fabric using Lakehouses, notebooks, and Azure AI models
This is not a fake product announcement. This is a blueprint you can actually build.
What Is the Fabric “Job Doctor”?
At a high level, the Job Doctor is:
A Fabric-native analytics + AI layer that continuously reads Spark job history, detects common performance anti-patterns, and generates human-readable, prescriptive recommendations.
Concretely, it does three main things:
- Collects Spark job telemetry from Fabric
- Spark application metrics (tasks, stages, shuffles, spills)
- Spark logs & events (Driver/Executor/Event logs)
- Optional query plans
- Spark session configs
- Analyzes jobs using rules + metrics
- Identifies skew, large shuffles, spill, etc.
- Scores each job run and surfaces the top issues.
- Uses an LLM to generate a “diagnosis sheet”
- Root cause in plain English
- Fixes with code + config snippets for Fabric Spark
- Expected impact on performance/cost
Let’s build it step by step, Fabric-style.
Part 1: Getting Spark Telemetry Out of Fabric
Before you can diagnose anything, you need the raw signals. In Fabric, there are three main ways to see what Spark is doing:
- Fabric Apache Spark diagnostic emitter → logs/metrics for each application
- Spark application details (UI / REST)
- In-job logging from notebooks/pipelines (e.g., configs, query plans)
You don’t have to use all three, but combining them gives you enough for a very capable Job Doctor.
1. Configure the Fabric Apache Spark Diagnostic Emitter
The core telemetry pipeline starts with the Fabric Apache Spark diagnostic emitter, configured on a Fabric environment.
At a high level, you:
- Create or use an environment for your Spark workloads.
- Configure one or more diagnostic emitters on that environment.
- Point each emitter to a sink such as:
- Azure Storage (Blob, ADLS)
- Azure Log Analytics
- Azure Event Hubs
For example, an emitter to Azure Storage might be configured (conceptually) like this:
spark.synapse.diagnostic.emitters: MyStorageEmitter
spark.synapse.diagnostic.emitter.MyStorageEmitter.type: AzureStorage
spark.synapse.diagnostic.emitter.MyStorageEmitter.categories: DriverLog,ExecutorLog,EventLog,Metrics
spark.synapse.diagnostic.emitter.MyStorageEmitter.uri: https://<account>.blob.core.windows.net/<container>/<folder>
spark.synapse.diagnostic.emitter.MyStorageEmitter.auth: AccessKey
spark.synapse.diagnostic.emitter.MyStorageEmitter.secret: <storage-access-key>
Once this is in place:
- Every Spark application (notebook, job, pipeline activity that spins up Spark) will emit diagnostic records.
- Those records land as JSON lines describing driver logs, executor logs, Spark listener events, and metrics.
From there, you can:
- If using Storage: Create a shortcut in a Lakehouse pointing at the container/folder.
- If using Log Analytics: Build KQL queries or export into Fabric (e.g., into a KQL DB or as files you later hydrate into a Lakehouse).
We’ll assume the storage pattern for the rest of this post:
Spark app → Fabric environment with diagnostic emitter → Azure Storage → OneLake shortcut → Lakehouse.
2. Shape of the Raw Logs (and Why You’ll Normalize Them)
The emitter doesn’t give you a nice stageId / taskId table out of the box. Instead, you’ll see records like:
{
"timestamp": "2024-05-01T12:34:56Z",
"category": "Metrics",
"applicationId": "app-20240501123456-0001",
"properties": {
"metricName": "executorRunTime",
"stageId": 4,
"taskId": 123,
"value": 9182,
"otherFields": "..."
}
}
Or an EventLog record whose payload mirrors the corresponding Spark listener event.
To build a Job Doctor, you’ll:
- Read the JSON lines into Fabric Spark
- Explode / parse the properties payload
- Aggregate per-task metrics into per-stage metrics for each application
We’ll skip the exact parsing details (they depend on how you set up the emitter and which events/metrics you enable) and assume that after a normalization job, you have a table with one row per (applicationId, stageId, taskId).
That’s what the next sections use.
3. Capturing Query Plans in Fabric (Optional, but Powerful)
Spark query plans are gold when you’re trying to answer why a stage created a huge shuffle or why a broadcast join didn’t happen.
There isn’t yet a first-class “export query plan as JSON” API in PySpark, but in Fabric notebooks you can use a (semi-internal) trick that works today:
import json
df = ... # some DataFrame you care about
# Advanced / internal: relies on the private _jdf handle, not a public, stable API
plan_json = json.loads(df._jdf.queryExecution().optimizedPlan().toJSON())
You can also log the human-readable plan:
df.explain(mode="formatted") # documented mode, prints a detailed plan
To persist the JSON plan for the Job Doctor, tie it to the Spark application ID:
from pyspark.sql import Row
app_id = spark.sparkContext.applicationId
spark.createDataFrame(
[Row(applicationId=app_id, query_plan_json=plan_json)]
).write.mode("append").saveAsTable("job_doctor.query_plans")
A couple of caveats you should mention in a real blog:
- _jdf.queryExecution().optimizedPlan().toJSON() is not guaranteed to be stable across Spark versions. It's an advanced, "use at your own risk" trick.
- You don't need to capture plans for every single query, just key bottleneck notebooks or critical pipelines.
Even capturing a subset massively improves the quality of LLM explanations.
4. Capture Spark Config for Each Run
Fabric Spark lets you set configs at:
- Environment / pool level (resource profiles, environment settings)
- Notebook / job level (spark.conf.set(...))
- Pipeline activity level (Spark job settings)
Inside the running Spark job, you can capture the effective session config like this:
from pyspark.sql import Row
app_id = spark.sparkContext.applicationId
# spark.conf.getAll isn't exposed in every PySpark version, so combine the application's
# SparkConf with any session-level SQL settings that were explicitly set
conf_dict = dict(spark.sparkContext.getConf().getAll())
conf_dict.update({row["key"]: row["value"] for row in spark.sql("SET").collect()})
config_rows = [
Row(applicationId=app_id, key=k, value=v)
for k, v in conf_dict.items()
]
spark.createDataFrame(config_rows).write.mode("append").saveAsTable("job_doctor.spark_conf")
Now the Job Doctor can say things like:
- “AQE was disabled for this job.”
- “Shuffle partitions was left at default 200, which is low for your data size.”
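To surface those cases even before the LLM gets involved, a quick check over the captured table works. Here's a minimal sketch that assumes the job_doctor.spark_conf table defined above; the list of keys is just an example you'd extend.

from pyspark.sql import functions as F

conf = spark.table("job_doctor.spark_conf")

# Example config keys worth flagging; extend as needed
interesting_keys = ["spark.sql.adaptive.enabled", "spark.sql.shuffle.partitions"]

(conf
 .where(F.col("key").isin(interesting_keys))
 .groupBy("applicationId")
 .pivot("key", interesting_keys)   # one column per config key
 .agg(F.first("value"))
 .show(truncate=False))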
You’re building a small “Job Doctor mart” inside Fabric:
- job_doctor.raw_logs (from emitter)
- job_doctor.stage_metrics (aggregated)
- job_doctor.stage_issues (rule engine output)
- job_doctor.spark_conf (per-application configs)
- job_doctor.query_plans (optional)
All keyed by applicationId.
Part 2: Loading and Normalizing Spark Logs in a Fabric Lakehouse
Let’s assume you’ve done one-time wiring:
- Azure Storage container with Spark diagnostics
- OneLake shortcut from that container into a Lakehouse
- A Fabric Spark notebook attached to that Lakehouse
From that notebook:
logs_df = spark.read.json("Files/spark_diagnostics_raw")  # or wherever your shortcut lives
display(logs_df.limit(10))
You’ll see something like:
- timestamp
- category (DriverLog, ExecutorLog, EventLog, Metrics, …)
- applicationId
- properties (nested JSON with stage/task/metric detail)
The normalization step (which you can run as a scheduled pipeline) should:
- Filter down to metrics/events relevant for performance (e.g. task / stage metrics)
- Extract stageId, taskId, executorRunTime, shuffleReadBytes, etc., into top-level columns
- Persist the result as job_doctor.task_metrics (or similar)
For the rest of this post, we’ll assume you’ve already done that and have a table with columns:
- applicationId
- stageId
- taskId
- executorRunTime
- shuffleReadBytes
- shuffleWriteBytes
- memoryBytesSpilled
- diskBytesSpilled
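If you want a starting point for that normalization job, here's a minimal sketch. It assumes the metric records look like the sample shown earlier (one metricName/value pair per record under properties); the metric names and the shortcut path are placeholders you'd adjust for your emitter setup.

from pyspark.sql import functions as F

raw = spark.read.json("Files/spark_diagnostics_raw")  # or wherever your shortcut lives

# Metrics we care about downstream; adjust to whatever your emitter actually sends
metric_names = [
    "executorRunTime", "shuffleReadBytes", "shuffleWriteBytes",
    "memoryBytesSpilled", "diskBytesSpilled",
]

task_metrics = (
    raw
    .where(F.col("category") == "Metrics")
    .select(
        "applicationId",
        F.col("properties.stageId").alias("stageId"),
        F.col("properties.taskId").alias("taskId"),
        F.col("properties.metricName").alias("metricName"),
        F.col("properties.value").cast("long").alias("value"),
    )
    .groupBy("applicationId", "stageId", "taskId")
    .pivot("metricName", metric_names)  # one column per metric
    .agg(F.first("value"))
)

task_metrics.write.mode("overwrite").saveAsTable("job_doctor.task_metrics")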
Aggregating Stage Metrics in Fabric
Now we want to collapse per-task metrics into per-stage metrics per application.
In a Fabric notebook:
from pyspark.sql import functions as F
task_metrics = spark.table("job_doctor.task_metrics")
stage_metrics = (
task_metrics
.groupBy("applicationId", "stageId")
.agg(
F.countDistinct("taskId").alias("num_tasks"),
F.sum("executorRunTime").alias("total_task_runtime_ms"),
# Depending on Spark version, you may need percentile_approx instead
F.expr("percentile(executorRunTime, 0.95)").alias("p95_task_runtime_ms"),
F.max("executorRunTime").alias("max_task_runtime_ms"),
F.sum("shuffleReadBytes").alias("shuffle_read_bytes"),
F.sum("shuffleWriteBytes").alias("shuffle_write_bytes"),
F.sum("memoryBytesSpilled").alias("memory_spill_bytes"),
F.sum("diskBytesSpilled").alias("disk_spill_bytes"),
)
.withColumn(
"skew_ratio",
F.col("max_task_runtime_ms") /
F.when(F.col("p95_task_runtime_ms") == 0, 1).otherwise(F.col("p95_task_runtime_ms"))
)
.withColumn("shuffle_read_mb", F.col("shuffle_read_bytes") / (1024**2))
.withColumn("shuffle_write_mb", F.col("shuffle_write_bytes") / (1024**2))
.withColumn(
"spill_mb",
(F.col("memory_spill_bytes") + F.col("disk_spill_bytes")) / (1024**2)
)
)
stage_metrics.write.mode("overwrite").saveAsTable("job_doctor.stage_metrics")
This gives you a Fabric Lakehouse table with:
- skew_ratio
- shuffle_read_mb
- shuffle_write_mb
- spill_mb
- p95_task_runtime_ms
- num_tasks, total_task_runtime_ms, etc.
You can run this notebook:
- On a schedule via a Data Pipeline
- Or as a Data Engineering job configured in the workspace
Part 3: Adding a Rule Engine Inside Fabric
Now that the metrics are in a Lakehouse table, let’s add a simple rule engine in Python.
This will run in a Fabric notebook (or job) and write out issues per stage.
from pyspark.sql import Row, functions as F
stage_metrics = spark.table("job_doctor.stage_metrics")
# For simplicity, we'll collect to the driver here.
# This is fine if you don't have thousands of stages.
# For very large workloads, you'd instead do this via a UDF / mapInPandas / explode.
stage_rows = stage_metrics.collect()
Define some basic rules:
def detect_issues(stage_row):
issues = []
# 1. Skew detection
if stage_row.skew_ratio and stage_row.skew_ratio > 5:
issues.append({
"issue_id": "SKEWED_STAGE",
"severity": "High",
"details": f"Skew ratio {stage_row.skew_ratio:.1f}"
})
# 2. Large shuffle
total_shuffle_mb = (stage_row.shuffle_read_mb or 0) + (stage_row.shuffle_write_mb or 0)
if total_shuffle_mb > 10_000: # > 10 GB
issues.append({
"issue_id": "LARGE_SHUFFLE",
"severity": "High",
"details": f"Total shuffle {total_shuffle_mb:.1f} MB"
})
# 3. Excessive spill
if (stage_row.spill_mb or 0) > 1_000: # > 1 GB
issues.append({
"issue_id": "EXCESSIVE_SPILL",
"severity": "Medium",
"details": f"Spill {stage_row.spill_mb:.1f} MB"
})
return issues
Apply the rules and persist the output:
issue_rows = []
for r in stage_rows:
for issue in detect_issues(r):
issue_rows.append(Row(
applicationId=r.applicationId,
stageId=r.stageId,
issue_id=issue["issue_id"],
severity=issue["severity"],
details=issue["details"]
))
# An explicit schema keeps this working even if no issues were detected
issues_schema = "applicationId string, stageId long, issue_id string, severity string, details string"
issues_df = spark.createDataFrame(issue_rows, schema=issues_schema)
issues_df.write.mode("overwrite").saveAsTable("job_doctor.stage_issues")
Now you have a table of Spark issues detected per run inside your Lakehouse.
Later, the LLM will use these as structured hints.
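A quick sanity check on that table, for example counting detected issues per run and severity, looks like this:

from pyspark.sql import functions as F

issues = spark.table("job_doctor.stage_issues")

# How many issues of each severity were detected per application?
(issues
 .groupBy("applicationId", "severity")
 .agg(F.count("*").alias("issue_count"))
 .orderBy("applicationId", "severity")
 .show(truncate=False))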
Part 4: Bringing in the LLM — Turning Metrics into Diagnosis
So far, everything has been pure Spark in Fabric.
Now we want a model (e.g., Azure AI “Models as a Service” endpoint or Azure OpenAI) to turn:
- job_doctor.stage_metrics
- job_doctor.stage_issues
- job_doctor.spark_conf
- job_doctor.query_plans
into an actual diagnosis sheet a human can act on.
In Fabric, this is simplest from a Spark notebook using a Python HTTP client.
Below, I’ll show the pattern using an Azure AI serverless model endpoint (the one that uses model: "gpt-4.1" in the body).
1. Prepare the Prompt Payload
First, fetch the data for a single Spark application:
import json
from pyspark.sql import functions as F
app_id = "app-20240501123456-0001" # however you pick which run to diagnose
stages_df = spark.table("job_doctor.stage_metrics").where(F.col("applicationId") == app_id)
issues_df = spark.table("job_doctor.stage_issues").where(F.col("applicationId") == app_id)
conf_df = spark.table("job_doctor.spark_conf").where(F.col("applicationId") == app_id)
plans_df = spark.table("job_doctor.query_plans").where(F.col("applicationId") == app_id)
stages_json = stages_df.toPandas().to_dict(orient="records")
issues_json = issues_df.toPandas().to_dict(orient="records")
conf_json = conf_df.toPandas().to_dict(orient="records")
plans_json = plans_df.toPandas().to_dict(orient="records") # likely 0 or 1 row
Then build a compact but informative prompt:
prompt = f"""
You are an expert in optimizing Apache Spark jobs running on Microsoft Fabric.
Here is summarized telemetry for one Spark application (applicationId={app_id}):
Stage metrics (JSON):
{json.dumps(stages_json, indent=2)}
Detected issues (JSON):
{json.dumps(issues_json, indent=2)}
Spark configuration (key/value list):
{json.dumps(conf_json, indent=2)}
Query plans (optional, may be empty):
{json.dumps(plans_json, indent=2)}
Your tasks:
1. Identify the top 3–5 performance issues for this run.
2. For each, explain the root cause in plain language.
3. Provide concrete fixes tailored for Fabric Spark, including:
- spark.conf settings (for notebooks/jobs)
- suggestions for pipeline settings where relevant
- SQL/DataFrame code snippets
4. Estimate likely performance impact (e.g., "30–50% reduction in runtime").
5. Call out any risky or unsafe changes that should be tested carefully.
Return your answer as markdown.
"""
2. Call an Azure AI Model from Fabric Spark
For the serverless “Models as a Service” endpoint, the pattern looks like this:
import os
import requests
# Example: using Azure AI Models as a Service
# AZURE_AI_ENDPOINT might look like: https://models.inference.ai.azure.com
AZURE_AI_ENDPOINT = os.environ["AZURE_AI_ENDPOINT"]
AZURE_AI_KEY = os.environ["AZURE_AI_KEY"]
MODEL = "gpt-4.1" # or whatever model you've enabled
headers = {
    "Content-Type": "application/json",
    # Note: depending on the endpoint type, key auth may instead be
    # "Authorization": f"Bearer {AZURE_AI_KEY}"; check your endpoint's auth scheme.
    "api-key": AZURE_AI_KEY,
}
body = {
"model": MODEL,
"messages": [
{"role": "system", "content": "You are a helpful assistant for optimizing Spark jobs on Microsoft Fabric."},
{"role": "user", "content": prompt},
],
}
resp = requests.post(
    f"{AZURE_AI_ENDPOINT}/chat/completions",  # serverless Models-as-a-Service route; no deployment name in the path
    headers=headers,
    json=body,
)
resp.raise_for_status()
diagnosis = resp.json()["choices"][0]["message"]["content"]
If you instead use a provisioned Azure OpenAI resource, the URL shape is slightly different (you call /openai/deployments/<deploymentName>/chat/completions and omit the model field), but the rest of the logic is identical.
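For reference, a sketch of that Azure OpenAI variant might look like this; the environment variable names, deployment name, and API version are placeholders you'd swap for your own resource:

import os
import requests

AOAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]  # e.g. https://<resource>.openai.azure.com
AOAI_KEY = os.environ["AZURE_OPENAI_KEY"]
DEPLOYMENT = "gpt-4.1"       # the name you gave your deployment
API_VERSION = "2024-06-01"   # any chat-completions API version your resource supports

resp = requests.post(
    f"{AOAI_ENDPOINT}/openai/deployments/{DEPLOYMENT}/chat/completions",
    params={"api-version": API_VERSION},
    headers={"Content-Type": "application/json", "api-key": AOAI_KEY},
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant for optimizing Spark jobs on Microsoft Fabric."},
            {"role": "user", "content": prompt},
        ],
    },
)
resp.raise_for_status()
diagnosis = resp.json()["choices"][0]["message"]["content"]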
At this point, diagnosis is markdown you can:
- Render inline in the notebook with displayHTML
- Save into a Lakehouse table
- Feed into a Fabric semantic model for reporting
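For the inline option, here's a minimal sketch: Fabric notebooks expose a displayHTML helper, and wrapping the markdown in a <pre> block is the simplest way to show it without pulling in a markdown renderer.

import html

# Escape the model output so any HTML in it is shown literally, then render it in the notebook
displayHTML(f"<pre style='white-space: pre-wrap'>{html.escape(diagnosis)}</pre>")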
Part 5: What the Job Doctor’s Output Looks Like in Fabric
A good Job Doctor output for Fabric Spark might look like this (simplified):
🔎 Issue 1: Skewed Stage 4 (skew ratio 12.3)
What I see
- Stage 4 has a skew ratio of 12.3 (max task runtime vs. p95).
- This stage also reads ~18.2 GB via shuffle, which amplifies the imbalance.
Likely root cause
A join or aggregation keyed on a column where a few values dominate (e.g. a “default” ID, nulls, or a small set of hot keys). One partition ends up doing far more work than the others.
Fabric-specific fixes
In your notebook or job settings, enable Adaptive Query Execution and skew join handling:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
If the query is written in Spark SQL, the same settings apply; set them at the session or job level through Spark configuration.
If one side of the join is a small dimension table, add a broadcast hint:
SELECT /*+ BROADCAST(dim) */ f.*
FROM fact f
JOIN dim
ON f.key = dim.key;
Estimated impact:
30–50% reduction in total job runtime, depending on how skewed the key distribution is.
📦 Issue 2: Large Shuffle in Stage 2 (~19.7 GB)
What I see
- Stage 2 reads ~19.7 GB via shuffle.
- Shuffle partitions are set to 200 (Spark default).
Likely root cause
A join or aggregation is shuffling nearly the full dataset, but parallelism is low given the data volume. That leads to heavy tasks and increased risk of spill.
Fabric-specific fixes
Increase shuffle partitions for this job:
spark.conf.set("spark.sql.shuffle.partitions", "400")
For pipelines, set this at the Spark activity level under Spark configuration, or through your Fabric environment’s resource profile if you want a new default.
Also consider partitioning by the join key earlier in the pipeline:
df = df.repartition("customer_id")
Estimated impact:
More stable runtimes and reduced likelihood of spill; wall-clock improvements if your underlying capacity has enough cores.
💾 Issue 3: Spill to Disk (~1.8 GB) in Stage 3
What I see
- Stage 3 spills ~1.8 GB to disk.
- This correlates with under-parallelism or memory pressure.
Fabric-specific fixes
- Adjust cluster sizing via Fabric capacity / resource profiles (enough cores + memory per core).
- Increase spark.sql.shuffle.partitions as above.
- Avoid wide transformations producing huge intermediate rows early in the job; materialize smaller, more selective intermediates first.
You can persist the diagnosis text into a table:
from pyspark.sql import Row
spark.createDataFrame(
[Row(applicationId=app_id, diagnosis_markdown=diagnosis)]
).write.mode("append").saveAsTable("job_doctor.diagnoses")
Then you can build a Power BI report in Fabric bound to:
- job_doctor.diagnoses
- job_doctor.stage_metrics
- job_doctor.stage_issues
to create a “Spark Job Health” dashboard where:
- Rows = recent Spark runs
- Columns = severity, duration, shuffle size, spill, etc.
- A click opens the AI-generated diagnosis for that run
All inside the same workspace.
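If you'd rather bind the report to a single table, a per-run summary like the sketch below works; job_doctor.run_summary is a made-up table name and the aggregations are just examples, so adjust both to your needs.

from pyspark.sql import functions as F

stage_metrics = spark.table("job_doctor.stage_metrics")
issues = spark.table("job_doctor.stage_issues")

# One row per Spark application, with headline metrics plus an issue count
run_summary = (
    stage_metrics
    .groupBy("applicationId")
    .agg(
        F.sum("total_task_runtime_ms").alias("total_task_runtime_ms"),
        F.sum("shuffle_read_mb").alias("shuffle_read_mb"),
        F.sum("spill_mb").alias("spill_mb"),
    )
    .join(
        issues.groupBy("applicationId").agg(F.count("*").alias("issue_count")),
        "applicationId",
        "left",
    )
)

run_summary.write.mode("overwrite").saveAsTable("job_doctor.run_summary")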
Part 6: Stitching It All Together in Fabric
Let’s recap the full Fabric-native architecture.
1. Telemetry Ingestion (Environment / Emitter)
- Configure a Fabric environment for your Spark workloads.
- Add a Fabric Apache Spark diagnostic emitter to send logs/metrics to:
- Azure Storage (for Lakehouse shortcuts), or
- Log Analytics / Event Hubs if you prefer KQL or streaming paths.
- (Optional) From notebooks/pipelines, capture:
- Spark configs → job_doctor.spark_conf
- Query plans → job_doctor.query_plans
2. Normalization Job (Spark / Data Pipeline)
- Read raw diagnostics from Storage via a Lakehouse shortcut.
- Parse and flatten the records into per-task metrics.
- Aggregate per-stage metrics → job_doctor.stage_metrics.
- Evaluate the rule engine → job_doctor.stage_issues.
- Persist all of this into Lakehouse tables.
3. AI Diagnosis Job (Spark + Azure AI Models)
- For each new (or most expensive / slowest) application:
- Pull stage metrics, issues, configs, and query plans from Lakehouse.
- Construct a structured prompt.
- Call your Azure AI / Azure OpenAI endpoint from a Fabric Spark notebook.
- Store the markdown diagnosis in job_doctor.diagnoses (a sketch for picking which runs to diagnose follows this list).
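Here's a sketch of that selection step, assuming the tables defined earlier; it finds applications that have stage metrics but no stored diagnosis yet.

stage_metrics = spark.table("job_doctor.stage_metrics")
diagnoses = spark.table("job_doctor.diagnoses")

# Applications with stage metrics but no stored diagnosis yet
pending = (
    stage_metrics.select("applicationId").distinct()
    .join(diagnoses.select("applicationId").distinct(), "applicationId", "left_anti")
)

for row in pending.limit(20).collect():  # cap the batch size per scheduled run
    app_id = row["applicationId"]
    # ... build the prompt and call the model exactly as in Part 4 ...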
4. User Experience
- Fabric Notebook
- A “Run Job Doctor” cell or button that takes applicationId, calls the model, and displays the markdown inline.
- Data Pipeline / Job
- Scheduled daily to scan all runs from yesterday and generate diagnoses automatically.
- Power BI Report in Fabric
- A “Spark Job Health” dashboard showing:
- Top slowest/most expensive jobs
- Detected issues (skew, large shuffle, spill, config problems)
- AI recommendations, side-by-side with raw metrics
Everything lives in one Fabric workspace, using:
- Lakehouses for data
- Spark notebooks / pipelines for processing
- Azure AI models for reasoning
- Power BI for visualization
Why a Fabric-Specific Job Doctor Is Worth Building
Spark is Spark, but in Fabric the story is different:
- Spark jobs are tied closely to Lakehouses, Pipelines, Dataflows, and Power BI.
- You already have a single control plane for capacity, governance, cost, and monitoring.
- Logs, metrics, and reports can live right next to the workloads they describe.
That makes Fabric an ideal home for a Job Doctor:
- No extra infrastructure to stand up
- No random side services to glue together
- The telemetry you need is already flowing; you just have to catch and shape it
- AI can sit directly on top of your Lakehouse + monitoring data
With some Spark, a few Lakehouse tables, and an LLM, you can give every data engineer and analyst in your organization a “Spark performance expert” that’s always on call.
I’ve included a sample notebook you can use to get started on your Job Doctor today!
This post was created with help from (and was suggested to me by) ChatGPT Pro, using the 5.1 Thinking Model.






