Sparkwise: an “automated data engineering specialist” for Fabric Spark tuning

Spark tuning has a way of chewing up time: you start with something that “should be fine,” performance is off, costs creep up, and suddenly you’re deep in configs, Spark UI, and tribal knowledge trying to figure out what actually matters.

That’s why I’m excited to highlight sparkwise, an open-source Python package created by Santhosh Kumar Ravindran, one of my direct reports here at Microsoft. Santhosh built sparkwise to make Spark optimization in Microsoft Fabric less like folklore and more like a repeatable workflow: automated diagnostics, session profiling, and actionable recommendations to help teams drive better price-performance without turning every run into an investigation.

If you’ve ever thought, “I know something’s wrong, but I can’t quickly prove what to change,” sparkwise is aimed squarely at that gap. (PyPI)

As of January 5, 2026, the latest release is sparkwise 1.4.2 on PyPI. (PyPI)


The core idea: stop guessing, start diagnosing

Spark tuning often fails for two reasons:

  1. Too many knobs (Spark, Delta, Fabric-specific settings, runtime behavior).
  2. Not enough feedback (it’s hard to translate symptoms into the few changes that actually matter).

sparkwise attacks both.

It positions itself as an “automated Data Engineering specialist for Apache Spark on Microsoft Fabric,” offering:

  • Intelligent diagnostics
  • Configuration recommendations
  • Comprehensive session profiling
    …so you can get to the best price/performance outcome without turning every notebook run into a science project. (PyPI)

Why sparkwise exists (and the problems it explicitly targets)

From the project description, sparkwise focuses on the stuff that reliably burns time and money in real Fabric Spark work:

  • Cost optimization: detect configurations that waste capacity and extend runtime (PyPI)
  • Performance optimization: validate and enable Fabric-specific acceleration paths like Native Engine and resource profiles (PyPI)
  • Faster iteration: detect Starter Pool blockers that force slower cold starts (3–5 minutes is called out directly) (PyPI)
  • Learning & clarity: interactive Q&A across 133 Spark/Delta/Fabric configurations (PyPI)
  • Workload understanding: profiling across sessions, executors, jobs, and resources (PyPI)
  • Decision support: priority-ranked recommendations with impact analysis (PyPI)

If you’ve ever thought “I know something is off, but I can’t prove which change matters,” this is aimed squarely at you.


What you get: a feature tour that maps to real-world Spark pain

sparkwise’s feature set is broad, but it’s not random. It clusters nicely into a few “jobs to be done.”

1) Automated diagnostics (the fast “what’s wrong?” pass)

The diagnostics layer checks a bunch of high-impact areas, including:

  • Native Execution Engine: verifies Velox usage and detects fallbacks to row-based processing (PyPI)
  • Spark compute: analyzes Starter vs Custom Pool usage and flags immutable configs (PyPI)
  • Data skew detection: identifies imbalanced task distributions (PyPI)
  • Delta optimizations: checks V-Order, deletion vectors, optimize write, auto compaction (PyPI)
  • Runtime tuning: validates AQE, partition sizing, scheduler mode (PyPI)

This is the stuff that tends to produce outsized wins when it’s wrong.

2) Comprehensive profiling (the “what actually happened?” pass)

Once you’re past basic correctness, the next level is: where did time and resources go?

sparkwise includes profiling across:

  • session metadata and resource allocation
  • executor status and memory utilization
  • job/stage/task metrics and bottleneck detection
  • resource efficiency scoring and utilization analysis (PyPI)

3) Advanced performance analysis (built on real metrics)

One of the most interesting “newer” directions in sparkwise is leaning into actual observed execution metrics:

  • “Real metrics collection” using Spark stage/task data (vs estimates) (PyPI)
  • scalability prediction comparing Starter vs Custom Pool with vCore-hour calculations (PyPI)
  • stage timeline visualization (parallel vs sequential patterns) (PyPI)
  • efficiency analysis that quantifies wasted compute in vCore-hours (PyPI)

That’s the bridge between “it feels slow” and “here’s the measurable waste + the fix.”
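
To make the vCore-hour framing concrete, here's the back-of-the-envelope version of that waste math in plain Python. The numbers are invented purely for illustration; sparkwise derives the real figures from observed session metrics:

allocated_vcores = 32          # e.g., a session that allocates 32 vCores
session_hours = 1.5            # wall-clock duration of the session
avg_utilization = 0.40         # fraction of the allocated cores actually busy

allocated_vcore_hours = allocated_vcores * session_hours        # 48.0
used_vcore_hours = allocated_vcore_hours * avg_utilization      # 19.2
wasted_vcore_hours = allocated_vcore_hours - used_vcore_hours   # 28.8

print(f"Wasted compute: {wasted_vcore_hours:.1f} vCore-hours")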

4) Advanced skew detection (because skew kills Spark)

Skew is one of those problems that can hide behind averages and ruin everything anyway.

sparkwise’s skew tooling includes:

  • straggler detection via task duration variance (PyPI)
  • partition-level analysis with statistical metrics (PyPI)
  • skewed join detection with mitigation suggestions (broadcast vs salting strategies) (PyPI)
  • automatic mitigation guidance with code examples (salting, AQE, broadcast) (PyPI), as sketched below
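
The mitigations it points to are standard Spark techniques rather than anything sparkwise-specific. As a refresher, here's a minimal salting sketch in plain PySpark; the toy DataFrames are placeholders for your real tables:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy fact/dim tables; in a real workload one or a few join keys would dominate.
fact = spark.range(1_000_000).withColumn("key", F.col("id") % 10)
dim = spark.range(10).withColumnRenamed("id", "key")

SALT_BUCKETS = 8

# Add a random salt to the skewed (fact) side...
fact_salted = fact.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("bigint"))

# ...and replicate the small (dim) side across the same salt values.
dim_salted = dim.crossJoin(spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt"))

# Joining on (key, salt) spreads each hot key across SALT_BUCKETS tasks.
joined = fact_salted.join(dim_salted, ["key", "salt"]).drop("salt")

In practice, enabling AQE's skew-join handling (spark.sql.adaptive.skewJoin.enabled) is often the lower-effort first move, and broadcasting the small side avoids the shuffle altogether when it fits in memory.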

5) SQL query plan analysis (spotting anti-patterns early)

For teams living in Spark SQL / DataFrames, this is huge (a quick manual explain() check follows the list):

  • anti-pattern detection (cartesian products, full scans, excessive shuffles) (PyPI)
  • Native Engine compatibility checks (PyPI)
  • Z-Order recommendations based on cardinality (PyPI)
  • caching opportunity detection for repeated scans/subqueries (PyPI)
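
You can also sanity-check many of these anti-patterns by hand before (or after) running the automated checks. A minimal sketch in plain PySpark, with toy DataFrames standing in for your own tables:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

fact = spark.range(1_000_000).withColumnRenamed("id", "key")
dim = spark.range(100).withColumnRenamed("id", "key")

# In the formatted plan, watch for CartesianProduct nodes, repeated FileScan
# nodes over the same table, or a SortMergeJoin where you expected a broadcast.
fact.join(dim, "key").explain(mode="formatted")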

6) Storage optimization suite (new in v1.4.0+)

This is one of the clearest “practical ops” expansions:

  • small file detection for Delta tables (default threshold is configurable; example shows <10MB) (PyPI)
  • VACUUM ROI calculator using OneLake pricing assumptions in the project docs (PyPI)
  • partition effectiveness analysis and over/under-partitioning detection (PyPI)
  • “run all storage checks in one command” workflows (PyPI)

In other words: not just “your table is messy,” but “here’s why it costs you, and what to do.”

7) Interactive configuration assistant (the “what does this do?” superpower)

This is deceptively valuable. sparkwise provides:

  • Q&A for 133 documented configurations spanning Spark, Delta, and Fabric-specific settings (Runtime 1.2 configs are called out explicitly) (PyPI)
  • context-aware guidance with workload-specific recommendations (PyPI)
  • explicit support for Fabric resource profiles (writeHeavy, readHeavyForSpark, readHeavyForPBI) (PyPI)
  • keyword search across config knowledge (PyPI)

This is the difference between “go read 9 docs” and “ask one question and move on.”


Quick start: the 3 fastest ways to get value

Install

pip install sparkwise

(PyPI)

1) Run a full diagnostic on your current session

from sparkwise import diagnose

diagnose.analyze()

(PyPI)

2) Ask about a specific Spark/Fabric config

from sparkwise import ask

ask.config("spark.native.enabled")
ask.search("optimize")

(PyPI)

3) Profile your run (and pinpoint bottlenecks)

from sparkwise import (
    profile, profile_executors, profile_jobs, profile_resources,
    predict_scalability, show_timeline, analyze_efficiency
)

profile()
profile_executors()
profile_jobs()
profile_resources()

predict_scalability()
show_timeline()
analyze_efficiency()

(PyPI)


CLI workflows (especially useful for storage optimization)

If you prefer CLIs (or want repeatable checks in scripts), sparkwise includes commands like:

sparkwise storage analyze Tables/mytable
sparkwise storage small-files Tables/mytable --threshold 10
sparkwise storage vacuum-roi Tables/mytable --retention-hours 168
sparkwise storage partitions Tables/mytable

(PyPI)

That’s a clean “ops loop” for keeping Delta tables healthy.


A realistic “first hour” workflow I’d recommend

If you’re trying sparkwise on a real Fabric notebook today, here’s a practical order of operations:

  1. Run diagnose.analyze() first
    Use it as your “triage” to catch the high-impact misconfigs (Native Engine fallback, AQE off, Starter Pool blockers). (PyPI)
  2. Use ask.config() for any red/yellow item you don’t fully understand
    The point is speed: read the explanation in context and decide. (PyPI)
  3. Profile the session
    If the job is still slow/expensive after obvious fixes, profile and look for the real culprit: skew, shuffle pressure, poor parallelism, memory pressure. (PyPI)
  4. If the job smells like skew, use advanced skew detection
    Especially for joins and wide aggregations. (PyPI)
  5. If your tables are growing, run storage analysis early
    Small files and weak partitioning quietly tax everything downstream. (PyPI)

That flow is how you turn “tuning” from an art project into a checklist.
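
If it helps, that first pass fits in a single notebook cell using only the calls shown earlier (the config key below is just the quick-start example; swap in whatever the diagnostics actually flag):

from sparkwise import diagnose, ask, profile, analyze_efficiency

diagnose.analyze()                    # 1. triage the session
ask.config("spark.native.enabled")    # 2. understand anything flagged
profile()                             # 3. see where time and resources went
analyze_efficiency()                  # 4. quantify the waste in vCore-hours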


Closing: why this matters for Fabric teams

I’m amplifying sparkwise because it’s the kind of contribution that scales beyond the person who wrote it. Santhosh took hard-earned, real-world Fabric Spark tuning experience and turned it into something other engineers can use immediately — a practical way to spot waste, unblock faster iteration, and make smarter performance tradeoffs.

If your team runs Fabric Spark workloads regularly, treat sparkwise like a lightweight tuning partner:

  1. install it,
  2. run the diagnostics,
  3. act on one recommendation,
  4. measure the improvement,
  5. repeat.

And if you end up with feedback or feature ideas, even better — that’s how tools like this get sharper and more broadly useful.

This post was written with help from ChatGPT 5.2

Build Your Own Spark Job Doctor in Microsoft Fabric

Microsoft Fabric makes it incredibly easy to spin up Spark workloads: notebooks, Lakehouse pipelines, dataflows, SQL + Spark hybrid architectures—the whole buffet.

What’s still hard?
Knowing why a given Spark job is slow, expensive, or flaky.

  • A Lakehouse pipeline starts timing out.
  • A notebook that used to finish in 5 minutes is now taking 25.
  • Costs spike because one model training job is shuffling half the lake.

You open the Spark UI, click around a few stages, stare at shuffle graphs, and say the traditional words of Spark debugging:

“Huh.”

This is where an AI assistant should exist.

In this post, we’ll walk through how to build exactly that for Fabric Spark: a Job Doctor that:

  • Reads Spark telemetry from your Fabric environment
  • Detects issues like skew, large shuffles, spill, and bad configuration
  • Uses a large language model (LLM) to explain what went wrong
  • Produces copy-pasteable fixes in Fabric notebooks / pipelines
  • Runs inside Fabric using Lakehouses, notebooks, and Azure AI models

This is not a fake product announcement. This is a blueprint you can actually build.


What Is the Fabric “Job Doctor”?

At a high level, the Job Doctor is:

A Fabric-native analytics + AI layer that continuously reads Spark job history, detects common performance anti-patterns, and generates human-readable, prescriptive recommendations.

Concretely, it does three main things:

  1. Collects Spark job telemetry from Fabric
    • Spark application metrics (tasks, stages, shuffles, spills)
    • Spark logs & events (Driver/Executor/Event logs)
    • Optional query plans
    • Spark session configs
  2. Analyzes jobs using rules + metrics
    • Identifies skew, large shuffles, spill, etc.
    • Scores each job run and surfaces the top issues.
  3. Uses an LLM to generate a “diagnosis sheet”
    • Root cause in plain English
    • Fixes with code + config snippets for Fabric Spark
    • Expected impact on performance/cost

Let’s build it step by step, Fabric-style.


Part 1: Getting Spark Telemetry Out of Fabric

Before you can diagnose anything, you need the raw signals. In Fabric, there are three main ways to see what Spark is doing:

  1. Fabric Apache Spark diagnostic emitter → logs/metrics for each application
  2. Spark application details (UI / REST)
  3. In-job logging from notebooks/pipelines (e.g., configs, query plans)

You don’t have to use all three, but combining them gives you enough for a very capable Job Doctor.


1. Configure the Fabric Apache Spark Diagnostic Emitter

The core telemetry pipeline starts with the Fabric Apache Spark diagnostic emitter, configured on a Fabric environment.

At a high level, you:

  1. Create or use an environment for your Spark workloads.
  2. Configure one or more diagnostic emitters on that environment.
  3. Point each emitter to a sink such as:
    • Azure Storage (Blob, ADLS)
    • Azure Log Analytics
    • Azure Event Hubs

For example, an emitter to Azure Storage might be configured (conceptually) like this:

spark.synapse.diagnostic.emitters: MyStorageEmitter
spark.synapse.diagnostic.emitter.MyStorageEmitter.type: AzureStorage
spark.synapse.diagnostic.emitter.MyStorageEmitter.categories: DriverLog,ExecutorLog,EventLog,Metrics
spark.synapse.diagnostic.emitter.MyStorageEmitter.uri: https://<account>.blob.core.windows.net/<container>/<folder>
spark.synapse.diagnostic.emitter.MyStorageEmitter.auth: AccessKey
spark.synapse.diagnostic.emitter.MyStorageEmitter.secret: <storage-access-key>

Once this is in place:

  • Every Spark application (notebook, job, pipeline activity that spins up Spark) will emit diagnostic records.
  • Those records land as JSON lines describing driver logs, executor logs, Spark listener events, and metrics.

From there, you can:

  • If using Storage: Create a shortcut in a Lakehouse pointing at the container/folder.
  • If using Log Analytics: Build KQL queries or export into Fabric (e.g., into a KQL DB or as files you later hydrate into a Lakehouse).

We’ll assume the storage pattern for the rest of this post:

Spark app → Fabric environment with diagnostic emitter → Azure Storage → OneLake shortcut → Lakehouse.


2. Shape of the Raw Logs (and Why You’ll Normalize Them)

The emitter doesn’t give you a nice stageId / taskId table out of the box. Instead, you’ll see records like:

{
  "timestamp": "2024-05-01T12:34:56Z",
  "category": "Metrics",
  "applicationId": "app-20240501123456-0001",
  "properties": {
    "metricName": "executorRunTime",
    "stageId": 4,
    "taskId": 123,
    "value": 9182,
    "otherFields": "..."
  }
}

Or an EventLog record with a payload that looks like the Spark listener event.

To build a Job Doctor, you’ll:

  1. Read the JSON lines into Fabric Spark
  2. Explode / parse the properties payload
  3. Aggregate per-task metrics into per-stage metrics for each application

We’ll skip the exact parsing details (they depend on how you set up the emitter and which events/metrics you enable) and assume that after a normalization job, you have a table with one row per (applicationId, stageId, taskId).

That’s what the next sections use.


3. Capturing Query Plans in Fabric (Optional, but Powerful)

Spark query plans are gold when you’re trying to answer why a stage created a huge shuffle or why a broadcast join didn’t happen.

There isn’t yet a first-class “export query plan as JSON” API in PySpark, but in Fabric notebooks you can use a (semi-internal) trick that works today:

import json

df = ...  # some DataFrame you care about

# Advanced / internal: works today but isn't a public, stable API
plan_json = json.loads(df._jdf.queryExecution().optimizedPlan().toJSON())

You can also log the human-readable plan:

df.explain(mode="formatted")  # documented mode, prints a detailed plan

To persist the JSON plan for the Job Doctor, tie it to the Spark application ID:

from pyspark.sql import Row

app_id = spark.sparkContext.applicationId

spark.createDataFrame(
    [Row(applicationId=app_id, query_plan_json=plan_json)]
).write.mode("append").saveAsTable("job_doctor.query_plans")

A couple of caveats you should mention in a real blog:

  • _jdf.queryExecution().optimizedPlan().toJSON() is not guaranteed to be stable across Spark versions. It’s an advanced, “use at your own risk” trick.
  • You don’t need to capture plans for every single query—just key bottleneck notebooks or critical pipelines.

Even capturing a subset massively improves the quality of LLM explanations.


4. Capture Spark Config for Each Run

Fabric Spark lets you set configs at:

  • Environment / pool level (resource profiles, environment settings)
  • Notebook / job level (spark.conf.set(...))
  • Pipeline activity level (Spark job settings)

Inside the running Spark job, you can capture the effective session config like this:

from pyspark.sql import Row

app_id = spark.sparkContext.applicationId
conf_dict = dict(spark.conf.getAll())  # session-level config

config_rows = [
    Row(applicationId=app_id, key=k, value=v)
    for k, v in conf_dict.items()
]

spark.createDataFrame(config_rows).write.mode("append").saveAsTable("job_doctor.spark_conf")

Now the Job Doctor can say things like:

  • “AQE was disabled for this job.”
  • “Shuffle partitions was left at default 200, which is low for your data size.”

You’re building a small “Job Doctor mart” inside Fabric:

  • job_doctor.raw_logs (from emitter)
  • job_doctor.stage_metrics (aggregated)
  • job_doctor.stage_issues (rule engine output)
  • job_doctor.spark_conf (per-application configs)
  • job_doctor.query_plans (optional)

All keyed by applicationId.


Part 2: Loading and Normalizing Spark Logs in a Fabric Lakehouse

Let’s assume you’ve done one-time wiring:

  • Azure Storage container with Spark diagnostics
  • OneLake shortcut from that container into a Lakehouse
  • A Fabric Spark notebook attached to that Lakehouse

From that notebook:

logs_df = spark.read.json("Files/spark_diagnostics_raw")  # or wherever your shortcut lands
display(logs_df.limit(10))

You’ll see something like:

  • timestamp
  • category (DriverLog, ExecutorLog, EventLog, Metrics, …)
  • applicationId
  • properties (nested JSON with stage/task/metric detail)

The normalization step (which you can run as a scheduled pipeline) should:

  1. Filter down to metrics/events relevant for performance (e.g. task / stage metrics)
  2. Extract stageId, taskId, executorRunTime, shuffleReadBytes, etc., into top-level columns
  3. Persist the result as job_doctor.task_metrics (or similar)

For the rest of this post, we’ll assume you’ve already done that and have a table with the columns below (a minimal flattening sketch follows the list):

  • applicationId
  • stageId
  • taskId
  • executorRunTime
  • shuffleReadBytes
  • shuffleWriteBytes
  • memoryBytesSpilled
  • diskBytesSpilled
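
If you're starting from the raw emitter output, the flattening job might look roughly like this. It follows the one-metric-per-record shape shown earlier; the path and metric names are placeholders, so adjust them to whatever categories and metrics you actually enabled:

from pyspark.sql import functions as F

# Raw emitter output, read via the OneLake shortcut (path is illustrative).
raw = spark.read.json("Files/spark_diagnostics_raw")

metric_names = [
    "executorRunTime", "shuffleReadBytes", "shuffleWriteBytes",
    "memoryBytesSpilled", "diskBytesSpilled",
]

# One metric per record (metricName/value), so pivot into one row per task.
task_metrics = (
    raw
    .where(F.col("category") == "Metrics")
    .select(
        "applicationId",
        F.col("properties.stageId").alias("stageId"),
        F.col("properties.taskId").alias("taskId"),
        F.col("properties.metricName").alias("metricName"),
        F.col("properties.value").alias("value"),
    )
    .groupBy("applicationId", "stageId", "taskId")
    .pivot("metricName", metric_names)
    .agg(F.first("value"))
)

task_metrics.write.mode("overwrite").saveAsTable("job_doctor.task_metrics")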

Aggregating Stage Metrics in Fabric

Now we want to collapse per-task metrics into per-stage metrics per application.

In a Fabric notebook:

from pyspark.sql import functions as F

task_metrics = spark.table("job_doctor.task_metrics")

stage_metrics = (
    task_metrics
    .groupBy("applicationId", "stageId")
    .agg(
        F.countDistinct("taskId").alias("num_tasks"),
        F.sum("executorRunTime").alias("total_task_runtime_ms"),
        # Depending on Spark version, you may need percentile_approx instead
        F.expr("percentile(executorRunTime, 0.95)").alias("p95_task_runtime_ms"),
        F.max("executorRunTime").alias("max_task_runtime_ms"),
        F.sum("shuffleReadBytes").alias("shuffle_read_bytes"),
        F.sum("shuffleWriteBytes").alias("shuffle_write_bytes"),
        F.sum("memoryBytesSpilled").alias("memory_spill_bytes"),
        F.sum("diskBytesSpilled").alias("disk_spill_bytes"),
    )
    .withColumn(
        "skew_ratio",
        F.col("max_task_runtime_ms") /
        F.when(F.col("p95_task_runtime_ms") == 0, 1).otherwise(F.col("p95_task_runtime_ms"))
    )
    .withColumn("shuffle_read_mb", F.col("shuffle_read_bytes") / (1024**2))
    .withColumn("shuffle_write_mb", F.col("shuffle_write_bytes") / (1024**2))
    .withColumn(
        "spill_mb",
        (F.col("memory_spill_bytes") + F.col("disk_spill_bytes")) / (1024**2)
    )
)

stage_metrics.write.mode("overwrite").saveAsTable("job_doctor.stage_metrics")

This gives you a Fabric Lakehouse table with:

  • skew_ratio
  • shuffle_read_mb
  • shuffle_write_mb
  • spill_mb
  • p95_task_runtime_ms
  • num_tasks, total_task_runtime_ms, etc.

You can run this notebook:

  • On a schedule via a Data Pipeline
  • Or as a Data Engineering job configured in the workspace

Part 3: Adding a Rule Engine Inside Fabric

Now that the metrics are in a Lakehouse table, let’s add a simple rule engine in Python.

This will run in a Fabric notebook (or job) and write out issues per stage.

from pyspark.sql import Row, functions as F

stage_metrics = spark.table("job_doctor.stage_metrics")

# For simplicity, we'll collect to the driver here.
# This is fine if you don't have thousands of stages.
# For very large workloads, you'd instead do this via a UDF / mapInPandas / explode.
stage_rows = stage_metrics.collect()

Define some basic rules:

def detect_issues(stage_row):
    issues = []

    # 1. Skew detection
    if stage_row.skew_ratio and stage_row.skew_ratio > 5:
        issues.append({
            "issue_id": "SKEWED_STAGE",
            "severity": "High",
            "details": f"Skew ratio {stage_row.skew_ratio:.1f}"
        })

    # 2. Large shuffle
    total_shuffle_mb = (stage_row.shuffle_read_mb or 0) + (stage_row.shuffle_write_mb or 0)
    if total_shuffle_mb > 10_000:  # > 10 GB
        issues.append({
            "issue_id": "LARGE_SHUFFLE",
            "severity": "High",
            "details": f"Total shuffle {total_shuffle_mb:.1f} MB"
        })

    # 3. Excessive spill
    if (stage_row.spill_mb or 0) > 1_000:  # > 1 GB
        issues.append({
            "issue_id": "EXCESSIVE_SPILL",
            "severity": "Medium",
            "details": f"Spill {stage_row.spill_mb:.1f} MB"
        })

    return issues

Apply the rules and persist the output:

issue_rows = []

for r in stage_rows:
    for issue in detect_issues(r):
        issue_rows.append(Row(
            applicationId=r.applicationId,
            stageId=r.stageId,
            issue_id=issue["issue_id"],
            severity=issue["severity"],
            details=issue["details"]
        ))

# Note: if a run produced no detected issues, pass an explicit schema here,
# since createDataFrame() can't infer one from an empty list.
issues_df = spark.createDataFrame(issue_rows)

issues_df.write.mode("overwrite").saveAsTable("job_doctor.stage_issues")

Now you have a table of Spark issues detected per run inside your Lakehouse.

Later, the LLM will use these as structured hints.


Part 4: Bringing in the LLM — Turning Metrics into Diagnosis

So far, everything has been pure Spark in Fabric.

Now we want a model (e.g., Azure AI “Models as a Service” endpoint or Azure OpenAI) to turn:

  • job_doctor.stage_metrics
  • job_doctor.stage_issues
  • job_doctor.spark_conf
  • job_doctor.query_plans

into an actual diagnosis sheet a human can act on.

In Fabric, this is simplest from a Spark notebook using a Python HTTP client.

Below, I’ll show the pattern using an Azure AI serverless model endpoint (the one that uses model: "gpt-4.1" in the body).


1. Prepare the Prompt Payload

First, fetch the data for a single Spark application:

import json
from pyspark.sql import functions as F

app_id = "app-20240501123456-0001"  # however you pick which run to diagnose

stages_df = spark.table("job_doctor.stage_metrics").where(F.col("applicationId") == app_id)
issues_df = spark.table("job_doctor.stage_issues").where(F.col("applicationId") == app_id)
conf_df   = spark.table("job_doctor.spark_conf").where(F.col("applicationId") == app_id)
plans_df  = spark.table("job_doctor.query_plans").where(F.col("applicationId") == app_id)

stages_json = stages_df.toPandas().to_dict(orient="records")
issues_json = issues_df.toPandas().to_dict(orient="records")
conf_json   = conf_df.toPandas().to_dict(orient="records")
plans_json  = plans_df.toPandas().to_dict(orient="records")  # likely 0 or 1 row

Then build a compact but informative prompt:

prompt = f"""
You are an expert in optimizing Apache Spark jobs running on Microsoft Fabric.

Here is summarized telemetry for one Spark application (applicationId={app_id}):

Stage metrics (JSON):
{json.dumps(stages_json, indent=2)}

Detected issues (JSON):
{json.dumps(issues_json, indent=2)}

Spark configuration (key/value list):
{json.dumps(conf_json, indent=2)}

Query plans (optional, may be empty):
{json.dumps(plans_json, indent=2)}

Your tasks:
1. Identify the top 3–5 performance issues for this run.
2. For each, explain the root cause in plain language.
3. Provide concrete fixes tailored for Fabric Spark, including:
   - spark.conf settings (for notebooks/jobs)
   - suggestions for pipeline settings where relevant
   - SQL/DataFrame code snippets
4. Estimate likely performance impact (e.g., "30–50% reduction in runtime").
5. Call out any risky or unsafe changes that should be tested carefully.

Return your answer as markdown.
"""


2. Call an Azure AI Model from Fabric Spark

For the serverless “Models as a Service” endpoint, the pattern looks like this:

import os
import requests

# Example: using Azure AI Models as a Service
# AZURE_AI_ENDPOINT might look like: https://models.inference.ai.azure.com
AZURE_AI_ENDPOINT = os.environ["AZURE_AI_ENDPOINT"]
AZURE_AI_KEY      = os.environ["AZURE_AI_KEY"]

MODEL = "gpt-4.1"  # or whatever model you've enabled

headers = {
    "Content-Type": "application/json",
    "api-key": AZURE_AI_KEY,
}

body = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant for optimizing Spark jobs on Microsoft Fabric."},
        {"role": "user", "content": prompt},
    ],
}

resp = requests.post(
    f"{AZURE_AI_ENDPOINT}/openai/chat/completions",
    headers=headers,
    json=body,
)

resp.raise_for_status()
diagnosis = resp.json()["choices"][0]["message"]["content"]

If you instead use a provisioned Azure OpenAI resource, the URL shape is slightly different (you call /openai/deployments/<deploymentName>/chat/completions and omit the model field), but the rest of the logic is identical.
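
For reference, the provisioned Azure OpenAI variant looks roughly like this. The resource name, deployment name, and api-version are placeholders; use whatever your resource exposes, and note that this reuses requests, body, and AZURE_AI_KEY from the previous cell:

# Provisioned Azure OpenAI variant -- placeholder names throughout.
AZURE_OPENAI_ENDPOINT = "https://<your-resource>.openai.azure.com"
DEPLOYMENT = "<your-deployment-name>"
API_VERSION = "2024-02-01"  # pick an api-version your resource supports

resp = requests.post(
    f"{AZURE_OPENAI_ENDPOINT}/openai/deployments/{DEPLOYMENT}/chat/completions",
    params={"api-version": API_VERSION},
    headers={"Content-Type": "application/json", "api-key": AZURE_AI_KEY},
    json={"messages": body["messages"]},  # same messages, no "model" field
)
resp.raise_for_status()
diagnosis = resp.json()["choices"][0]["message"]["content"]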

At this point, diagnosis is markdown you can:

  • Render inline in the notebook with displayHTML (quick sketch below)
  • Save into a Lakehouse table
  • Feed into a Fabric semantic model for reporting
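
For the inline option, one simple approach is to convert the markdown to HTML first. This assumes the third-party markdown package is available in your session (install it with %pip install markdown, or just print(diagnosis) instead):

import markdown  # third-party package; not guaranteed to be preinstalled

# displayHTML is available inside Fabric notebooks.
displayHTML(markdown.markdown(diagnosis))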

Part 5: What the Job Doctor’s Output Looks Like in Fabric

A good Job Doctor output for Fabric Spark might look like this (simplified):


🔎 Issue 1: Skewed Stage 4 (skew ratio 12.3)

What I see

  • Stage 4 has a skew ratio of 12.3 (max task runtime vs. p95).
  • This stage also reads ~18.2 GB via shuffle, which amplifies the imbalance.

Likely root cause

A join or aggregation keyed on a column where a few values dominate (e.g. a “default” ID, nulls, or a small set of hot keys). One partition ends up doing far more work than the others.

Fabric-specific fixes

In your notebook or job settings, enable Adaptive Query Execution and skew join handling:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

If the query is written in Spark SQL (for example, via spark.sql in a notebook), enable AQE at the session/job level through Spark configuration.

If one side of the join is a small dimension table, add a broadcast hint:

SELECT /*+ BROADCAST(dim) */ f.*
FROM fact f
JOIN dim
  ON f.key = dim.key;

Estimated impact:
30–50% reduction in total job runtime, depending on how skewed the key distribution is.


📦 Issue 2: Large Shuffle in Stage 2 (~19.7 GB)

What I see

  • Stage 2 reads ~19.7 GB via shuffle.
  • Shuffle partitions are set to 200 (Spark default).

Likely root cause

A join or aggregation is shuffling nearly the full dataset, but parallelism is low given the data volume. That leads to heavy tasks and increased risk of spill.

Fabric-specific fixes

Increase shuffle partitions for this job:

spark.conf.set("spark.sql.shuffle.partitions", "400")

For pipelines, set this at the Spark activity level under Spark configuration, or through your Fabric environment’s resource profile if you want a new default.

Also consider partitioning by the join key earlier in the pipeline:

df = df.repartition("customer_id")

Estimated impact:
More stable runtimes and reduced likelihood of spill; wall-clock improvements if your underlying capacity has enough cores.


💾 Issue 3: Spill to Disk (~1.8 GB) in Stage 3

What I see

  • Stage 3 spills ~1.8 GB to disk.
  • This correlates with under-parallelism or memory pressure.

Fabric-specific fixes

  • Adjust cluster sizing via Fabric capacity / resource profiles (enough cores + memory per core).
  • Increase spark.sql.shuffle.partitions as above.
  • Avoid wide transformations producing huge intermediate rows early in the job; materialize smaller, more selective intermediates first.

You can persist the diagnosis text into a table:

from pyspark.sql import Row

spark.createDataFrame(
    [Row(applicationId=app_id, diagnosis_markdown=diagnosis)]
).write.mode("append").saveAsTable("job_doctor.diagnoses")

Then you can build a Power BI report in Fabric bound to:

  • job_doctor.diagnoses
  • job_doctor.stage_metrics
  • job_doctor.stage_issues

to create a “Spark Job Health” dashboard where:

  • Rows = recent Spark runs
  • Columns = severity, duration, shuffle size, spill, etc.
  • A click opens the AI-generated diagnosis for that run

All inside the same workspace.


Part 6: Stitching It All Together in Fabric

Let’s recap the full Fabric-native architecture.

1. Telemetry Ingestion (Environment / Emitter)

  • Configure a Fabric environment for your Spark workloads.
  • Add a Fabric Apache Spark diagnostic emitter to send logs/metrics to:
    • Azure Storage (for Lakehouse shortcuts), or
    • Log Analytics / Event Hubs if you prefer KQL or streaming paths.
  • (Optional) From notebooks/pipelines, capture:
    • Spark configs → job_doctor.spark_conf
    • Query plans → job_doctor.query_plans

2. Normalization Job (Spark / Data Pipeline)

  • Read raw diagnostics from Storage via a Lakehouse shortcut.
  • Parse and flatten the records into per-task metrics.
  • Aggregate per-stage metrics → job_doctor.stage_metrics.
  • Evaluate rule engine → job_doctor.stage_issues.
  • Persist all of this into Lakehouse tables.

3. AI Diagnosis Job (Spark + Azure AI Models)

  • For each new (or most expensive / slowest) application:
    • Pull stage metrics, issues, configs, and query plans from Lakehouse.
    • Construct a structured prompt.
    • Call your Azure AI / Azure OpenAI endpoint from a Fabric Spark notebook.
    • Store the markdown diagnosis in job_doctor.diagnoses.

4. User Experience

  • Fabric Notebook
    • A “Run Job Doctor” cell or button that takes applicationId, calls the model, and displays the markdown inline.
  • Data Pipeline / Job
    • Scheduled daily to scan all runs from yesterday and generate diagnoses automatically.
  • Power BI Report in Fabric
    • “Spark Job Health” dashboard showing:
      • Top slowest/most expensive jobs
      • Detected issues (skew, large shuffle, spill, config problems)
      • AI recommendations, side-by-side with raw metrics

Everything lives in one Fabric workspace, using:

  • Lakehouses for data
  • Spark notebooks / pipelines for processing
  • Azure AI models for reasoning
  • Power BI for visualization

Why a Fabric-Specific Job Doctor Is Worth Building

Spark is Spark, but in Fabric the story is different:

  • Spark jobs are tied closely to Lakehouses, Pipelines, Dataflows, and Power BI.
  • You already have a single control plane for capacity, governance, cost, and monitoring.
  • Logs, metrics, and reports can live right next to the workloads they describe.

That makes Fabric an ideal home for a Job Doctor:

  • No extra infrastructure to stand up
  • No random side services to glue together
  • The telemetry you need is already flowing; you just have to catch and shape it
  • AI can sit directly on top of your Lakehouse + monitoring data

With some Spark, a few Lakehouse tables, and an LLM, you can give every data engineer and analyst in your organization a “Spark performance expert” that’s always on call.

I’ve included a sample notebook you can use to get started on your Job Doctor today!


This post was created with help from (and was suggested to me by) ChatGPT Pro using the 5.1 Thinking model

Humans + Machines: From Co-Pilots to Convergence — A Friendly Response to Josh Caplan’s “Interview with AI”

1. Setting the Table

Josh, I loved how you framed your conversation with ChatGPT-4o around three crisp horizons — 5, 25 and 100 years. It’s a structure that forces us to check our near-term expectations against our speculative impulses. Below I’ll walk through each horizon, point out where my own analysis aligns or diverges, and defend those positions with the latest data and research. 

2. Horizon #1 (≈ 2025-2030): The Co-Pilot Decade

Where we agree

You write that “AI will write drafts, summarize meetings, and surface insights … accelerating workflows without replacing human judgment.” Reality is already catching up:

A May 2025 survey of 645 engineers found 90% of teams are now using AI tools, up from 61% a year earlier; 62% report at least a 25% productivity boost.

Early enterprise roll-outs of Microsoft 365 Copilot show time savings of 30–60 minutes per user per day and cycle-time cuts on multi-week processes down to 24 hours. 

These numbers vindicate your “co-pilot” metaphor: narrow-scope models already augment search, summarization and code, freeing humans for higher-order decisions.

Where I’m less sanguine

The same studies point to integration debt: leaders underestimate the cost of securing data pipes, redesigning workflows and upskilling middle management to interpret AI output. Until those invisible costs are budgeted up-front, the productivity bump you forecast could flatten.

3. Horizon #2 (≈ 2050): Partners in Intelligence

Your claim: By 2050 the line between “tool” and “partner” blurs; humans focus on ethics, empathy and strategy while AI scales logic and repetition. 

Supportive evidence

A June 2025 research agenda on AI-first systems argues that autonomous agents will run end-to-end workflows, with humans “supervising, strategizing and acting as ethical stewards.” The architecture is plausible: agentic stacks, retrieval-augmented memory, and multimodal grounding already exist in prototype.

The labour market caveat

The World Economic Forum’s Future of Jobs 2025 projects 170 million new jobs and 92 million displaced by 2030, for a net gain of 78 million — but also warns that 59% of current workers will need reskilling. That tension fuels today’s “Jensen-vs-Dario” debate: Nvidia’s Jensen Huang insists “there will be more jobs,” while Anthropic’s Dario Amodei fears a white-collar bloodbath that could wipe out half of entry-level roles.

My take: both can be right. Technology will spawn new roles, but only if public- and private-sector reskilling keeps pace with task-level disruption. Without that, we risk a bifurcated workforce of AI super-users and those perpetually catching up.

4. Horizon #3 (≈ 2125): Symbiosis or Overreach?

You envision brain-computer interfaces (BCIs) and digital memory extensions leading to shared intelligence. The trajectory isn’t science fiction anymore:

Neuralink began human clinical trials in June 2025 and already has five paralyzed patients controlling devices by thought.

Scholarly work now focuses less on raw feasibility than on regulating autonomy, mental privacy and identity in next-generation BCIs. 

Where caution is warranted

Hardware failure rates, thread migration in neural tissue, and software-mediated hallucinations all remain unsolved. The moral of the story: physical symbiosis will arrive in layers — therapeutic first, augmentative later — and only under robust oversight.

5. Managing the Transition

6. Closing Thoughts

Josh, your optimism is infectious and, on balance, justified. My friendly amendments are less about dampening that optimism than grounding it in empirics:

Co-pilots already work — but require invisible plumbing and new managerial skills. Partners by 2050 are plausible, provided reskilling outpaces displacement. Symbiosis is a centuries-long marathon, and the ethical scaffolding must be built now.

If we treat literacy, upskilling and governance as first-class engineering problems — not afterthoughts — the future you describe can emerge by design rather than by accident. I look forward to your rebuttal over coffee, human or virtual.

Paginated Report Bear and ChatGPT o3

Why Notebook Snapshots in Microsoft Fabric Are a Debugging Gamechanger—No, Seriously!

If you’ve ever experienced the sheer agony of debugging notebooks—those chaotic, tangled webs of code, markdown, and occasional tears—you’re about to understand exactly why Notebook Snapshots in Microsoft Fabric aren’t just helpful, they’re borderline miraculous. Imagine the emotional rollercoaster of meticulously crafting a beautifully intricate notebook, only to watch it crumble into cryptic errors and obscure stack traces with no clear clue of what went wrong, when, or how. Sound familiar? Welcome to notebook life.

But fear not, weary debugger. Microsoft Fabric is finally here to rescue your productivity—and possibly your sanity—through the absolute genius of Notebook Snapshots.

Let’s Set the Scene: The Notebook Debugging Nightmare

To fully appreciate the brilliance behind Notebook Snapshots, let’s first vividly recall the horrors of debugging notebooks without them.

Step 1: You enthusiastically write and run a series of notebook cells. Everything looks fine—until, mysteriously, it doesn’t.

Step 2: A wild error appears! Frantically, you scroll back up, scratching your head and questioning your life choices. Was it Cell 17, or perhaps Cell 43? Who knows at this point?

Step 3: You begin the tiresome quest of restarting the kernel, selectively re-running cells, attempting to recreate that perfect storm of chaos that birthed the bug. Hours pass, frustration mounts, coffee runs out—disaster ensues.

Sound familiar? Of course it does, we’ve all been there.

Enter Notebook Snapshots: The Hero We Didn’t Know We Needed

Notebook Snapshots in Microsoft Fabric aren’t simply another fancy “nice-to-have” feature; they’re an absolute lifeline for notebook developers. Essentially, Notebook Snapshots capture a complete state of your notebook at a specific point in time—code, outputs, errors, and all. They let you replay and meticulously analyze each step, preserving context like never before.

Think of them as your notebook’s personal rewind button: a time-traveling companion ready to transport you back to that critical moment when everything broke, but your optimism was still intact.

But Why Exactly is This Such a Gamechanger?

Great question—let’s get granular.

1. Precise State Preservation: Say Goodbye to Guesswork

The magic of Notebook Snapshots is in their precision. No more wondering which cell went rogue. Snapshots save the exact state of your notebook’s cells and their outputs at the time of the run, along with the run’s execution details. This precision ensures that you can literally “rewind” and step through execution like you’re binging your favorite Netflix series. Missed something crucial? No worries, just rewind.

  • Benefit: You know exactly what the state was before disaster struck. Debugging transforms from vague guesswork to precise, surgical analysis. You’re no longer stumbling in the dark—you’re debugging in 4K clarity.

2. Faster Issue Replication: Less Coffee, More Debugging

Remember spending hours trying to reproduce obscure bugs that vanished into thin air the moment someone else was watching? Notebook Snapshots eliminate that drama. They capture the bug in action, making it infinitely easier to replicate, analyze, and ultimately squash.

  • Benefit: Debugging time shrinks dramatically. Your colleagues are impressed, your boss is delighted, and your coffee machine finally gets a break.

3. Collaboration Boost: Debug Together, Thrive Together

Notebook Snapshots enable teams to share exact notebook states effortlessly. Imagine sending your team a link that perfectly encapsulates your debugging context. No lengthy explanations needed, no screenshots required, and definitely no more awkward Slack messages like, “Ummm… it was working on my machine?”

  • Benefit: Everyone stays synchronized. Collective debugging becomes simple, fast, and—dare we say it—pleasant.

4. Historical Clarity: The Gift of Hindsight

Snapshots build a rich debugging history. You can examine multiple snapshots over time, comparing exactly how your notebook evolved and where problems emerged. You’re no longer relying on vague memory or frantic notebook archaeology.

  • Benefit: Clearer, smarter decision-making. You become a debugging detective with an archive of evidence at your fingertips.

5. Confidence Boosting: Fearless Experimentation

Knowing you have snapshots lets you innovate fearlessly. Go ahead—experiment wildly! Change parameters, test edge-cases, break things on purpose (just for fun)—because you can always rewind to a known-good state instantly.

  • Benefit: Debugging stops being intimidating. It becomes fun, bold, and explorative.

A Practical Example: Notebook Snapshots in Action

Imagine you’re exploring a complex data pipeline in a notebook:

  • You load and transform data.
  • You run a model.
  • Suddenly, disaster: a cryptic Python exception mocks you cruelly.

Normally, you’d have to painstakingly retrace your steps. With Microsoft Fabric Notebook Snapshots, the workflow is much simpler:

  • Instantly snapshot the notebook at the exact moment the error occurs.
  • Replay each cell execution leading to the error.
  • Examine exactly how data changed between steps—no guessing, just facts.
  • Swiftly isolate the issue, correct the bug, and move on with your life.

Just like that, you’ve gone from notebook-induced stress to complete debugging Zen.

A Bit of Sarcastic Humor for Good Measure

Honestly, if you’re still debugging notebooks without snapshots, it’s a bit like insisting on traveling by horse when teleportation exists. Sure, horses are charmingly nostalgic—but teleportation (aka Notebook Snapshots) is clearly superior, faster, and way less messy.

Or, put differently: debugging notebooks without snapshots in 2025 is like choosing VHS tapes over streaming. Sure, the retro vibes might be fun once—but let’s be honest, who wants to rewind tapes manually when you can simply click and replay?

Wrapping It All Up: Notebooks Just Got a Whole Lot Easier

In short, Notebook Snapshots in Microsoft Fabric aren’t merely a convenience—they fundamentally redefine how we approach notebook debugging. They shift the entire paradigm from guesswork and frustration toward clarity, precision, and confident experimentation.

Notebook developers everywhere can finally rejoice: your debugging nightmares are officially canceled.

Thanks, Microsoft Fabric—you’re genuinely a gamechanger.

This post was written with help from ChatGPT