Sparkwise: an “automated data engineering specialist” for Fabric Spark tuning

Spark tuning has a way of chewing up time: you start with something that “should be fine,” performance is off, costs creep up, and suddenly you’re deep in configs, Spark UI, and tribal knowledge trying to figure out what actually matters.

That’s why I’m excited to highlight sparkwise, an open-source Python package created by Santhosh Kumar Ravindran, one of my direct reports here at Microsoft. Santhosh built sparkwise to make Spark optimization in Microsoft Fabric less like folklore and more like a repeatable workflow: automated diagnostics, session profiling, and actionable recommendations to help teams drive better price-performance without turning every run into an investigation.

If you’ve ever thought, “I know something’s wrong, but I can’t quickly prove what to change,” sparkwise is aimed squarely at that gap. (PyPI)

As of January 5, 2026, the latest release is sparkwise 1.4.2 on PyPI. (PyPI)

The core idea: stop guessing, start diagnosing

Spark tuning often fails for two reasons:

Too many knobs (Spark, Delta, Fabric-specific settings, runtime behavior).
Not enough feedback (it’s hard to translate symptoms into the few changes that actually matter).

sparkwise attacks both.

It positions itself as an “automated Data Engineering specialist for Apache Spark on Microsoft Fabric,” offering:

Intelligent diagnostics
Configuration recommendations
Comprehensive session profiling
…so you can get to the best price/performance outcome without turning every notebook run into a science project. (PyPI)

Why sparkwise exists (and the problems it explicitly targets)

From the project description, sparkwise focuses on the stuff that reliably burns time and money in real Fabric Spark work:

Cost optimization: detect configurations that waste capacity and extend runtime (PyPI)
Performance optimization: validate and enable Fabric-specific acceleration paths like Native Engine and resource profiles (PyPI)
Faster iteration: detect Starter Pool blockers that force slower cold starts (3–5 minutes is called out directly) (PyPI)
Learning & clarity: interactive Q&A across 133 Spark/Delta/Fabric configurations (PyPI)
Workload understanding: profiling across sessions, executors, jobs, and resources (PyPI)
Decision support: priority-ranked recommendations with impact analysis (PyPI)

If you’ve ever thought “I know something is off, but I can’t prove which change matters,” this is aimed squarely at you.

What you get: a feature tour that maps to real-world Spark pain

sparkwise’s feature set is broad, but it’s not random. It clusters nicely into a few “jobs to be done.”

1) Automated diagnostics (the fast “what’s wrong?” pass)

The diagnostics layer checks a bunch of high-impact areas, including:

Native Execution Engine: verifies Velox usage and detects fallbacks to row-based processing (PyPI)
Spark compute: analyzes Starter vs Custom Pool usage and flags immutable configs (PyPI)
Data skew detection: identifies imbalanced task distributions (PyPI)
Delta optimizations: checks V-Order, deletion vectors, optimize write, auto compaction (PyPI)
Runtime tuning: validates AQE, partition sizing, scheduler mode (PyPI)

This is the stuff that tends to produce outsized wins when it’s wrong.

2) Comprehensive profiling (the “what actually happened?” pass)

Once you’re past basic correctness, the next level is: where did time and resources go?

sparkwise includes profiling across:

session metadata and resource allocation
executor status and memory utilization
job/stage/task metrics and bottleneck detection
resource efficiency scoring and utilization analysis (PyPI)

3) Advanced performance analysis (built on real metrics)

One of the most interesting “newer” directions in sparkwise is leaning into actual observed execution metrics:

“Real metrics collection” using Spark stage/task data (vs estimates) (PyPI)
scalability prediction comparing Starter vs Custom Pool with vCore-hour calculations (PyPI)
stage timeline visualization (parallel vs sequential patterns) (PyPI)
efficiency analysis that quantifies wasted compute in vCore-hours (PyPI)

That’s the bridge between “it feels slow” and “here’s the measurable waste + the fix.”

4) Advanced skew detection (because skew kills Spark)

Skew is one of those problems that can hide behind averages and ruin everything anyway.

sparkwise’s skew tooling includes:

straggler detection via task duration variance (PyPI)
partition-level analysis with statistical metrics (PyPI)
skewed join detection with mitigation suggestions (broadcast vs salting strategies) (PyPI)
automatic mitigation guidance with code examples (salting, AQE, broadcast) (PyPI)

5) SQL query plan analysis (spotting anti-patterns early)

For teams living in Spark SQL / DataFrames, this is huge:

anti-pattern detection (cartesian products, full scans, excessive shuffles) (PyPI)
Native Engine compatibility checks (PyPI)
Z-Order recommendations based on cardinality (PyPI)
caching opportunity detection for repeated scans/subqueries (PyPI)

6) Storage optimization suite (new in v1.4.0+)

This is one of the clearest “practical ops” expansions:

small file detection for Delta tables (default threshold is configurable; example shows <10MB) (PyPI)
VACUUM ROI calculator using OneLake pricing assumptions in the project docs (PyPI)
partition effectiveness analysis and over/under-partitioning detection (PyPI)
“run all storage checks in one command” workflows (PyPI)

In other words: not just “your table is messy,” but “here’s why it costs you, and what to do.”

7) Interactive configuration assistant (the “what does this do?” superpower)

This is deceptively valuable. sparkwise provides:

Q&A for 133 documented configurations spanning Spark, Delta, Fabric-specific settings (and Runtime 1.2 configs are called out) (PyPI)
context-aware guidance with workload-specific recommendations (PyPI)
explicit support for Fabric resource profiles (writeHeavy, readHeavyForSpark, readHeavyForPBI) (PyPI)
keyword search across config knowledge (PyPI)

This is the difference between “go read 9 docs” and “ask one question and move on.”

Quick start: the 3 fastest ways to get value

Install

pip install sparkwise

(PyPI)

1) Run a full diagnostic on your current session

from sparkwise import diagnose

diagnose.analyze()

(PyPI)

2) Ask about a specific Spark/Fabric config

from sparkwise import ask

ask.config("spark.native.enabled")
ask.search("optimize")

(PyPI)

3) Profile your run (and pinpoint bottlenecks)

from sparkwise import (
    profile, profile_executors, profile_jobs, profile_resources,
    predict_scalability, show_timeline, analyze_efficiency
)

profile()
profile_executors()
profile_jobs()
profile_resources()

predict_scalability()
show_timeline()
analyze_efficiency()

(PyPI)

CLI workflows (especially useful for storage optimization)

If you prefer CLIs (or want repeatable checks in scripts), sparkwise includes commands like:

sparkwise storage analyze Tables/mytable
sparkwise storage small-files Tables/mytable --threshold 10
sparkwise storage vacuum-roi Tables/mytable --retention-hours 168
sparkwise storage partitions Tables/mytable

(PyPI)

That’s a clean “ops loop” for keeping Delta tables healthy.

A realistic “first hour” workflow I’d recommend

If you’re trying sparkwise on a real Fabric notebook today, here’s a practical order of operations:

Run diagnose.analyze() first
Use it as your “triage” to catch the high-impact misconfigs (Native Engine fallback, AQE off, Starter Pool blockers). (PyPI)
Use ask.config() for any red/yellow item you don’t fully understand
The point is speed: read the explanation in context and decide. (PyPI)
Profile the session
If the job is still slow/expensive after obvious fixes, profile and look for the real culprit: skew, shuffle pressure, poor parallelism, memory pressure. (PyPI)
If the job smells like skew, use advanced skew detection
Especially for joins and wide aggregations. (PyPI)
If your tables are growing, run storage analysis early
Small files and weak partitioning quietly tax everything downstream. (PyPI)

That flow is how you turn “tuning” from an art project into a checklist.

Closing: why this matters for Fabric teams

I’m amplifying sparkwise because it’s the kind of contribution that scales beyond the person who wrote it. Santhosh took hard-earned, real-world Fabric Spark tuning experience and turned it into something other engineers can use immediately — a practical way to spot waste, unblock faster iteration, and make smarter performance tradeoffs.

If your team runs Fabric Spark workloads regularly, treat sparkwise like a lightweight tuning partner:

install it,
run the diagnostics,
act on one recommendation,
measure the improvement,
repeat.

And if you end up with feedback or feature ideas, even better — that’s how tools like this get sharper and more broadly useful.

This post was written with help from ChatGPT 5.2

Gil Gerard, Buck Rogers, and the Kind of Grief That Shows Up in December

There are celebrity deaths that feel distant in a way that’s hard to explain without sounding cold. You see the headline, you register it, you think that’s sad, and then the day keeps moving. The world has trained us to process loss at scroll-speed.

But every once in a while, one lands different. It doesn’t feel like news. It feels like someone quietly turned a key inside you and opened a door you forgot existed.

Gil Gerard died, and I wasn’t prepared for how hard it hit.

Maybe that sounds ridiculous to people who didn’t grow up with Buck Rogers in the 25th Century in their bloodstream. Maybe it sounds like nostalgia doing what nostalgia does. But this wasn’t just “an actor I liked.” This was a particular piece of childhood—one of those warm, bright anchors—suddenly becoming something you can only visit, not live alongside.

And it’s December, which makes everything heavier.

The holidays have a way of putting your life on a loop. The same music. The same lights. The same half-remembered rituals you didn’t realize you’d been collecting for decades. This time of year doesn’t just bring memories back; it drags them in by the collar and sets them down in front of you like, Look. Pay attention.

So when I saw the news, it didn’t feel like losing a celebrity.

It felt like losing a doorway.

Buck Rogers wasn’t a show I watched. It was a place I went.

Some shows are entertainment. Some are comfort. And some become the background radiation of your childhood — you don’t even remember the first time you saw them, because they feel like they were always there.

That’s what Buck Rogers was for me.

It was shiny, goofy, sincere, and somehow confident enough to be all three without apologizing. It was the future as imagined by a world that still believed the future could be fun. It had that late-70s/early-80s optimism baked into the sets and the pacing — like even the danger had a little wink in it.

And in the middle of all of that was Gil Gerard.

His Buck wasn’t “perfect hero” energy. He was cocky in a way that felt survivable. He was charming without being smug. He had that specific kind of grin that said: Yeah, this is insane — but we’re gonna be fine. As a kid, that matters more than you realize. A character like that doesn’t just entertain you; he teaches your nervous system what “okay” can feel like.

When you grow up, you start to understand why you clung to that.

Princess Ardala, obviously

And yes — Princess Ardala.

I’ve written about my love for her plenty, and I’m not stopping now. Ardala wasn’t just a villain. She was glamour with teeth. She was command presence and mishelpful desire and that intoxicating confidence that makes you root for someone even when you know better.

She was also part of why the show stuck in my brain the way it did. Ardala made Buck Rogers feel like it had adult electricity under the hood — like it understood that charm and danger can share the same room.

But here’s the thing I don’t think I appreciated until now: Ardala worked because Buck worked.

You need the center to make the orbit matter. You need someone steady enough to make the outrageous feel real. Gil Gerard was that steady. He didn’t overplay it. He didn’t flinch from the camp. He just stood there in the middle of it — smirking, sincere, game for the ride — and that’s what made the whole thing click.

So when he goes, it isn’t just “Buck is gone.” It’s like the whole little universe loses its gravity.

Why it hurts more in the holidays

Because December is already full of ghosts.

It’s the month where you catch yourself standing in a familiar room and realizing time has been moving faster than you’ve wanted to admit. It’s the month where you see an ornament and suddenly remember a person’s laugh. It’s the month where a song can knock the wind out of you in a grocery store aisle.

Holiday nostalgia is sneaky. It doesn’t feel like sadness until it does.

And Gil Gerard’s death—right now, right in the middle of the season that already has you looking backward—feels like a confirmation of something you spend most of the year successfully ignoring:

That childhood is not a place you can go back to. It’s a place you carry. And sometimes, someone you associated with that place disappears, and the weight of it finally shows up.

Not because you knew him.

Because you knew you, back then.

And you miss that kid more than you expected.

What I’m doing with it

I’m not trying to turn this into a big philosophical thing. I’m just being honest about the shape of the grief.

It’s not the grief of losing a family member. It’s not the grief of losing a friend. It’s its own strange category: the grief of realizing another thread connecting you to your early life has been cut.

So I’m going to do the only thing that makes sense.

I’m going to watch an episode.

Not in the “content consumption” way. In the ritual way. The way you replay something not because it’s new, but because it reminds you that you’ve been here before — you’ve felt wonder before, you’ve felt comfort before, you’ve felt the world get a little lighter for an hour before.

I’ll let the show be what it always was: a bright, weird little pocket of imagination that helped shape me.

And I’ll feel the sting of knowing that time only moves one direction.

Rest in peace, Gil Gerard.

Thanks for being a part of the version of the world where the future felt fun — and where I did, too.

This post was written with assistance from ChatGPT 5.2

Build Your Own Spark Job Doctor in Microsoft Fabric

Microsoft Fabric makes it incredibly easy to spin up Spark workloads: notebooks, Lakehouse pipelines, dataflows, SQL + Spark hybrid architectures—the whole buffet.

What’s still hard?
Knowing why a given Spark job is slow, expensive, or flaky.

A Lakehouse pipeline starts timing out.
A notebook that used to finish in 5 minutes is now taking 25.
Costs spike because one model training job is shuffling half the lake.

You open the Spark UI, click around a few stages, stare at shuffle graphs, and say the traditional words of Spark debugging:

“Huh.”

This is where an AI assistant should exist.

In this post, we’ll walk through how to build exactly that for Fabric Spark: a Job Doctor that:

Reads Spark telemetry from your Fabric environment
Detects issues like skew, large shuffles, spill, and bad configuration
Uses a large language model (LLM) to explain what went wrong
Produces copy-pasteable fixes in Fabric notebooks / pipelines
Runs inside Fabric using Lakehouses, notebooks, and Azure AI models

This is not a fake product announcement. This is a blueprint you can actually build.

What Is the Fabric “Job Doctor”?

At a high level, the Job Doctor is:

A Fabric-native analytics + AI layer that continuously reads Spark job history, detects common performance anti-patterns, and generates human-readable, prescriptive recommendations.

Concretely, it does three main things:

Collects Spark job telemetry from Fabric
- Spark application metrics (tasks, stages, shuffles, spills)
- Spark logs & events (Driver/Executor/Event logs)
- Optional query plans
- Spark session configs
Analyzes jobs using rules + metrics
- Identifies skew, large shuffles, spill, etc.
- Scores each job run and surfaces the top issues.
Uses an LLM to generate a “diagnosis sheet”
- Root cause in plain English
- Fixes with code + config snippets for Fabric Spark
- Expected impact on performance/cost

Let’s build it step by step, Fabric-style.

Part 1: Getting Spark Telemetry Out of Fabric

Before you can diagnose anything, you need the raw signals. In Fabric, there are three main ways to see what Spark is doing:

Fabric Apache Spark diagnostic emitter → logs/metrics for each application
Spark application details (UI / REST)
In-job logging from notebooks/pipelines (e.g., configs, query plans)

You don’t have to use all three, but combining them gives you enough for a very capable Job Doctor.

1. Configure the Fabric Apache Spark Diagnostic Emitter

The core telemetry pipeline starts with the Fabric Apache Spark diagnostic emitter, configured on a Fabric environment.

At a high level, you:

Create or use an environment for your Spark workloads.
Configure one or more diagnostic emitters on that environment.
Point each emitter to a sink such as:
- Azure Storage (Blob, ADLS)
- Azure Log Analytics
- Azure Event Hubs

For example, an emitter to Azure Storage might be configured (conceptually) like this:

spark.synapse.diagnostic.emitters: MyStorageEmitter
spark.synapse.diagnostic.emitter.MyStorageEmitter.type: AzureStorage
spark.synapse.diagnostic.emitter.MyStorageEmitter.categories: DriverLog,ExecutorLog,EventLog,Metrics
spark.synapse.diagnostic.emitter.MyStorageEmitter.uri: https://<account>.blob.core.windows.net/<container>/<folder>
spark.synapse.diagnostic.emitter.MyStorageEmitter.auth: AccessKey
spark.synapse.diagnostic.emitter.MyStorageEmitter.secret: <storage-access-key>

Once this is in place:

Every Spark application (notebook, job, pipeline activity that spins up Spark) will emit diagnostic records.
Those records land as JSON lines describing driver logs, executor logs, Spark listener events, and metrics.

From there, you can:

If using Storage: Create a shortcut in a Lakehouse pointing at the container/folder.
If using Log Analytics: Build KQL queries or export into Fabric (e.g., into a KQL DB or as files you later hydrate into a Lakehouse).

We’ll assume the storage pattern for the rest of this post:

Spark app → Fabric environment with diagnostic emitter → Azure Storage → OneLake shortcut → Lakehouse.

2. Shape of the Raw Logs (and Why You’ll Normalize Them)

The emitter doesn’t give you a nice stageId / taskId table out of the box. Instead, you’ll see records like:

{
  "timestamp": "2024-05-01T12:34:56Z",
  "category": "Metrics",
  "applicationId": "app-20240501123456-0001",
  "properties": {
    "metricName": "executorRunTime",
    "stageId": 4,
    "taskId": 123,
    "value": 9182,
    "otherFields": "..."
  }
}

Or an EventLog record with a payload that looks like the Spark listener event.

To build a Job Doctor, you’ll:

Read the JSON lines into Fabric Spark
Explode / parse the properties payload
Aggregate per-task metrics into per-stage metrics for each application

We’ll skip the exact parsing details (they depend on how you set up the emitter and which events/metrics you enable) and assume that after a normalization job, you have a table with one row per (applicationId, stageId, taskId).

That’s what the next sections use.

3. Capturing Query Plans in Fabric (Optional, but Powerful)

Spark query plans are gold when you’re trying to answer why a stage created a huge shuffle or why a broadcast join didn’t happen.

There isn’t yet a first-class “export query plan as JSON” API in PySpark, but in Fabric notebooks you can use a (semi-internal) trick that works today:

import json

df = ...  # some DataFrame you care about

# Advanced / internal: works today but isn't a public, stable API
plan_json = json.loads(df._jdf.queryExecution().toJSON())

You can also log the human-readable plan:

df.explain(mode="formatted")  # documented mode, prints a detailed plan

To persist the JSON plan for the Job Doctor, tie it to the Spark application ID:

from pyspark.sql import Row

app_id = spark.sparkContext.applicationId

spark.createDataFrame(
    [Row(applicationId=app_id, query_plan_json=plan_json)]
).write.mode("append").saveAsTable("job_doctor.query_plans")

A couple of caveats you should mention in a real blog:

_jdf.queryExecution().toJSON() is not guaranteed to be stable across Spark versions. It’s an advanced, “use at your own risk” trick.
You don’t need to capture plans for every single query—just key bottleneck notebooks or critical pipelines.

Even capturing a subset massively improves the quality of LLM explanations.

4. Capture Spark Config for Each Run

Fabric Spark lets you set configs at:

Environment / pool level (resource profiles, environment settings)
Notebook / job level (spark.conf.set(...))
Pipeline activity level (Spark job settings)

Inside the running Spark job, you can capture the effective session config like this:

from pyspark.sql import Row

app_id = spark.sparkContext.applicationId
conf_dict = dict(spark.conf.getAll())  # session-level config

config_rows = [
    Row(applicationId=app_id, key=k, value=v)
    for k, v in conf_dict.items()
]

spark.createDataFrame(config_rows).write.mode("append").saveAsTable("job_doctor.spark_conf")

Now the Job Doctor can say things like:

“AQE was disabled for this job.”
“Shuffle partitions was left at default 200, which is low for your data size.”

You’re building a small “Job Doctor mart” inside Fabric:

job_doctor.raw_logs (from emitter)
job_doctor.stage_metrics (aggregated)
job_doctor.stage_issues (rule engine output)
job_doctor.spark_conf (per-application configs)
job_doctor.query_plans (optional)

All keyed by applicationId.

Part 2: Loading and Normalizing Spark Logs in a Fabric Lakehouse

Let’s assume you’ve done one-time wiring:

Azure Storage container with Spark diagnostics
OneLake shortcut from that container into a Lakehouse
A Fabric Spark notebook attached to that Lakehouse

From that notebook:

logs_df = spark.read.json("Tables/spark_diagnostics_raw")  # or your shortcut path
display(logs_df.limit(10))

You’ll see something like:

timestamp
category (DriverLog, ExecutorLog, EventLog, Metrics, …)
applicationId
properties (nested JSON with stage/task/metric detail)

The normalization step (which you can run as a scheduled pipeline) should:

Filter down to metrics/events relevant for performance (e.g. task / stage metrics)
Extract stageId, taskId, executorRunTime, shuffleReadBytes, etc., into top-level columns
Persist the result as job_doctor.task_metrics (or similar)

For the rest of this post, we’ll assume you’ve already done that and have a table with columns:

applicationId
stageId
taskId
executorRunTime
shuffleReadBytes
shuffleWriteBytes
memoryBytesSpilled
diskBytesSpilled

Aggregating Stage Metrics in Fabric

Now we want to collapse per-task metrics into per-stage metrics per application.

In a Fabric notebook:

from pyspark.sql import functions as F

task_metrics = spark.table("job_doctor.task_metrics")

stage_metrics = (
    task_metrics
    .groupBy("applicationId", "stageId")
    .agg(
        F.countDistinct("taskId").alias("num_tasks"),
        F.sum("executorRunTime").alias("total_task_runtime_ms"),
        # Depending on Spark version, you may need percentile_approx instead
        F.expr("percentile(executorRunTime, 0.95)").alias("p95_task_runtime_ms"),
        F.max("executorRunTime").alias("max_task_runtime_ms"),
        F.sum("shuffleReadBytes").alias("shuffle_read_bytes"),
        F.sum("shuffleWriteBytes").alias("shuffle_write_bytes"),
        F.sum("memoryBytesSpilled").alias("memory_spill_bytes"),
        F.sum("diskBytesSpilled").alias("disk_spill_bytes"),
    )
    .withColumn(
        "skew_ratio",
        F.col("max_task_runtime_ms") /
        F.when(F.col("p95_task_runtime_ms") == 0, 1).otherwise(F.col("p95_task_runtime_ms"))
    )
    .withColumn("shuffle_read_mb", F.col("shuffle_read_bytes") / (1024**2))
    .withColumn("shuffle_write_mb", F.col("shuffle_write_bytes") / (1024**2))
    .withColumn(
        "spill_mb",
        (F.col("memory_spill_bytes") + F.col("disk_spill_bytes")) / (1024**2)
    )
)

stage_metrics.write.mode("overwrite").saveAsTable("job_doctor.stage_metrics")

This gives you a Fabric Lakehouse table with:

skew_ratio
shuffle_read_mb
shuffle_write_mb
spill_mb
p95_task_runtime_ms
num_tasks, total_task_runtime_ms, etc.

You can run this notebook:

On a schedule via a Data Pipeline
Or as a Data Engineering job configured in the workspace

Part 3: Adding a Rule Engine Inside Fabric

Now that the metrics are in a Lakehouse table, let’s add a simple rule engine in Python.

This will run in a Fabric notebook (or job) and write out issues per stage.

from pyspark.sql import Row, functions as F

stage_metrics = spark.table("job_doctor.stage_metrics")

# For simplicity, we'll collect to the driver here.
# This is fine if you don't have thousands of stages.
# For very large workloads, you'd instead do this via a UDF / mapInPandas / explode.
stage_rows = stage_metrics.collect()

Define some basic rules:

def detect_issues(stage_row):
    issues = []

    # 1. Skew detection
    if stage_row.skew_ratio and stage_row.skew_ratio > 5:
        issues.append({
            "issue_id": "SKEWED_STAGE",
            "severity": "High",
            "details": f"Skew ratio {stage_row.skew_ratio:.1f}"
        })

    # 2. Large shuffle
    total_shuffle_mb = (stage_row.shuffle_read_mb or 0) + (stage_row.shuffle_write_mb or 0)
    if total_shuffle_mb > 10_000:  # > 10 GB
        issues.append({
            "issue_id": "LARGE_SHUFFLE",
            "severity": "High",
            "details": f"Total shuffle {total_shuffle_mb:.1f} MB"
        })

    # 3. Excessive spill
    if (stage_row.spill_mb or 0) > 1_000:  # > 1 GB
        issues.append({
            "issue_id": "EXCESSIVE_SPILL",
            "severity": "Medium",
            "details": f"Spill {stage_row.spill_mb:.1f} MB"
        })

    return issues

Apply the rules and persist the output:

issue_rows = []

for r in stage_rows:
    for issue in detect_issues(r):
        issue_rows.append(Row(
            applicationId=r.applicationId,
            stageId=r.stageId,
            issue_id=issue["issue_id"],
            severity=issue["severity"],
            details=issue["details"]
        ))

issues_df = spark.createDataFrame(issue_rows)

issues_df.write.mode("overwrite").saveAsTable("job_doctor.stage_issues")

Now you have a table of Spark issues detected per run inside your Lakehouse.

Later, the LLM will use these as structured hints.

Part 4: Bringing in the LLM — Turning Metrics into Diagnosis

So far, everything has been pure Spark in Fabric.

Now we want a model (e.g., Azure AI “Models as a Service” endpoint or Azure OpenAI) to turn:

job_doctor.stage_metrics
job_doctor.stage_issues
job_doctor.spark_conf
job_doctor.query_plans

into an actual diagnosis sheet a human can act on.

In Fabric, this is simplest from a Spark notebook using a Python HTTP client.

Below, I’ll show the pattern using an Azure AI serverless model endpoint (the one that uses model: "gpt-4.1" in the body).

1. Prepare the Prompt Payload

First, fetch the data for a single Spark application:

import json
from pyspark.sql import functions as F

app_id = "app-20240501123456-0001"  # however you pick which run to diagnose

stages_df = spark.table("job_doctor.stage_metrics").where(F.col("applicationId") == app_id)
issues_df = spark.table("job_doctor.stage_issues").where(F.col("applicationId") == app_id)
conf_df   = spark.table("job_doctor.spark_conf").where(F.col("applicationId") == app_id)
plans_df  = spark.table("job_doctor.query_plans").where(F.col("applicationId") == app_id)

stages_json = stages_df.toPandas().to_dict(orient="records")
issues_json = issues_df.toPandas().to_dict(orient="records")
conf_json   = conf_df.toPandas().to_dict(orient="records")
plans_json  = plans_df.toPandas().to_dict(orient="records")  # likely 0 or 1 row

Then build a compact but informative prompt:

prompt = f"""
You are an expert in optimizing Apache Spark jobs running on Microsoft Fabric.

Here is summarized telemetry for one Spark application (applicationId={app_id}):

Stage metrics (JSON):
{json.dumps(stages_json, indent=2)}

Detected issues (JSON):
{json.dumps(issues_json, indent=2)}

Spark configuration (key/value list):
{json.dumps(conf_json, indent=2)}

Query plans (optional, may be empty):
{json.dumps(plans_json, indent=2)}

Your tasks:
1. Identify the top 3–5 performance issues for this run.
2. For each, explain the root cause in plain language.
3. Provide concrete fixes tailored for Fabric Spark, including:
   - spark.conf settings (for notebooks/jobs)
   - suggestions for pipeline settings where relevant
   - SQL/DataFrame code snippets
4. Estimate likely performance impact (e.g., "30–50% reduction in runtime").
5. Call out any risky or unsafe changes that should be tested carefully.

Return your answer as markdown.
"""

2. Call an Azure AI Model from Fabric Spark

For the serverless “Models as a Service” endpoint, the pattern looks like this:

import os
import requests

# Example: using Azure AI Models as a Service
# AZURE_AI_ENDPOINT might look like: https://models.inference.ai.azure.com
AZURE_AI_ENDPOINT = os.environ["AZURE_AI_ENDPOINT"]
AZURE_AI_KEY      = os.environ["AZURE_AI_KEY"]

MODEL = "gpt-4.1"  # or whatever model you've enabled

headers = {
    "Content-Type": "application/json",
    "api-key": AZURE_AI_KEY,
}

body = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant for optimizing Spark jobs on Microsoft Fabric."},
        {"role": "user", "content": prompt},
    ],
}

resp = requests.post(
    f"{AZURE_AI_ENDPOINT}/openai/chat/completions",
    headers=headers,
    json=body,
)

resp.raise_for_status()
diagnosis = resp.json()["choices"][0]["message"]["content"]

If you instead use a provisioned Azure OpenAI resource, the URL shape is slightly different (you call /openai/deployments/<deploymentName>/chat/completions and omit the model field), but the rest of the logic is identical.

At this point, diagnosis is markdown you can:

Render inline in the notebook with displayHTML
Save into a Lakehouse table
Feed into a Fabric semantic model for reporting

Part 5: What the Job Doctor’s Output Looks Like in Fabric

A good Job Doctor output for Fabric Spark might look like this (simplified):

🔎 Issue 1: Skewed Stage 4 (skew ratio 12.3)

What I see

Stage 4 has a skew ratio of 12.3 (max task runtime vs. p95).
This stage also reads ~18.2 GB via shuffle, which amplifies the imbalance.

Likely root cause

A join or aggregation keyed on a column where a few values dominate (e.g. a “default” ID, nulls, or a small set of hot keys). One partition ends up doing far more work than the others.

Fabric-specific fixes

In your notebook or job settings, enable Adaptive Query Execution and skew join handling:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

If the query is in SQL (Lakehouse SQL endpoint), enable AQE at the session/job level through Spark configuration.

If one side of the join is a small dimension table, add a broadcast hint:

SELECT /*+ BROADCAST(dim) */ f.*
FROM fact f
JOIN dim
  ON f.key = dim.key;

Estimated impact:
30–50% reduction in total job runtime, depending on how skewed the key distribution is.

📦 Issue 2: Large Shuffle in Stage 2 (~19.7 GB)

What I see

Stage 2 reads ~19.7 GB via shuffle.
Shuffle partitions are set to 200 (Spark default).

Likely root cause

A join or aggregation is shuffling nearly the full dataset, but parallelism is low given the data volume. That leads to heavy tasks and increased risk of spill.

Fabric-specific fixes

Increase shuffle partitions for this job:

spark.conf.set("spark.sql.shuffle.partitions", "400")

For pipelines, set this at the Spark activity level under Spark configuration, or through your Fabric environment’s resource profile if you want a new default.

Also consider partitioning by the join key earlier in the pipeline:

df = df.repartition("customer_id")

Estimated impact:
More stable runtimes and reduced likelihood of spill; wall-clock improvements if your underlying capacity has enough cores.

💾 Issue 3: Spill to Disk (~1.8 GB) in Stage 3

What I see

Stage 3 spills ~1.8 GB to disk.
This correlates with under-parallelism or memory pressure.

Fabric-specific fixes

Adjust cluster sizing via Fabric capacity / resource profiles (enough cores + memory per core).
Increase spark.sql.shuffle.partitions as above.
Avoid wide transformations producing huge intermediate rows early in the job; materialize smaller, more selective intermediates first.

You can persist the diagnosis text into a table:

from pyspark.sql import Row

spark.createDataFrame(
    [Row(applicationId=app_id, diagnosis_markdown=diagnosis)]
).write.mode("append").saveAsTable("job_doctor.diagnoses")

Then you can build a Power BI report in Fabric bound to:

job_doctor.diagnoses
job_doctor.stage_metrics
job_doctor.stage_issues

to create a “Spark Job Health” dashboard where:

Rows = recent Spark runs
Columns = severity, duration, shuffle size, spill, etc.
A click opens the AI-generated diagnosis for that run

All inside the same workspace.

Part 6: Stitching It All Together in Fabric

Let’s recap the full Fabric-native architecture.

1. Telemetry Ingestion (Environment / Emitter)

Configure a Fabric environment for your Spark workloads.
Add a Fabric Apache Spark diagnostic emitter to send logs/metrics to:
- Azure Storage (for Lakehouse shortcuts), or
- Log Analytics / Event Hubs if you prefer KQL or streaming paths.
(Optional) From notebooks/pipelines, capture:
- Spark configs → job_doctor.spark_conf
- Query plans → job_doctor.query_plans

2. Normalization Job (Spark / Data Pipeline)

Read raw diagnostics from Storage via a Lakehouse shortcut.
Parse and flatten the records into per-task metrics.
Aggregate per-stage metrics → job_doctor.stage_metrics.
Evaluate rule engine → job_doctor.stage_issues.
Persist all of this into Lakehouse tables.

3. AI Diagnosis Job (Spark + Azure AI Models)

For each new (or most expensive / slowest) application:
- Pull stage metrics, issues, configs, and query plans from Lakehouse.
- Construct a structured prompt.
- Call your Azure AI / Azure OpenAI endpoint from a Fabric Spark notebook.
- Store the markdown diagnosis in job_doctor.diagnoses.

4. User Experience

Fabric Notebook
- A “Run Job Doctor” cell or button that takes applicationId, calls the model, and displays the markdown inline.
Data Pipeline / Job
- Scheduled daily to scan all runs from yesterday and generate diagnoses automatically.
Power BI Report in Fabric
- “Spark Job Health” dashboard showing:
  - Top slowest/most expensive jobs
  - Detected issues (skew, large shuffle, spill, config problems)
  - AI recommendations, side-by-side with raw metrics

Everything lives in one Fabric workspace, using:

Lakehouses for data
Spark notebooks / pipelines for processing
Azure AI models for reasoning
Power BI for visualization

Why a Fabric-Specific Job Doctor Is Worth Building

Spark is Spark, but in Fabric the story is different:

Spark jobs are tied closely to Lakehouses, Pipelines, Dataflows, and Power BI.
You already have a single control plane for capacity, governance, cost, and monitoring.
Logs, metrics, and reports can live right next to the workloads they describe.

That makes Fabric an ideal home for a Job Doctor:

No extra infrastructure to stand up
No random side services to glue together
The telemetry you need is already flowing; you just have to catch and shape it
AI can sit directly on top of your Lakehouse + monitoring data

With some Spark, a few Lakehouse tables, and an LLM, you can give every data engineer and analyst in your organization a “Spark performance expert” that’s always on call.

I’ve included a sample notebook you can use to get started on your Job Doctor today!

This post was created with help from (and suggested to me) by ChatGPT Pro using the 5.1 Thinking Model

Passion at Work Is a Double-Edged Sword (And How to Hold It by the Handle, Not the Blade)

We’re told to “follow your passion” like it’s a career cheat code.

Love what you do and you’ll never work a day in your life.

Find your calling.

Do what you’d do for free.

It sounds inspiring. And sometimes, it is true: passion can make work feel meaningful, energizing, and deeply satisfying.

But there’s a shadow side that doesn’t get talked about enough.

Passion at work is a double-edged sword. Held correctly, it can cut through apathy, fear, and mediocrity. Held wrong, it cuts you—your health, your relationships, your boundaries, and even your performance.

This isn’t a call to care less. It’s a call to care wiser.

The Bright Edge: Why Passion Is Powerful

Let’s start with the good news: passion is not the enemy.

1. Passion keeps you going when things are hard

When you actually care about what you’re building, you can push through the boring parts: the documentation, the messy legacy systems, the political nonsense. Passion creates stamina. It’s why some people can do deep work for hours and others are clock-watching at 2:17 p.m.

2. Passion improves the quality of your work

When you’re invested, you notice details other people miss. You think more about edge cases, customer impact, long-term consequences. Passion often shows up as craftsmanship: “this isn’t just done, it’s done right.”

3. Passion makes you more resilient to setbacks

Passionate people bounce back faster from failure. A bad launch, a tough review, a missed promotion hurts—but if you care about the mission, it’s easier to treat it as a data point instead of a verdict on your worth.

4. Passion is contagious

When someone genuinely cares, people feel it. It can pull a team forward. Customers trust you more. Leaders notice your ownership. Passion, when grounded, is a quiet magnet.

All of that is real.

And yet.

The Dark Edge: When Passion Starts Cutting You

Passion becomes dangerous when it slips from “I care a lot” into “I am my work.”

Here’s how that shows up.

1. Your identity fuses with your job

If you’re passionate, it’s easy to start thinking:

“If this project fails, I am a failure.” “If my manager is unhappy, I am not good enough.” “If this company doesn’t appreciate me, maybe I’m not valuable.”

Passion can blur the line between what you do and who you are. Then criticism isn’t feedback on work; it’s an attack on your identity. That’s emotionally exhausting and makes you defensive instead of curious.

2. You become easy to exploit

Harsh truth: workplaces love passionate people—sometimes for the wrong reasons.

If you’re the “I’ll do whatever it takes” person:

You get the late-night emergencies. You pick up slack from weaker teammates. You “volunteer” for stretch work no one else wants. You feel guilty saying no because “this matters.”

The line between commitment and self-betrayal gets blurry. Passion, unmanaged, can turn you into free overtime wrapped in a nice attitude.

3. Burnout hides in plain sight

Passion can mask burnout for a long time because you like the work. You tell yourself:

“I’m just busy right now.” “It’ll calm down after this release / quarter / crisis.” “I don’t need a break; I just need to be more efficient.”

Meanwhile, the signals are there:

You’re always tired, even after weekends. Small setbacks feel like huge emotional blows. You resent people who seem more “chill.” You’re working more but enjoying it less.

By the time you admit you’re burned out, you’re far past the “fix it with a vacation” stage.

4. Passion narrows your vision

When you really care about a project or idea, you can get tunnel vision:

You dismiss risks because “we’ll figure it out.” You take feedback as an attack, not input. You see other teams as blockers, not partners. You overestimate how much others care about your problem.

Passion can make you worse at strategy if it stops you from seeing tradeoffs clearly. Being too attached to a specific solution can blind you to better ones.

5. Emotional volatility becomes the norm

The more passionate you are, the bigger the emotional swings:

Feature shipped? You’re high for a week. Leadership cancels it? You’re crushed for a month. Good performance review? You’re invincible. Reorg? You’re spiraling.

Your nervous system never stabilizes. Work becomes a rollercoaster controlled by people who don’t live inside your head.

The Subtle Trap: Passion as Justification

One of the most dangerous patterns is this:

“I’m exhausted, anxious, and on edge—but that’s the price of caring.”

No. That’s not the price of caring. That’s the price of caring without boundaries.

Passion is not supposed to destroy your sleep, wreck your relationships, or make you hate yourself when something slips. That’s not noble. That’s mismanagement.

You wouldn’t let a junior teammate run production unmonitored with no guardrails. But most passionate people let their emotions do exactly that.

Holding the Sword by the Handle: Healthier Ways to Be Passionate

So what does healthy passion at work look like?

It’s not about caring less. It’s about caring in a way that doesn’t consume you.

Here are some practical shifts.

1. Separate “me” from “my output”

Mentally, you want this frame:

“This work matters to me.” “I’m proud of the effort, decisions, and integrity I bring.” “The outcome is influenced by many factors, some outside my control.”

You can care deeply about quality and impact while still treating outcomes as feedback, not final judgment.

A useful self-check:

“If this project got canceled tomorrow, would I still believe I’m capable and valuable?”

If the honest answer is no, your identity is too fused to the work.

2. Define your own success metrics

When you’re passionate, it’s easy to adopt everyone else’s scoreboard: exec praise, promotion velocity, launch glamour.

Build a second scoreboard that’s yours:

Did I learn something hard this month? Did I push for a decision that needed to be made? Did I support my team in a way I’m proud of? Did I hold a boundary that protected my health?

Those are wins too. They just don’t show up on the OKR dashboard.

3. Make a “portfolio of meaning”

If work is your only source of meaning, every wobble at work feels like an earthquake.

Create a portfolio:

Relationships (family, partners, close friends) Health (sleep, movement, mental hygiene) Personal interests (hobbies, side projects, learning) Contribution outside work (mentoring, community, parenting, etc.)

Passion at work is safest when it’s one important part of your life, not the entire scaffolding holding your self-worth up.

4. Put boundaries on the calendar, not in your head

“I should have better boundaries” is useless if your calendar is a disaster.

Concrete examples:

Block “no meeting” focus time and defend it. Choose 1–2 late nights a week max and keep the rest sacred. Decide in advance when you’ll check email/Slack after hours (if at all). Put workouts, therapy, or walks in your calendar as real appointments.

If it doesn’t exist in time and space, it’s just a wish.

5. Watch your internal narrative

Passion often comes with spicy self-talk:

“If I don’t fix this, everything will fall apart.” “They have no idea how much I’m carrying.” “I can’t slow down; people are counting on me.”

Sometimes that’s true. A lot of times, it’s your brain cosplaying as the lone hero.

Try swapping narratives:

From “I’m the only one who cares” → to “I care a lot, and it’s my job to bring others along, not martyr myself.” From “If I don’t say yes, I’m letting the team down” → to “If I say yes to everything, I’m guaranteeing lower quality for everyone.”

6. Be transparent with your manager (to a point)

You don’t need to pour your entire soul out, but you can say:

“I care a lot about this space and tend to over-extend. I want to stay sustainable. Can we align on where you most want me to go above and beyond, and where ‘good enough’ is genuinely good enough?” “Here’s what I’m currently carrying. If we add X, what do you want me to drop or downgrade?”

Good managers want passionate people to last. If your manager doesn’t… that’s useful information about whether this is the right place to invest your energy.

7. Build a small “reality check” circle

Have 1–3 people who know you well and can tell when your passion is tipping into self-harm. Give them permission to say:

“You’re over-owning this. This isn’t all on you.” “You’re talking like the job is your entire worth.” “You haven’t talked about anything but work in weeks. What’s going on?”

Passion distorts perspective from the inside. You need outside eyes.

The Goal Isn’t to Be Less Passionate

The real goal is:

Strong passion. Clear boundaries. Flexible identity.

You’re allowed to care deeply and still:

Log off. Say no. Change teams or companies. Admit you’re tired. Choose yourself over “the mission” sometimes.

You do your best work when you’re engaged, not when you’re depleted. Passion is fuel, not proof of loyalty.

So don’t dull the sword.

Just learn to hold it by the handle.

This post was written with help from ChatGPT 5.1

Time to Automate: Why Sports Card Grading Needs an AI Revolution

As I head to the National for the first time, this is a topic I have been thinking about for quite some time, and a recent video inspired me to put this together with help from ChatGPT’s o3 model doing deep research. Enjoy!

Introduction: Grading Under the Microscope

Sports card grading is the backbone of the collectibles hobby – a PSA 10 vs PSA 9 on the same card can mean thousands of dollars of difference in value. Yet the process behind those grades has remained stubbornly old-fashioned, relying on human eyes and judgment. In an age of artificial intelligence and computer vision, many are asking: why hasn’t this industry embraced technology for more consistent, transparent results? The sports card grading industry is booming (PSA alone graded 13.5 million items in 2023, commanding ~78% of the market), but its grading methods have seen little modernization. It’s a system well overdue for a shakeup – and AI might be the perfect solution.

The Human Element: Trusted but Inconsistent

For over 30 years, Professional Sports Authenticator (PSA) has set the standard in grading, building a reputation for expertise and consistency . Many collectors trust PSA’s human graders to spot subtle defects and assess a card’s overall appeal in ways a machine allegedly cannot. This trust and track record are why PSA-graded cards often sell for more than those graded by newer, tech-driven companies. Human graders can apply nuanced judgment – understanding vintage card print idiosyncrasies, knowing how an odd factory cut might affect eye appeal, etc. – which some hobbyists still value.

However, the human touch has undeniable downsides. Grading is inherently subjective: two experienced graders might assign different scores to the same card. Mood, fatigue, or unconscious bias can creep in. And the job is essentially a high-volume, low-wage one, meaning even diligent graders face burnout and mistakes in a deluge of submissions. Over the pandemic boom, PSA was receiving over 500,000 cards per week, leading to a backlog of 12+ million cards by early 2021. They had to suspend submissions for months and hire 1,200 new employees to catch up. Relying purely on human labor proved to be a bottleneck – an expensive, slow, and error-prone way to scale. Inconsistencies inevitably arise under such strain, frustrating collectors who crack cards out of their slabs and resubmit them hoping for a higher grade on a luckier day. This “grading lottery” is accepted as part of the hobby, but it shouldn’t be.

Anecdotes of inconsistency abound: Collectors tell stories of a card graded PSA 7 on one submission coming back PSA 8 on another, or vice versa. One hobbyist recounts cracking a high-grade vintage card to try his luck again – only to have it come back with an even lower grade, and eventually marked as “trimmed” by a different company. While such tales may be outliers statistically, they underscore a core point: human grading isn’t perfectly reproducible. As one vintage card expert put it, in a high-volume environment “mistakes every which way will happen” . The lack of consistency not only erodes collector confidence but actively incentivizes wasteful behavior like repeated resubmissions.

Published Standards, Unpredictable Results

What’s ironic is that the major grading companies publish clear grading standards. PSA’s own guide, for instance, specifies that a Gem Mint 10 card must be centered 55/45 or better on the front (no worse than 60/40 for a Mint 9), with only minor flaws like a tiny print spot allowed. Those are numeric thresholds that a computer can measure with pixel precision. Attributes like corner sharpness, edge chipping, and surface gloss might seem more subjective, but they can be quantified too – e.g. by analyzing images for wear patterns or gloss variance. In other words, the criteria for grading a card are largely structured and known.

If an AI system knows that a certain scratch or centering offset knocks a card down to a 9, it will apply that rule uniformly every time. A human, by contrast, might overlook a faint scratch at 5pm on a Friday or be slightly lenient on centering for a popular rookie card. The unpredictability of human grading has real consequences: collectors sometimes play “submitter roulette,” hoping their card catches a grader on a generous day. This unpredictability is so entrenched that an entire subculture of cracking and resubmitting cards exists, attempting to turn PSA 9s into PSA 10s through persistence. It’s a wasteful practice that skews population reports and costs collectors money on extra fees – one that could be curbed if grading outcomes were consistent and repeatable.

A Hobby Tailor-Made for AI

Trading cards are an ideal use-case for AI and computer vision. Unlike, say, comic books or magazines (which have dozens of pages, staples, and complex wear patterns to evaluate), a sports card is a simple, two-sided object of standard size. Grading essentially boils down to assessing four sub-criteria – centering, corners, edges, surface – according to well-defined guidelines. This is exactly the kind of structured visual task that advanced imaging systems excel at. Modern AI can scan a high-resolution image of a card and detect microscopic flaws in an instant. Machine vision doesn’t get tired or biased; it will measure a border centering as 62/38 every time, without rounding up to “approximately 60/40” out of sympathy.

In fact, several companies have proven that the technology is ready. TAG Grading (Technical Authentication & Grading) uses a multi-patented computer vision system to grade cards on a 1,000-point scale that maps to the 1–10 spectrum. Every TAG slab comes with a digital report pinpointing every defect, and the company boldly touts “unrivaled accuracy and consistency” in grading. Similarly, Arena Club (co-founded by Derek Jeter) launched in 2022 promising AI-assisted grading to remove human error. Arena Club’s system scans each card and produces four sub-grades plus an overall grade, with a detailed report of flaws. “You can clearly see why you got your grade,” says Arena’s CTO, highlighting that AI makes grading consistent across different cards and doesn’t depend on the grader. In other words, the same card should always get the same grade – the ultimate goal of any grading process.

Even PSA itself has dabbled in this arena. In early 2021, PSA acquired Genamint Inc., a tech startup focused on automated card diagnostics. The idea was to integrate computer vision that could measure centering, detect surface issues or alterations, and even “fingerprint” each card to track if the same item gets resubmitted. PSA’s leadership acknowledged that bringing in technology would allow them to grade more cards faster while improving accuracy. Notably, one benefit of Genamint’s card fingerprinting is deterring the crack-and-resubmit cycle by recognizing cards that have been graded before. (One can’t help but wonder if eliminating resubmissions – and the extra fees they generate – was truly in PSA’s financial interest, which might explain why this fingerprinting feature isn’t visibly advertised to collectors.)

The point is: AI isn’t some far-off fantasy for card grading – it’s here. Multiple firms have developed working systems that scan cards, apply the known grading criteria, and produce a result with blinding speed and precision. A newly launched outfit, Zeagley Grading, showcased in 2025 a fully automated AI grading platform that checks “thousands of high-resolution checkpoints” on each card’s surface, corners, and edges. Zeagley provides a QR-coded digital report with every slab explaining exactly how the grade was determined, bringing transparency to an area long criticized for its opacity. The system is so confident in its consistency that they’ve offered a public bounty: crack a Zeagley-slabbed card and resubmit it – if it doesn’t come back with the exact same grade, they’ll pay you $1,000. That is the kind of repeatability collectors dream of. It might sound revolutionary, but as Zeagley’s founders themselves put it, “What we’re doing now isn’t groundbreaking at all – it’s what’s coming next that is.” In truth, grading a piece of glossy cardboard with a machine should be straightforward in 2025. We have the tech – it’s the will to use it that’s lagging.

Why the Slow Adoption? (Ulterior Motives?)

If AI grading is so great, why haven’t the big players fully embraced it? The resistance comes from a mix of practical and perhaps self-serving reasons. On the practical side, companies like PSA and Beckett have decades of graded cards in circulation. A sudden shift to machine-grading could introduce slight changes in standards – for example, the AI might technically grade tougher on centering or surface than some human graders have historically. This raises a thorny question: would yesterday’s PSA 10 still be a PSA 10 under a new automated system? The major graders are understandably cautious about undermining the consistency (or at least continuity) of their past population reports. PSA’s leadership has repeatedly stated that their goal is to assist human graders with technology, not replace them. They likely foresee a gradual integration where AI catches the easy stuff – measuring centering, flagging obvious print lines or dents – and humans still make the final judgment calls, keeping a “human touch” in the loop.

But there’s also a more cynical view in hobby circles: the status quo is just too profitable. PSA today is bigger and more powerful than ever – flush with record revenue from the grading boom and enjoying market dominance (grading nearly 4 out of every 5 cards in the hobby ). The lack of consistency in human grading actually drives more business for them. Think about it: if every card got a perfectly objective grade, once and for all, collectors would have little reason to ever resubmit a card or chase a higher grade. The reality today is very different. Many collectors will crack out a PSA 9 and roll the dice again, essentially paying PSA twice (or more) for grading the same card, hoping for that elusive Gem Mint label. There’s an entire cottage industry of group submitters and dealers who bank on finding undergraded cards and bumping them up on resubmission. It’s not far-fetched to suggest that PSA has little incentive to eliminate that lottery aspect of grading. Even PSA’s own Genamint acquisition, which introduced card fingerprinting to catch resubmissions, could be a double-edged sword – if they truly used it to reject previously-graded cards, it might dry up a steady stream of repeat orders. As one commentator wryly observed, “if TAG/AI grading truly becomes a problem [for PSA], PSA would integrate it… but for now it’s not, so we have what we get.” In other words, until the tech-savvy upstarts start eating into PSA’s market share, PSA can afford to move slowly.

There’s also the human factor of collector sentiment. A segment of the hobby simply prefers the traditional approach. The idea of a seasoned grader, someone who has handled vintage Mantles and modern Prizm rookies alike, giving their personal approval still carries weight. Some collectors worry that an algorithm might be too severe, or fail to appreciate an intangible “eye appeal” that a human might allow. PSA’s brand is built not just on plastic slabs, but on the notion that people – trusted experts – are standing behind every grade. Handing that over entirely to machines risks alienating those customers who aren’t ready to trust a computer over a well-known name. As a 2024 article on the subject noted, many in the hobby still see AI grading as lacking the “human touch” and context for certain subjective calls. It will take time for perceptions to change.

Still, these concerns feel less convincing with each passing year. New collectors entering the market (especially from the tech world) are often stunned at how low-tech the grading process remains. Slow, secretive, and expensive is how one new AI grading entrant described the incumbents – pointing to the irony that grading fees can scale up based on card value (PSA charges far more to grade a card worth $50,000 than a $50 card), a practice seen by some as a form of price-gouging. An AI-based service, by contrast, can charge a flat rate per card regardless of value, since the work and cost to the company are the same whether the card is cheap or ultra-valuable. These startups argue they have no conflicts of interest – the algorithm doesn’t know or care what card it’s grading, removing any unconscious bias or temptation to cut corners for high-end clients. In short, technology promises an objective fairness that the current system can’t match.

Upstart Efforts: Tech Takes on the Titans

In the past few years, a number of new grading companies have popped up promising to disrupt the market with technology. Hybrid Grading Approach (HGA) made a splash in 2021 by advertising a “hybrid” model: cards would be initially graded by an AI-driven scanner, then verified by two human graders. HGA also offered flashy custom labels and quicker turnaround times. For a moment, it looked like a strong challenger, but HGA’s momentum stalled amid reports of inconsistent grades and operational missteps (underscoring that fancy tech still needs solid execution behind it).

TAG Grading, mentioned earlier, took a more hardcore tech route – fully computerized grading with proprietary methods and a plethora of data provided to the customer. TAG’s system, however, launched with limitations: initially they would only grade modern cards (1989-present) and standard card sizes, likely because their imaging system needed retraining or reconfiguration for vintage cards, thicker patch cards, die-cuts, etc. This highlights a challenge for any AI approach: it must handle the vast variety of cards in the hobby, from glossy Chrome finish to vintage cardboard, and even odd-shaped or acetates. TAG chose to roll out methodically within its comfort zone. The result has been rave reviews from a small niche – those who tried TAG often praise the “transparent grading report” showing every flaw – but TAG remains a tiny player. Despite delivering what many consider a better mousetrap, they have not come close to denting PSA’s dominance.

Arena Club, backed by a sports icon’s star power, also discovered how tough it is to crack the market. As Arena’s CFO acknowledged, “PSA is dominant, which isn’t news to anyone… it’s definitely going to be a longer road” to convince collectors. Arena pivoted to position itself not just as a grading service but a one-stop marketplace (offering vaulting, trading, even “Slab Pack” digital reveal products). In doing so, they tacitly recognized that trying to go head-to-head purely on grading technology wasn’t enough. Collectors still gravitate to PSA’s brand when it comes time to sell big cards – even if the Arena Club slab has the same card graded 10 with an AI-certified report, many buyers simply trust PSA more. By late 2024, Arena Club boasted that cards in their AI-grade slabs “have sold for almost the same prices as cards graded by PSA” , but “almost the same” implicitly concedes a gap. The market gives PSA a premium, deservedly or not.

New entrants continue to appear. Besides TAG and Arena, we’ve seen firms like AGS (Automated Grading Systems) targeting the Pokémon and TCG crowd with a fully automated “Robograding” service. AGS uses lasers and scanners to find microscopic defects “easily missed by even the best human graders,” and provides sub-scores and images of each flaw. Their pitch is that they grade 10x faster, more accurately, and cheaper – yet their footprint in the sports card realm is still small. The aforementioned Zeagley launched in mid-2025 with a flurry of press, even offering on-site instant grading demos at card shows. Time will tell if they fare any better. So far, each tech-focused upstart has either struggled to gain trust or found itself constrained to a niche, while PSA is grading more cards than ever (up 21% in volume last year ) and even raising prices for premium services. In effect, the incumbents have been able to watch these challengers from a position of strength and learn from their mistakes.

PSA: Bigger Than Ever, But Is It Better?

It’s worth noting that PSA hasn’t been entirely tech-averse. They use advanced scanners at intake, have implemented card fingerprinting and alteration-detection algorithms (courtesy of Genamint) behind the scenes, and likely use software to assist with centering measurements. Nat Turner, who leads PSA’s parent company, is a tech entrepreneur himself and clearly sees the long-term importance of innovation. But from an outsider’s perspective, PSA’s grading process in 2025 doesn’t look dramatically different to customers than it did a decade ago: you send your cards in, human graders assign a 1–10 grade, and you get back a slab with no explanation whatsoever of why your card got the grade it did. If you want more info, you have to pay for a higher service tier and even then you might only get cursory notes. This opacity is increasingly hard to justify when competitors are providing full digital reports by default. PSA’s stance seems to be that its decades of experience are the secret sauce – that their graders’ judgment cannot be fully replicated by a machine. It’s a defensible position given their success, but also a conveniently self-serving one. After all, if the emperor has ruled for this long, why acknowledge any need for a new way of doing things?

However, cracks (no pun intended) are showing in the facade. The hobby has not forgotten the controversies where human graders slipped up – like the scandal a few years ago where altered cards (trimmed or recolored) managed to get past graders and into PSA slabs, rocking the trust in the system. Those incidents suggest that even the best experts can be duped or make errors that a well-trained AI might catch via pattern recognition or measurement consistency. PSA has since leaned on technology more for fraud detection (Genamint’s ability to spot surface changes or match a card to a known altered copy is likely in play), which is commendable. But when it comes to the routine task of assigning grades, PSA still largely keeps that as an art, not a science.

To be fair, PSA (and rivals like Beckett and SGC) will argue that their human-led approach ensures a holistic assessment of each card. A grader might overlook one tiny print dot if the card is otherwise exceptional, using a bit of reasonable discretion, whereas an algorithm might deduct points rigidly. They might also argue that collectors themselves aren’t ready to accept a purely AI-driven grade, especially for high-end vintage where subtle qualities matter. There’s truth in the notion that the hobby’s premium prices often rely on perceived credibility – and right now, PSA’s brand carries more credibility than a newcomer robot grader in the eyes of many auction bidders. Thus, PSA can claim that by sticking to (and refining) their human grading process, they’re actually protecting the market’s trust and the value of everyone’s collections. In short: if it ain’t broke (for them), why fix it?

The Case for Change: Consistency, Transparency, Trust

Despite PSA’s dominance, the case for an AI-driven shakeup in grading grows stronger by the day. The hobby would benefit enormously from grading that is consistent, repeatable, and explainable. Imagine a world where you could submit the same card to a grading service twice and get the exact same grade, with a report detailing the precise reasons. That consistency would remove the agonizing second-guessing (“Should I crack this 9 and try again?”) and refocus everyone on the card itself rather than the grading lottery. It would also level the playing field for collectors – no more wondering if a competitor got a PSA 10 because they’re a bulk dealer who “knows a guy” or just got lucky with a lenient grader. Every card, every time, held to the same standard.

Transparency is another huge win. It’s 2025 – why are we still largely in the dark about why a card got a 8 vs a 9? With AI grading, detailed digital grading reports are a natural output. Companies like TAG and Zeagley are already providing these: high-res imagery with circles or arrows pointing out each flaw, sub-scores for each category, and even interactive web views to zoom in on problem areas. Not only do these reports educate collectors on what to look for, they also keep the grading company honest. If the report says your card’s surface got an 8.5/10 due to a scratch and you, the collector, don’t see any scratch, you’d have grounds to question that grade immediately. In the current system, good luck – PSA simply doesn’t answer those questions beyond generic responses. Transparency would greatly increase trust in grading, ironically the very thing PSA prides itself on. It’s telling that one of TAG’s slogans is creating “transparency, accuracy, and consistency for every card graded.” Those principles are exactly what collectors have been craving.

Then there’s the benefit of speed and efficiency. AI grading systems can process cards much faster than humans. A machine can work 24/7, doesn’t need coffee breaks, and can ramp up throughput just by adding servers or scanners (whereas PSA had to physically expand to a new 130,000 sq ft facility and hire dozens of new graders to increase capacity ). Faster grading means shorter turnaround times and fewer backlogs. During the pandemic, we saw how a huge backlog can virtually paralyze the hobby’s lower end – people stopped sending cheaper cards because they might not see them back for a year. If AI were fully deployed, the concept of a months-long queue could vanish. Companies like AGS brag about “grading 10,000 cards in a day” with automation; even if that’s optimistic, there’s no doubt an algorithm can scale far beyond what manual grading ever could.

Lastly, consider cost. A more efficient grading process should eventually reduce costs for both the company and the consumer. Some of the new AI graders are already undercutting on price – e.g. Zeagley offering grading at $9.99 a card for a 15-day service – whereas PSA’s list price for its economy tier floats around $19–$25 (and much more for high-value or faster service). Granted, PSA has the brand power to charge a premium, but in a competitive market a fully automated solution should be cheaper to operate per card. That savings can be passed on, which encourages more participation in grading across all value levels.

The ChatGPT Experiment: DIY Grading with AI

Perhaps the clearest proof that card grading is ripe for automation is that even hobbyists at home can now leverage AI to grade their cards in a crude way. Incredibly, thanks to advances in AI like OpenAI’s ChatGPT, a collector can snap high-resolution photos of a card (front and back), feed them into an AI model, and ask for a grading opinion. Some early adopters have done just that. One collector shared that he’s “been using ChatGPT to help hypothetically grade cards” – he uploads pictures and asks, “How does the centering look? What might this card grade on PSA’s scale?” The result? “Since I’ve started doing this, I have not received a grade lower than a 9” on the cards he chose to submit. In other words, the AI’s assessment lined up with PSA’s outcomes well enough that it saved him from sending in any card that would grade less than mint. It’s a crude use of a general AI chatbot, yet it highlights something powerful: even consumer AI can approximate grading if given the standards and some images.

Right now, examples like this are more curiosities than commonplace. Very few collectors are actually using ChatGPT or similar tools to pre-grade on a regular basis. But it’s eye-opening that it’s even possible. As image recognition AI improves and becomes more accessible, one can imagine a near-future app where you scan your card with your phone and get an instantaneous grade estimate, complete with highlighted flaws. In fact, some apps and APIs already claim to do this for pre-grading purposes. It’s not hard to imagine a scenario where collectors start publicly verifying or challenging grades using independent AI tools – “Look, here’s what an unbiased AI thinks of my card versus what PSA gave it.” If those two views diverge often enough, it could pressure grading companies to be more transparent or consistent. At the very least, it empowers collectors with more information about their own cards’ condition.

Embracing the Future: It’s Time for Change

The sports card grading industry finds itself at a crossroads between tradition and technology. PSA is king – and by many metrics, doing better than ever in terms of business – but that doesn’t mean the system is perfect or cannot be improved. Relying purely on human judgment in 2025, when AI vision systems are extraordinarily capable, feels increasingly antiquated. The hobby deserves grading that is as precise and passion-driven as the collectors themselves. Adopting AI for consistent and repeatable standards should be an easy call: it would eliminate so many pain points (inconsistency, long waits, lack of feedback) that collectors grumble about today.

Implementing AI doesn’t have to mean ousting the human experts entirely. A hybrid model could offer the best of both worlds – AI for objectivity and humans for oversight. For example, AI could handle the initial inspection, quantifying centering to the decimal and finding every tiny scratch, then a human grader could review the findings, handle any truly subjective nuances (like eye appeal or print quality issues that aren’t easily quantified), and confirm the final grade. The human becomes more of a quality control manager rather than the sole arbiter. This would massively speed up the process and tighten consistency, while still keeping a human in the loop to satisfy those who want that assurance. Over time, as the AI’s track record builds trust, the balance could shift further toward full automation.

Ultimately, the adoption of AI in grading is not about devaluing human expertise – it’s about capturing that expertise in a reproducible way. The best graders have an eye for detail; the goal of AI is to have 1000 “eyes” for detail and never blink. Consistency is king in any grading or authentication field. Imagine if two different coin grading experts could look at the same coin and one says “MS-65” and the other “MS-67” – coin collectors would be up in arms. And yet, in cards we often tolerate that variability as normal. We shouldn’t. Cards may differ subtly in how they’re produced (vintage cards often have rough cuts that a computer might flag as edge damage, for instance), so it’s important to train the AI on those nuances. But once trained, a machine will apply the standard exactly, every single time. That level of fairness and predictability would enhance the hobby’s integrity.

It might take more time – and perhaps a serious competitive threat – for the giants like PSA to fully embrace an AI-driven model. But the winds of change are blowing. A “technological revolution in grading” is coming; one day we’ll look back and wonder how we ever trusted the old legacy process, as one tech expert quipped. The smarter companies will lead that revolution rather than resist it. Collectors, too, should welcome the change: an AI shakeup would make grading more of a science and less of a gamble. When you submit a card, you should be confident the grade it gets is the grade it deserves, not the grade someone felt like giving it that day. Consistency. Transparency. Objectivity. These shouldn’t be revolutionary concepts, but in the current state of sports card grading, they absolutely are.

The sports card hobby has always been a blend of nostalgia and innovation. We love our cardboard heroes from the past, but we’ve also embraced new-age online marketplaces, digital card breaks, and blockchain authentication. It’s time the critical step of grading catches up, too. Whether through an industry leader finally rolling out true AI grading, or an upstart proving its mettle and forcing change, collectors are poised to benefit. The technology is here, the need is obvious, and the hobby’s future will be brighter when every slabbed card comes with both a grade we can trust and the data to back it up. The sooner we get there, the better for everyone who loves this game.

Humans + Machines: From Co-Pilots to Convergence — A Friendly Response to Josh Caplan’s “Interview with AI”

1. Setting the Table

Josh, I loved how you framed your conversation with ChatGPT-4o around three crisp horizons — 5, 25 and 100 years. It’s a structure that forces us to check our near-term expectations against our speculative impulses. Below I’ll walk through each horizon, point out where my own analysis aligns or diverges, and defend those positions with the latest data and research.

2. Horizon #1 (≈ 2025-2030): The Co-Pilot Decade

Where we agree

You write that “AI will write drafts, summarize meetings, and surface insights … accelerating workflows without replacing human judgment.” Reality is already catching up:

A May 2025 survey of 645 engineers found 90 % of teams are now using AI tools, up from 61 % a year earlier; 62 % report at least a 25 % productivity boost.

Early enterprise roll-outs of Microsoft 365 Copilot show time savings of 30–60 minutes per user per day and cycle-time cuts on multi-week processes down to 24 hours.

These numbers vindicate your “co-pilot” metaphor: narrow-scope models already augment search, summarization and code, freeing humans for higher-order decisions.

Where I’m less sanguine

The same studies point to integration debt: leaders underestimate the cost of securing data pipes, redesigning workflows and upskilling middle management to interpret AI output. Until those invisible costs are budgeted up-front, the productivity bump you forecast could flatten.

3. Horizon #2 (≈ 2050): Partners in Intelligence

Your claim: By 2050 the line between “tool” and “partner” blurs; humans focus on ethics, empathy and strategy while AI scales logic and repetition.

Supportive evidence

A June 2025 research agenda on AI-first systems argues that autonomous agents will run end-to-end workflows, with humans “supervising, strategizing and acting as ethical stewards.” The architecture is plausible: agentic stacks, retrieval-augmented memory, and multimodal grounding already exist in prototype.

The labour market caveat

The World Economic Forum’s Future of Jobs 2025 projects 170 million new jobs and 92 million displaced by 2030, for a net gain of 78 million — but also warns that 59 % of current workers will need reskilling. That tension fuels today’s “Jensen-vs-Dario” debate: Nvidia’s Jensen Huang insists “there will be more jobs,” while Anthropic’s Dario Amodei fears a white-collar bloodbath that could wipe out half of entry-level roles.

My take: both can be right. Technology will spawn new roles, but only if public- and private-sector reskilling keeps pace with task-level disruption. Without that, we risk a bifurcated workforce of AI super-users and those perpetually catching up.

4. Horizon #3 (≈ 2125): Symbiosis or Overreach?

You envision brain-computer interfaces (BCIs) and digital memory extensions leading to shared intelligence. The trajectory isn’t science fiction anymore:

Neuralink began human clinical trials in June 2025 and already has five paralyzed patients controlling devices by thought.

Scholarly work now focuses less on raw feasibility than on regulating autonomy, mental privacy and identity in next-generation BCIs.

Where caution is warranted

Hardware failure rates, thread migration in neural tissue, and software-mediated hallucinations all remain unsolved. The moral of the story: physical symbiosis will arrive in layers — therapeutic first, augmentative later — and only under robust oversight.

5. Managing the Transition

6. Closing Thoughts

Josh, your optimism is infectious and, on balance, justified. My friendly amendments are less about dampening that optimism than grounding it in empirics:

Co-pilots already work — but require invisible plumbing and new managerial skills. Partners by 2050 are plausible, provided reskilling outpaces displacement. Symbiosis is a centuries-long marathon, and the ethical scaffolding must be built now.

If we treat literacy, upskilling and governance as first-class engineering problems — not afterthoughts — the future you describe can emerge by design rather than by accident. I look forward to your rebuttal over coffee, human or virtual.

— Paginated Report Bear and ChatGPT o3

🎩 Retire Your Top Hat: Why It’s Time to Say Goodbye to “Whilst”

There’s a word haunting documents, cluttering up chat messages, and lurking in email threads like an uninvited character from Downton Abbey. That word is whilst.

Let’s be clear: no one in the United States says this unironically. Not in conversation. Not in writing. Not in corporate life. Not unless they’re also saying “fortnight,” “bespoke,” or “I daresay.”

It’s Not Just Archaic—It’s Distracting

In American English, whilst is the verbal equivalent of someone casually pulling out a monocle in a team meeting. It grabs attention—but not the kind you want. It doesn’t make you sound smart, elegant, or refined. It makes your writing sound like it’s cosplaying as a 19th-century butler.

It’s the verbal “smell of mahogany and pipe tobacco”—which is great for a Sherlock Holmes novel. Less so for a Q3 strategy deck.

“But It’s Just a Synonym for While…”

Not really. In British English, whilst has some niche usage as a slightly more formal or literary variant of while. But in American English, it feels affected. Obsolete. Weird. According to Bryan Garner, the go-to authority on usage, it’s “virtually obsolete” in American English.

Even The Guardian—a proudly British publication—says:

“while, not whilst.”
If they don’t want it, why should we?

The Data Doesn’t Lie

A quick glance at any American English corpus tells the story:
while appears hundreds of times more often than whilst.
You are more likely to encounter the word defenestrate in a U.S. context than whilst. (And that’s saying something.)

When You Use “Whilst” in American Writing, Here’s What Happens:

Your reader pauses, just long enough to think, “Wait, what?”
The tone of your writing shifts from clear and modern to weirdly antique.
Your credibility takes a micro-dip, especially if you’re talking about anything tech, product, UX, or business-related.

If your aim is clarity, fluency, and modern tone, whilst is working against you. Every. Single. Time.

So Why Are People Still Using It?

Sometimes it’s unintentional—picked up from reading British content or working with UK colleagues. Fair. But often it’s performative. A subtle “look how elevated my writing is.” Spoiler: it’s not.

Here’s a Radical Idea: Use “While”

It’s simple.
It’s modern.
It’s not pretending it’s writing for The Times in 1852.

Final Verdict

Unless you are:

A Dickensian character,
Writing fanfiction set in Edwardian England,
Or legally required by the BBC,

please—for the love of plain language—stop using whilst.

Say while. Your readers will thank you. Your teammates will stop rolling their eyes. And your copy will immediately gain 200% more credibility in the modern world.

This blog post was created with help from ChatGPT to combat the “whilst” crowd at my office

The Rise and Heartbreak of Antonio McDyess: A Superstar’s Path Cut Short

Note: Antonio McDyess is one of my favorite players that no one I know seems to know or remember, so I asked ChatGPT Deep Research to help tell the story of his rise to the cusp of superstardom. Do a YouTube search for McDyess highlights – it’s a blast.

Humble Beginnings and Early Promise

Antonio McDyess hailed from small-town Quitman, Mississippi, and quickly made a name for himself on the basketball court. After starring at the University of Alabama – where he led the Crimson Tide in both scoring and rebounding as a sophomore – McDyess entered the star-studded 1995 NBA Draft . He was selected second overall in that draft (one of the deepest of the 90s) and immediately traded from the LA Clippers to the Denver Nuggets in a draft-night deal . To put that in perspective, the only player taken ahead of him was Joe Smith, and McDyess’s draft class included future luminaries like Jerry Stackhouse, Rasheed Wallace, and high-school phenom Kevin Garnett . From day one, it was clear Denver had landed a budding star.

McDyess wasted little time in validating the hype. As a rookie in 1995-96, the 6’9” forward (affectionately nicknamed “Dice”) earned All-Rookie First Team honors , immediately showcasing his talent on a struggling Nuggets squad. By his second season, despite Denver’s woes, McDyess was averaging 18.3 points and 7.3 rebounds per game , often the lone bright spot on a team that won just 21 games. His blend of size, explosive athleticism, and effort made him a fan favorite. Nuggets supporters could “see the future through McDyess” and believed it could only get better . He was the franchise’s great hope – a humble, hardworking Southern kid with sky-high potential – and he carried those expectations with quiet determination.

High-Flying Star on the Rise

McDyess’s game was pure electricity. He was an elite leaper who seemed to play above the rim on every possession, throwing down thunderous dunks that brought crowds to their feet . In fact, it took only a few preseason games for observers to start comparing him to a young Shawn Kemp – except with a better jump shot . That was the kind of rarefied talent McDyess possessed: the power and ferocity of a dunk-contest legend, combined with a soft mid-range touch that made him a matchup nightmare. “He’s showing the talent and skills that made him a premier player,” Suns GM Bryan Colangelo raved during McDyess’s early career, “There’s so much upside to his game that he can only get better.”

After two productive seasons in Denver, McDyess was traded to the Phoenix Suns in 1997, and there his star continued to ascend. Teaming with an elite point guard in Jason Kidd, the 23-year-old McDyess thrived. He averaged 15.1 points (on a phenomenal 53.6% shooting) along with 7.6 rebounds in 1997-98, and he only improved as the season went on . With “Dice” patrolling the paint and finishing fast breaks, the Suns won 56 games that year – a remarkable turnaround that had fans in Phoenix dreaming of a new era. McDyess was wildly athletic and electric, the perfect running mate for Kidd in an up-tempo offense . At just 23, he was already being looked at as a future superstar who could carry a franchise.

That rising-star status was cemented during the summer of 1998. McDyess became one of the hottest targets in free agency, courted by multiple teams despite the NBA’s lockout delaying the offseason. In a now-legendary saga, McDyess initially agreed to return to Denver, but had second thoughts when Phoenix pushed to re-sign him. The situation turned into something of a sports soap opera: Jason Kidd and two Suns teammates actually chartered a plane and flew through a blizzard to Denver in a last-ditch effort to persuade McDyess to stay in Phoenix . (They were so desperate to keep him that they literally showed up at McNichols Arena in the snow!) Nuggets management caught wind of this and made sure Kidd’s crew never got to meet with McDyess – even enlisting hockey legend Patrick Roy to charm the young forward with a signed goalie stick . In the end, McDyess decided to stick with Denver, a testament to how much the franchise – and its city – meant to him. The entire episode, however, underscored a key point: McDyess was so coveted that All-Star players were willing to move heaven and earth to recruit him.

Back in Denver for the lockout-shortened 1999 season, McDyess validated all that frenzy by erupting with the best basketball of his life. Freed to be the focal point, he posted a jaw-dropping 21.2 points and 10.7 rebounds per game that year . To put that in context, he became one of only three Nuggets players in history to average 20+ points and 10+ rebounds over a season (joining franchise legends Dan Issel and George McGinnis) . At just 24 years old, McDyess earned All-NBA Third Team honors in 1999 , officially marking him as one of the league’s elite forwards. He was no longer just “promising” – he was arriving. Denver fans, long starved for success, finally had a young cornerstone to rally around. As one local writer later remembered, “McDyess was giving Nuggets fans hope for the future” during those late ’90s seasons. Every night brought a new display of his blossoming skill: a high-flying alley-oop slam, a soaring rebound in traffic, a fast-break finish punctuated by a rim-rattling dunk. The NBA took notice that this humble kid from Mississippi had become a nightly double-double machine and a highlight waiting to happen.

Peak of His Powers

By the 2000-01 season, Antonio McDyess was widely regarded as one of the best power forwards in the game. In an era stacked with superstar big men – Tim Duncan, Kevin Garnett, Chris Webber, and others – McDyess had firmly earned his place in that conversation. He led the Nuggets with 20.8 points and 12.1 rebounds per game in 2000-01 , becoming just the third Denver player ever to average 20-and-10 for a full season . That year he was rewarded with his first and only NBA All-Star selection , a recognition that Nuggets fans felt was overdue. On a national stage, the 26-year-old McDyess rubbed shoulders with the league’s greats, validating that he truly belonged among them.

Beyond the numbers, what made McDyess special was how he played the game. He was an “old-school” power forward with new-age athleticism. One moment he’d muscle through a defender in the post for a put-back dunk; the next he’d step out and coolly knock down a 15-foot jumper. On defense, he held his own as well – blocking shots, controlling the glass, and using his quickness to guard multiple positions. In fact, McDyess was selected to represent the United States in the 2000 Sydney Olympics, where he earned a gold medal and even hit a game-winner during the tournament . Winning Olympic gold was both a personal triumph and another affirmation that he was among basketball’s elite. As the 2000-01 NBA season went on, McDyess seemed to put it all together. He notched monster stat lines – including a career-high 46 points and 19 rebounds in one game – and routinely carried a middling Nuggets squad on his back. The team finished 40-42, their best record in six years , and while they narrowly missed the playoffs, the arrow was pointing straight up. It was easy to imagine Denver building a contender around their star forward. Antonio McDyess was on the path to superstardom, and everyone knew it.

By this point, even casual fans could recognize McDyess’s name. He wasn’t flashy off the court – a quiet, humble worker rather than a self-promoter – but on the court he was downright spectacular. Longtime Nuggets followers will tell you how McDyess’s presence made even the dark days of the late ’90s bearable. He gave them hope. As one writer later lamented, “The joy he brought Denver fans through the tough, lean ’90s was immeasurable.” In McDyess, the Nuggets saw a centerpiece to build around for the next decade. He was just entering his prime, continuing to refine his skills to match his athletic gifts, and carrying himself with a quiet confidence that inspired those around him. It truly felt like nothing could stop him.

A Cruel Twist of Fate

But sometimes in sports, fate intervenes in the unkindest way. For Antonio McDyess, that moment came just as he reached his peak. Late in the 2000-01 season – after he had been playing some of the best basketball of his life – McDyess suffered a painful knee injury, a partially dislocated kneecap . He tried to come back healthy for the next year, but the worst was yet to come. Early in the 2001-02 season, only about ten games in, disaster struck: McDyess ruptured his patellar tendon in his left knee, the kind of devastating injury that can end careers in an instant . He underwent surgery and was ruled out for the entire season . In fact, that one injury wiped away effectively two years of his prime – McDyess would miss all of 2001-02 and all of 2002-03, watching helplessly from the sidelines as the promising trajectory of his career was violently ripped away .

It’s hard to overstate just how heartbreaking this turn of events was. One month, McDyess was on top of the world – an All-Star, the face of a franchise, seemingly invincible when he took flight for a dunk. The next, he was facing the reality that he might never be the same player again. As Denver Stiffs painfully summarized, “Oh what could have been. McDyess had the makings of a long-time star in this league until a freak injury happened.” In fact, that knee injury was so catastrophic that it effectively ended not only McDyess’s superstar run but also played a part in ending coach Dan Issel’s tenure (Issel resigned amid the team’s struggles shortly after) . The basketball gods, it seemed, can be unbearably cruel.

For Nuggets fans – and NBA fans in general – McDyess’s injury was the kind of story that just breaks your heart. In the years that followed, McDyess valiantly attempted to come back. He was traded to the New York Knicks in 2002 as part of a blockbuster deal, only to re-injure the same knee in a freak accident (landing from a dunk in a preseason game) before he could ever really get started in New York . He eventually found a second life as a role player: after a brief return to Phoenix, McDyess signed with the Detroit Pistons and reinvented his game to compensate for his diminished athleticism . Instead of soaring above the rim every night, he became a savvy mid-range shooter and a reliable veteran presence, helping Detroit reach the NBA Finals in 2005.

McDyess later reinvented himself as a reliable mid-range shooter and veteran leader – a testament to his determination – but the explosive athleticism of his youth was never fully regained.

Watching McDyess in those later years was bittersweet. He was still a good player – even showing flashes of the old “Dice” brilliance on occasion – but we could only catch glimpses of what he once was . The once-explosive leaper now played below the rim, leaning on skill and experience rather than raw hops. And while he carved out a respectable lengthy career (15 seasons in the NBA) and remained, by all accounts, one of the most humble and beloved guys in the league, the superstar path that he had been on was gone forever. McDyess would never again average more than 9 points a game after his injury , a stark reminder of how swiftly fortune can turn in professional sports.

For many fans, Antonio McDyess became part of a tragic NBA fraternity – the “what if?” club. Just as we later saw with Penny Hardaway (whose Hall-of-Fame trajectory with the Orlando Magic was cut short by knee injuries in the late ’90s) or Derrick Rose (whose MVP ascent was halted by an ACL tear in 2012), McDyess’s story is one of unrealized potential. He was only 26 when his body betrayed him. We are left to imagine how high he might have soared, how many All-Star games he might have played in, or how he might have altered the balance of power in the league had he stayed healthy. Would Denver have built a contender around him? Would “Dice” have joined the pantheon of great power forwards of the 2000s? Those questions will never be answered, but the fact that we ask them at all is a testament to his talent.

In the end, Antonio McDyess’s career is remembered with a mix of admiration and melancholy. Admiration for the beast of a player he was before the injuries, and for the grace with which he handled the adversity that followed. Melancholy for the superstar we never fully got to see. As one longtime fan put it, McDyess was “as nice off the court as he was just plain nasty on the court” – a gentle soul with a ferocious game. He gave everything he had to the sport, and even when fate dealt him a cruel hand, he never lost his love for the game or his humility.

For younger or newer basketball fans who may not know his name, Antonio McDyess’s story serves as both an inspiration and a cautionary tale. At his peak, he was magnificent – a player with all the tools to be a perennial All-Star, a near-superstar whose every game was worth watching. And yet, he’s also a reminder of how fragile athletic greatness can be. One moment you’re flying high above the rim, the next moment it’s all gone. McDyess once brought limitless hope to a franchise and its fans, and though his journey took a heartbreaking turn, his early brilliance will never be forgotten.

In the echoes of those who saw him play, you’ll still hear it: Oh, what could have been . But let’s also remember what truly was – an extraordinary talent who, for a few shining years, gave us a glimpse of basketball heaven. Antonio McDyess was a star that burned bright, if only too briefly, and his rise and fall remain one of the NBA’s most poignant tales.

Sources:

Microsoft Fabric Capacity Management: A Comprehensive Guide for Administrators (using ChatGPT’s Deep Research)

Author’s note – I have enjoyed playing around with the Deep Research capabilities of ChatGPT, and I had it put together what it felt was the definitive whitepaper on Capacity Management for Microsoft Fabric. It basically just used the Microsoft documentation (plus a couple of community posts) to pull it together, so I’m curious what you think. I’ll leave a link to download the PDF copy of this at the end of the post.

Executive Summary

Microsoft Fabric capacities provide the foundational compute resources that power the Fabric analytics platform. They are essentially dedicated pools of compute (measured in Capacity Units or CUs) allocated to an organization’s Microsoft Fabric tenant. Proper capacity management is crucial for ensuring reliable performance, supporting all Fabric workloads (Power BI, Data Engineering, Data Science, Real-Time Analytics, etc.), and optimizing costs. This white paper introduces capacity and tenant administrators to the full spectrum of Fabric capacity management – from basic concepts to advanced strategies.

Key takeaways: Fabric offers multiple capacity SKUs (F, P, A, EM, Trial) with differing capabilities and licensing models. Understanding these SKU types and how to provision them is the first step. Once a capacity is in place, administrators must plan and size it appropriately to meet workload demands without over-provisioning. All Fabric experiences share capacity resources, so effective workload management and governance are needed to prevent any one workload from overwhelming others. Fabric’s capacity model introduces bursting and smoothing to handle short-term peaks, while throttling mechanisms protect the system during sustained overloads. Tools like the Fabric Capacity Metrics App provide visibility into utilization and help with monitoring performance and identifying bottlenecks. Administrators should leverage features such as autoscale options (manual or scripted scaling and Spark auto-scaling), notifications, and the new surge protection to manage peak loads and maintain service levels.

Effective capacity management also involves governance practices: assigning workspaces to capacities in a thoughtful way, isolating critical workloads, and controlling who can create or consume capacity resources. Cost optimization is a continuous concern – this paper discusses strategies like pausing capacities during idle periods, choosing the right SKU size (and switching to reserved pricing for savings), and using per-user licensing (Premium Per User) when appropriate to minimize costs. Finally, we present real-world scenarios with recommendations to illustrate how organizations can mix and match these approaches. By following the guidance in this document, new administrators will be equipped to manage Microsoft Fabric capacities confidently and get the most value from their analytics investment.

Introduction to Microsoft Fabric Capacities

Microsoft Fabric is a unified analytics platform that spans data integration, data engineering, data warehousing, data science, real-time analytics, and business intelligence (Power BI). A Microsoft Fabric capacity is a dedicated set of cloud resources (compute memory/CPU) allocated to a tenant to run these analytics workloads. In essence, a capacity represents a chunk of “always-on” compute power measured in Capacity Units (CUs) that your organization owns or subscribes to. The capacity’s size (number of CUs) determines how much computational load it can handle at any given time.

Why capacities matter: Certain Fabric features and collaborative capabilities are only available when content is hosted in a capacity. For example, to share Power BI reports broadly without requiring per-user licenses, or to use advanced Fabric services like Spark notebooks, data warehouses, and real-time analytics, you must use a Fabric capacity. Capacities enable organization-wide sharing, collaboration, and performance guarantees beyond the limits of individual workstations or ad-hoc cloud resources. They act as containers for workspaces – any workspace assigned to a capacity will run all its workload (reports, datasets, pipelines, notebooks, etc.) on that capacity’s resources. This provides predictable performance and isolation: one team’s heavy data science experiment in their capacity won’t consume resources needed by another team’s dashboards on a different capacity. It also simplifies administration – instead of managing separate compute for each project, admins manage pools of capacity that can host many projects.

In summary, Fabric capacities are the backbone of a Fabric deployment, combining compute isolation, performance scaling, and licensing benefits. With a capacity, your organization can create and share Fabric content (from Power BI reports to AI models) with the assurance of dedicated resources and without every user needing a premium license. The rest of this document will explore how to choose the right capacity, configure it for various workloads, keep it running optimally, and do so cost-effectively.

Capacity SKU Types and Differences (F, P, A, EM, Trial)

Microsoft Fabric builds on the legacy of Power BI’s capacity-based licensing, introducing new Fabric (F) SKUs alongside existing Premium (P) and Embedded SKUs. It’s important for admins to understand the types of capacity SKUs available and their differences:

F-SKUs (Fabric SKUs): These are the new* capacity units introduced with Microsoft Fabric. They are purchased through Azure and measured in Capacity Units (CUs). F-SKUs range from small to very large (F2 up to F2048), each providing a set number of CUs (e.g. F2 = 2 CUs, F64 = 64 CUs, etc.). F-SKUs support all Fabric workloads (Power BI content and the new Fabric experiences like Lakehouse, Warehouse, Spark, etc.). They offer flexible cloud purchasing (hourly pay-as-you-go billing with the ability to pause when not in use) and scaling options. Microsoft is encouraging customers to adopt F-SKUs for Fabric due to their flexibility in scaling and billing.
P-SKUs (Power BI Premium per Capacity): These were the traditional Power BI Premium capacities (P1 through P5) bought via the Microsoft 365 admin center with an annual subscription commitment. P-SKUs also support the full Fabric feature set (they have been migrated onto the Fabric backend). However, as of mid-2024, Microsoft has deprecated new purchases of P-SKUs in favor of F-SKUs. Organizations with existing P capacities can use Fabric on them, but new capacity purchases should be F-SKUs going forward. One distinction is that P-SKUs cannot be paused and were billed as fixed annual licenses (less flexible, but previously lower cost for constant use).
A-SKUs (Azure Power BI Embedded): These are Azure-purchased capacities originally meant for Power BI embedded analytics scenarios. They correspond to the same resource levels as some F-SKUs (for example, A4 is equivalent to an F64 in compute power) but only support Power BI workloads – they do not support the new Fabric experiences like Spark or data engineering. A-SKUs can still be used if you only need Power BI (for example, for embedding reports in a web app), but if any Fabric features are needed, you must use an F or P SKU.
EM-SKUs (Power BI Embedded for organization): Another variant of embedded capacity (EM1, EM2, EM3) which are lower-tier and were used for internal “Embedded” scenarios (like embedding Power BI content in SharePoint or Teams without full Premium). Like A-SKUs, EM SKUs are limited to Power BI content only and correspond to smaller capacity sizes (EM3 ~ F32). They cannot run Fabric workloads.
Trial SKU: Microsoft Fabric offers a free trial capacity to let organizations try Fabric for a limited time. The trial capacity provides 64 CUs (equivalent to an F64 SKU) and supports all Fabric features, but lasts for 60 days. This is a fixed-size capacity (roughly equal to a P1 in power) that can be activated without cost. It’s ideal for initial evaluations and proof-of-concept work. After 60 days, the trial expires (though Microsoft has allowed extensions in some cases). Administrators cannot change the size of a trial capacity – it’s pre-set – and there may be limits on the number of trials per tenant.

The table below summarizes the Fabric SKU sizes and their approximate equivalence to Power BI Premium for context:

SKU	Capacity Units (CUs)	Equivalent P-SKU / A-SKU	Power BI v-cores
F2	2 CUs	(no P-SKU; smallest)	0.25 v-core
F4	4 CUs	(no P-SKU)	0.5 v-core
F8	8 CUs	EM1 / A1	1 v-core
F16	16 CUs	EM2 / A2	2 v-cores
F32	32 CUs	EM3 / A3	4 v-cores
F64	64 CUs	P1 / A4	8 v-cores
Trial	64 CUs	(no P-SKU; free trial)	8 v-cores
F128	128 CUs	P2 / A5	16 v-cores
F256	256 CUs	P3 / A6	32 v-cores
F512	512 CUs	P4 / A7	64 v-cores
F1024	1024 CUs	P5 / A8	128 v-cores
F2048	2048 CUs	(no direct P-SKU)	256 v-cores

Table: Fabric capacity SKU sizes in Capacity Units (CU) with equivalent legacy SKUs. Note: P-SKUs P1–P5 correspond to F64–F1024. A-SKUs and EM-SKUs only support Power BI content and roughly map to F8–F32 sizes.

In practical terms, F64 (64 CU) is the threshold where a capacity is considered “Premium” in the Power BI sense – it has the same 8 v-cores as a P1. Indeed, content in workspaces on an F64 or larger can be consumed by viewers with a free Fabric license (no Pro license needed). By contrast, the smaller F2–F32 capacities, while useful for light workloads or development, do not remove the need for Power BI Pro licenses for content consumers. Administrators should be aware of this distinction: if your goal is to enable broad internal report sharing to free users, you will need at least an F64 capacity.

To recap SKU differences: F-SKUs are the modern, Azure-based Fabric capacities that cover all workloads and offer flexibility (pause/resume, hourly billing). P-SKUs (legacy Premium) also cover all workloads but are being phased out for new purchases, and they require an annual subscription (though existing ones can continue to be used for Fabric). A/EM SKUs are limited to Power BI content only and primarily used for embedding scenarios; they might still be relevant if your organization only cares about Power BI and wants a smaller or cost-specific option. And the trial capacity is a temporary F64 equivalent provided free for evaluation purposes.

Licensing and Provisioning

Before you can use a Fabric capacity, you must license and provision it for your tenant. This involves understanding how to acquire the capacity (through Azure or Microsoft 365), what user licenses are needed, and how to set up the capacity in the admin portal.

Purchasing a capacity: For F-SKUs and A/EM SKUs, capacities are purchased via an Azure subscription. You (or your Azure admin) will create a Microsoft Fabric capacity resource in Azure, selecting the SKU size (e.g. F64) and region. The capacity resource is billed to your Azure account. For P-SKUs (if you already have one), they were purchased through the Microsoft 365 admin center (as a SaaS license commitment). As noted, new P-SKU purchases are no longer available after July 2024. If you have existing P capacities, they will show up in the Fabric admin portal automatically. Otherwise, new capacity needs will be fulfilled by creating F-SKUs in Azure.

Provisioning and setup: Once purchased, the capacity must be provisioned in your Fabric tenant. For Azure-based capacities (F, A, EM), this happens automatically when you create the resource – you will see the new capacity listed in the Fabric Admin Portal under Capacity settings. You need to be a Fabric admin or capacity admin to access this. In the Fabric Admin Portal (accessible via the gear icon in the Fabric UI), under Capacity Settings, you will find tabs for Power BI Premium, Power BI Embedded, Fabric capacity, and Trial. Your capacity will appear in the appropriate section (e.g., an F-SKU under “Fabric capacity”). From there, you can manage its settings (more on that later) and assign workspaces to it.

When creating an F capacity in Azure, you will choose a region (datacenter location) for the capacity. This determines where the compute resources live and typically where the data for Fabric items in that capacity is stored. For example, if you create an F64 in West Europe, a Fabric Warehouse or Lakehouse created in a workspace on that capacity will reside in West Europe region (useful for data residency requirements). Organizations with global presence might provision capacities in multiple regions to keep data and computation local to users or comply with regulations.

Per-user licensing requirements: Even with capacities, Microsoft Fabric uses a mix of capacity licensing and per-user licenses:

Every user who authors content or needs access to Power BI features beyond viewing must have a Power BI Pro license (or Premium Per User) unless the content is in a capacity that allows free-user access. In Fabric, a Free user license lets you create and use non-Power BI Fabric items (like Lakehouses, notebooks, etc.) in a capacity workspace, but it does not allow creating standard Power BI content in shared workspaces or sharing those with others. To publish Power BI reports to a workspace (other than your personal My Workspace) and share them, you still need a Pro license or PPU. Essentially, capacity removes license requirements for viewing content (if the capacity is sufficiently large), but content creators typically need Pro/PPU licenses for Power BI work.
For viewers of content: If the workspace is on a capacity smaller than F64, all viewers need Pro licenses as if it were a normal shared workspace. If the workspace is on an F64 or larger capacity (or a P-SKU capacity), then free licensed users can view the content (they just need the basic Fabric free license and viewer role). This is analogous to Power BI Premium capacity behavior. So an admin must plan license needs accordingly – for true wide audience distribution, ensure the capacity is at least F64, otherwise you won’t realize the “free user view” benefit.
Premium Per User (PPU): PPU is a per-user licensing option that provides most Premium features to individual users on shared capacity. While not a capacity, it’s relevant in capacity planning: if you have a small number of users that need premium features, PPU can be more cost-effective than buying a whole capacity. Microsoft suggests considering PPU if fewer than ~250 users need Premium capabilities. For example, rather than an F64 which supports unlimited users, 50 users could each get PPU licenses. However, PPU does not support the broader Fabric workloads (it’s mainly a Power BI feature set license), so if you want the Fabric engineering/science features, you need a capacity.

In summary, to get started you will purchase or activate a capacity and ensure you have at least one user with a Pro (or PPU) license to administer it and publish Power BI content. Many organizations begin with the Fabric trial capacity – any user with admin rights can initiate the trial from the Fabric portal, which creates the 60-day F64 capacity for the tenant. During the trial period, you might allow multiple users to experiment on that capacity. Once ready to move to production, you would purchase an F-SKU of appropriate size. Keep in mind that a trial capacity is time-bound and also fixed in size (you cannot scale a trial up or down). So after gauging usage in trial, you’ll choose a permanent SKU.

Capacity Planning and Sizing Guidance

Choosing the right capacity size is a critical early decision. Capacity planning is the process of estimating how many CUs (or what SKU tier) you need to run your workloads smoothly, both now and in the future. The goal is to avoid performance problems like slow queries or job failures due to insufficient resources, while also not over-paying for idle capacity. This section provides guidance on sizing a capacity and adjusting it as usage evolves.

Understand your workloads and users: Start by profiling the types of workloads and usage patterns you expect on the capacity. Key factors include:

Data volume and complexity: Large data models (e.g. huge Power BI datasets) or heavy ETL processes (like frequent dataflows or Spark jobs) will consume more compute and memory. If you plan to refresh terabyte-scale datasets or run complex transformations daily, size up accordingly.
Concurrent users and activities: Power BI workloads with many simultaneous report users or queries (or heavy embedded analytics usage) can drive up CPU and memory usage quickly. A capacity serving 200 concurrent dashboard users needs more CUs than one serving 20 users. Concurrency in Spark jobs or SQL queries similarly affects load.
Real-time or continuous processing: If you have real-time analytics (such as continuous event ingestion, KQL databases for IoT telemetry, or streaming datasets), your capacity will see constant usage rather than brief spikes. Ongoing processes mean you need enough capacity to sustain a baseline of usage 24/7.
Advanced analytics and data science: Machine learning model training or large-scale data science experiments can be very computationally intensive (high CPU for extended periods). A few data scientists running complex notebooks might consume more CUs than dozens of basic report users. Also consider if they will run jobs concurrently.
Number of users/roles: The more users with access, the greater the chance of overlapping activities. A company with 200 Power BI users running reports will likely require more capacity than one with 10 engineers doing data transformations. Even if each individual task isn’t huge, many small tasks add up.

By evaluating these factors, you can get a rough sense of whether you need a small (F2–F16), medium (F32–F64), or large (F128+) capacity.

Start with data and tools: Microsoft recommends a data-driven approach to capacity sizing. One strategy is to begin with a trial capacity or a small pay-as-you-go capacity, run your actual workloads, and measure the utilization. The Fabric Capacity Metrics App can be installed to monitor CPU utilization, memory, etc., and identify peaks. Over a representative period (say a busy week), observe how much of the 64 CU trial is used. If you find that utilization is peaking near 100% and throttling occurs, you likely need a larger SKU. If usage stays low (e.g. under 30% most of the time), you might get by with a smaller SKU in production or keep the same size with headroom.

Microsoft provides guidance to “start small and then gradually increase the size as necessary.” It’s often best to begin with a smaller capacity, see how it performs, and scale up if you approach limits. This avoids overcommitting to an expensive capacity that you might not fully use. With Fabric’s flexibility, scaling up (or down) capacity is relatively easy through Azure, and short-term overuse can be mitigated by bursting (discussed later).

Concretely, you would:

Measure consumption – perhaps use an F32 or F64 on a trial or month-to-month basis. Use the metrics app to check the CU utilization over time (Fabric measures consumption in 30-second intervals; multiply CUs by 30 to get CU-seconds per interval). Identify peak times and which workloads are driving them (the metrics app breaks down usage by item type, e.g. dataset vs Spark notebook).
Identify requirements – If your peak 30-second CU use is, say, 1500 CU-seconds, that’s roughly 50 CUs worth of power needed continuously in that peak period (since 30 sec * 50 CU = 1500). That suggests an F64 might be just enough (64 CUs) with some buffer, whereas an F32 (32 CUs) would throttle. On the other hand, if peaks only hit 200 CU-seconds (which is ~7 CUs needed), even an F8 could handle it.
Scale accordingly – Choose the SKU that covers your typical peak. It’s wise to allow some headroom, as constant 100% usage will lead to throttling. For instance, if your trial F64 shows occasional 80% spikes, moving to a permanent F64 could be fine thanks to bursting, but if you often hit 120%+ (bursting into future capacity), you should consider F128 or splitting workloads.

Microsoft has also provided a Fabric Capacity Estimator tool (on the Fabric website) which can help model capacity needs by inputting factors like number of users, dataset sizes, refresh rates, etc. This can be a starting point, but real usage metrics are more reliable.

Planning for growth and variability: Keep in mind future growth – if you expect user counts or data volumes to double in a year, factor that into capacity sizing (you may start at F64 and plan to increase to F128 later). Also consider workload timing. Some capacities experience distinct daily peaks (e.g., heavy ETL jobs at 2 AM, heavy report usage at 9 AM). Thanks to Fabric’s bursting and smoothing, a capacity can handle short peaks above its baseline, but if two peaks overlap or usage grows, you might need a bigger size or to schedule workloads to avoid contention. Where possible, schedule intensive background jobs (data refreshes, scoring runs) during off-peak hours for interactive use, to reduce concurrent strain on the capacity.

In summary, do your homework with a trial or pilot phase, leverage monitoring tools, and err on the side of starting a bit smaller – you can always scale up. Capacity planning helps you choose the right SKU and avoid slow queries or throttling while optimizing spend. And remember, you can have multiple capacities too; sometimes the answer is not one gigantic capacity, but two or three medium ones splitting different workloads (we’ll discuss this in governance).

Workload Management Across Fabric Experiences

One of the powerful aspects of Microsoft Fabric is that a single capacity can run a diverse set of workloads: Power BI reports, Spark notebooks, data pipelines, real-time KQL databases, AI models, etc. The capacity’s compute is shared by all these workloads. This section explains how to manage and balance different workloads on a capacity.

Unified capacity, multiple workloads: Fabric capacities are multi-tenant across workloads by design – you don’t buy separate capacity for Power BI vs Spark vs SQL. For example, an F64 capacity could simultaneously be handling a Power BI dataset refresh, a SQL warehouse query, and a Spark notebook execution. All consume from the same pool of 64 CUs. This unified model simplifies architecture: “It doesn’t matter if one user is using a Lakehouse, another is running notebooks, and a third is executing SQL – they can all share the same capacity.” All items in workspaces assigned to that capacity draw on its resources.

However, as an admin, you need to be mindful of resource contention: a very heavy job of one type can impact others. Fabric tries to manage this with an intelligent scheduler and the bursting/smoothing mechanism (which prioritizes interactive operations). Still, you should consider the nature of workloads when assigning them to capacities. Some guidance:

Power BI workloads: These include interactive report queries (DAX queries against datasets), dataset refreshes, dataflows, AI visuals, and paginated reports. In the capacity settings, admins have specific Power BI workload settings (for example, enabling the AI workload for cognitive services, or adjusting memory limits for datasets, similar to Power BI Premium settings). Ensure these are configured as needed – e.g., if you plan on using AI visualizations or AutoML in Power BI, make sure the AI workload is enabled on the capacity. Large semantic models (datasets) can consume a lot of memory; by default Fabric will manage their loading and eviction, but you may want to keep an eye on total model sizes relative to capacity. Paginated reports can be enabled if needed (they can be memory/CPU heavy during execution).
Data Engineering & Science (Spark): Fabric provides Spark engines for notebooks and job definitions. By default, when a Spark job runs, it uses a portion of the capacity’s cores. In fact, for Spark workloads, Microsoft has defined that each 1 CU = 2 Spark vCores of compute power. For example, an F32 (32 CU) capacity has 64 Spark vCores available to allocate across Spark clusters. These vCores are dynamically allocated to Spark sessions as users run notebooks or Spark jobs. Spark has a built-in concurrency limit per capacity: if all Spark vCores are in use, additional Spark jobs will queue until resources free up. As an admin, you can allow or disallow workspace admins from configuring Spark pool sizes on your capacity. If you enable it, power users might spin up large Spark executors that use many cores – beneficial for performance, but potentially starving other workloads. If Spark usage is causing contention, consider limiting the max Spark nodes or advising users to use moderate sizes. Notably, Fabric capacities support bursting for Spark as well – the system can utilize up to 3× the purchased Spark vCores temporarily to run more Spark tasks in parallel. This helps if you occasionally have many Spark jobs at once, but sustained overuse will still queue or throttle. For heavy Spark/ETL scenarios, you might dedicate a capacity just for that to isolate it from BI users.
Data Warehousing (SQL) and Real-Time Analytics (KQL): These workloads run SQL queries or KQL (Kusto Query Language) queries against data warehouses or real-time analytics databases. They consume CPU during query execution and memory for caching data. They are treated as background jobs if run via scheduled processes, or interactive if triggered by a user query. Fabric’s smoothing generally spreads out heavy background query loads over time. Nevertheless, a very expensive SQL query can momentarily spike CPU. As admin, ensure your capacity can handle peak query loads or advise your data teams to optimize queries (like proper indexing on warehouses) to avoid excessive load. There are not many specific toggles for SQL/KQL workloads in capacity settings (beyond enabling the Warehouse or Real-Time Analytics features which are on by default for F and P capacities).
OneLake and data movement: OneLake is the storage foundation for Fabric. While data storage itself doesn’t “consume” capacity CPU (storage is separate), activities like moving data (copying via pipelines), scanning large files, or loading data into a dataframe will use capacity compute. Data integration pipelines (if using Data Factory in Fabric) also run on the capacity. Keep an eye on any heavy data copy or transformation activities, as those are background tasks that could contribute to load.

Isolation and splitting workloads: If you find that certain workloads dominate the capacity, you might consider splitting them onto separate capacities. For instance, a common approach is to separate “self-service BI” and “data engineering” onto different capacities so that a big Spark job doesn’t slow down a business report refresh. Microsoft notes that provisioning multiple capacities can isolate compute for high-priority items or different usage patterns. You could have one capacity dedicated to Power BI content for executives (ensuring their reports are always snappy), and a second capacity for experimental data science projects. This kind of workload isolation via capacities is a governance decision (we will cover more in the governance section). The trade-off is cost and utilization – separate capacities ensure no interference, but you might end up with unused capacity in each if peaks happen at different times. A single capacity shared by all can be more cost-efficient if the workloads’ peak times are complementary.

Tenant settings delegation: In Fabric, some tenant-level settings (for example, certain Power BI tenant settings or workload features) can be delegated to the capacity level. This means you can override a global setting for a specific capacity. For instance, you might have a tenant setting that limits the maximum size of Power BI datasets for Pro workspaces, but for a capacity designated to a specific team, you allow larger models. In the capacity management settings, check the Delegated tenant settings section if you need to tweak such options for one capacity without affecting others. This feature allows granular control, such as enabling preview features or higher limits on a capacity used by advanced users while keeping defaults elsewhere.

Monitoring workload mix: Use the Capacity Metrics App or the Fabric Monitoring Hub to see what types of operations are consuming the most resources. The app can break down usage by item type (e.g., dataset vs Spark vs pipeline) to help identify if one category is the culprit for high utilization. If you notice, for example, that Spark jobs are consistently using the majority of CUs (perhaps visible as high background CPU), it may prompt you to adjust Spark configurations or move some Spark-heavy workspaces off to another capacity.

In summary, Fabric capacities are shared across all workload types, which is great for flexibility but requires good management to ensure balance. Leverage capacity settings to tune specific workloads (Power BI workload enabling, Spark pool limits, etc.), monitor the usage by workload type, and consider logical separation of workloads via multiple capacities if needed. Microsoft Fabric is designed so that the platform itself handles a lot of the balancing (through smoothing of background jobs), but administrator insight and control remain important to avoid any single workload overwhelming the rest.

Isolation and Security Boundaries

Microsoft Fabric capacities play a role in isolation at several levels – performance isolation, security isolation, and even geographic isolation. It’s important to understand what a capacity isolates (and what it doesn’t) within a Fabric tenant, and how to leverage capacities for governance or compliance.

Performance and resource isolation: A capacity is a unit of isolation for compute resources. Compute usage on one capacity does not affect other capacities in the tenant. If Capacity A is overloaded and throttling, it will not directly slow down Capacity B, since each has its own quota of CUs and separate throttling counters. This means you can confidently separate critical workloads by placing them in different capacities to ensure that heavy usage in one area (e.g., a dev/test environment) cannot degrade the performance of another (e.g., production reports). The Fabric platform applies throttling at the capacity scope, so even within the same tenant, one capacity “failing” (hitting limits) doesn’t spill over into another. As noted, there is an exception when it comes to cross-capacity data access: if a Fabric item in Capacity B is trying to query data that resides in Capacity A (for example, a dataset in B accessing a Lakehouse in A via OneLake), then the consuming capacity’s state is what matters for throttling that query. Generally, such cross-capacity consumption is not common except through shared storage like OneLake, and the compute to actually retrieve the data will be accounted to the consumer’s capacity.

Security and content isolation: It’s crucial to realize that a capacity is not a security boundary in terms of data access. All Fabric content security is governed by Entra ID (Azure AD) identities, roles, and workspace permissions, not by capacity. For example, just because Workspace X is on Capacity A and Workspace Y is on Capacity B does not mean users of X cannot access Y – if a user has the right permissions, they can access both. Capacities do not define who can see data; they define where it runs. So if you have sensitive data that only certain users should access, you still must rely on workspace-level security or separate Entra tenants, not merely separate capacities.

That said, capacities can assist with administrative isolation. You can delegate capacity admin roles so that different people manage different capacities. For instance, the finance IT team might be given admin rights to the “Finance Capacity” and they can control which workspaces go into it, without affecting other capacities. Additionally, you can control which workspaces are assigned to which capacity. By limiting capacity assignment rights (via the Contributor permissions setting on a capacity, which you can restrict to specific security groups), you ensure that, say, only approved workspaces/projects go into a certain capacity. This can be thought of as a soft isolation: e.g., only the HR team’s workspaces are placed in the HR capacity, keeping that compute “clean” from others.

Geographical and compliance isolation: If your organization has data residency requirements (for example, EU data must stay in EU datacenters, US data in US), capacities are a useful construct. When you create a capacity, you choose an Azure region for it. Workspaces on that capacity will allocate their Fabric resources in that region. This means you can satisfy multi-geo requirements by having separate capacities in each needed region and assigning workspaces accordingly. It isolates the data and compute to that geography. (Do note that OneLake has a global aspect, but it stores files/objects in the region of the capacity or the region you designate when creating the item. Check Fabric documentation on multi-geo support for details – company examples show deploying capacities per geography).

Tenant isolation: The ultimate isolation boundary is the Microsoft Entra tenant. Fabric capacities exist within a tenant. If you truly need completely separate environments (different user directories, no possibility of data or admin overlap), you would use separate Entra tenants (as was illustrated by Microsoft with one company using two tenants for different divisions). That, however, is a very high level of isolation usually only used in scenarios like M&A, extreme security separation, or multi-tenant services. Within one tenant, capacities give you isolation of compute but not identity.

Network isolation: As a side note, Fabric is a cloud SaaS, but it does provide features like Managed Virtual Networks for certain services (e.g., Data Factory pipelines or Synapse integration). These features allow you to restrict outbound data access to approved networks. While not directly related to capacity, these network security options can be enabled per workspace or capacity environment to ensure data does not leak to the public internet. If your organization requires network isolation, investigate Fabric’s managed VNet and private link support for the relevant workloads.

In summary, use capacities to create performance and administrative isolation within your tenant. Assign sensitive or mission-critical workloads their own capacity so they are shielded from others’ activity. But remember that all capacities under a tenant still share the same identity and security context; manage access via roles and perhaps use separate tenants if absolute isolation is needed. Also use capacities for geo-separation if needed by creating them in the appropriate regions.

Monitoring and Metrics

Continuous monitoring of capacity health and usage is vital to ensure you are getting the most out of your capacity and to preempt any issues like throttling. Microsoft Fabric provides several tools and metrics for capacity and workload monitoring.

Capacity Utilization Metrics: The primary tool for capacity admins is the Fabric Capacity Metrics App. This is a Power BI app (or report template) provided by Microsoft that connects to your capacity’s telemetry. It offers dashboards showing CPU utilization (%) over time, broken down by workloads and item types. You can see, for example, how much CPU was used by Spark vs datasets vs queries, etc., and identify the top consuming activities. The app typically looks at recent usage (last 7 days or 30 days) in 30-second intervals. Key visuals include the Utilization chart (showing how close to capacity limit you are) and possibly specific charts for interactive vs background load. As an admin, you should regularly review these metrics. Spikes to 100% indicate that you’re using all available CUs and likely bursting beyond capacity (which could lead to throttling if sustained). If you notice consistent high usage, it may be time to optimize or scale up.

Throttling indicators: Monitoring helps reveal if throttling is occurring. In Fabric, throttling can manifest as delays or failures of operations when the capacity is overextended. The metrics app might show when throttling events happen (e.g., a drop in throughput or specific events count). Additionally, some signals of throttling include user reports of slowness, refresh jobs taking longer or failing with capacity errors, or explicit error messages. Fabric may return an HTTP 429 or 430 error for certain overloaded scenarios (for example, Spark jobs will give a specific error code 430 if capacity is at max concurrency). As admin, watch for these in logs or user feedback.

Real-time monitoring: For current activity, the Monitoring Hub in the Fabric portal provides a view of running and recent operations across the tenant. You can filter by capacity to see what queries, refreshes, Spark jobs, etc., are happening “now” on a capacity and their status. This is useful if the capacity is suddenly slow – you can quickly check if a particular job is consuming a lot of resources. The Monitoring Hub will show active operations and those queued or delayed due to capacity.

Administrator Monitoring Workspace: Microsoft has an Admin Monitoring workspace (sometimes automatically available in the tenant or downloadable) that contains some pre-built reports showing usage and adoption metrics. This might include things like the most active workspaces, most refreshed datasets, etc., across capacities. It’s more about usage analytics, but it can help identify which teams or projects are heavily using the capacity.

External monitoring (Log Analytics): For more advanced needs, you can connect Fabric (especially Power BI aspects) to Azure Log Analytics to capture certain logs, and also collect logs from the On-premises Data Gateway (if you use one). Log Analytics might collect events like dataset refresh timings, query durations, etc. While not giving direct CPU usage, these can help correlate if failures coincide with high load times.

Key metrics to watch:

CPU Utilization %: How close to max CUs you are over time. Spikes to 100% sustained for multiple minutes are a red flag.
Memory: Particularly for Power BI (dataset memory consumption) – if you load multiple large models, ensure they fit in memory. The capacity metrics app shows memory usage per dataset. If near the limits, consider larger capacity or offloading seldom-used models.
Active operations count: Many concurrent operations (queries, jobs) can hint at saturation. For instance, if dozens of queries run simultaneously, you might hit limits even if each is light.
Throttle events: If the metrics indicate delayed or dropped operations, or the Fabric admin portal shows notifications of throttling, that’s a clear indicator.

Notifications: A best practice is to set up alerts/notifications when capacity usage is high. The Fabric capacity settings allow you to configure email notifications if utilization exceeds a certain threshold for a certain time. For example, you might set a notification if CPU stays over 80% for more than 5 minutes. This proactive alert can prompt you to intervene (perhaps scale up capacity or investigate the cause) before users notice major slowdowns.

SLA and user experience: Ultimately, the reason we monitor is to ensure a good user experience. Identify patterns like time of day spikes (maybe every Monday 9AM there’s a huge hit) and mitigate them (maybe by rescheduling some background tasks). Also track the performance of key reports or jobs over time – if they start slowing down, it could be capacity pressure.

In summary, leverage the available telemetry: Fabric Capacity Metrics App for historical trends, Monitoring Hub for real-time oversight, and set up alerts. By keeping a close eye on capacity metrics, you can catch issues early (such as creeping utilization that approaches limits) and take action – whether optimization, scaling, or spreading out the workload – to maintain smooth operations.

Autoscale and Bursting: Managing Peak Loads

One of the novel features of Microsoft Fabric’s capacity model is how it handles peak demands through bursting and smoothing, effectively providing an “autoscaling” experience within the capacity. In this section, we explain these concepts and how to plan for bursts, as well as other autoscale options (such as manual scale-out and Spark autoscaling).

Bursting and smoothing: Fabric is designed to deliver fast performance, even for short spikes in workload, without requiring you to permanently allocate capacity for the peak. It does this via bursting, which allows the capacity to temporarily use more compute than its provisioned CU limit when needed. In other words, your capacity can “burst” above 100% utilization for a short period so that intensive operations finish quickly. This is complemented by smoothing, which is the system’s way of averaging out that burst usage over time so that you’re not immediately penalized. Smoothing spreads the accounting of the consumed CUs over a longer window (5 minutes for interactive operations, up to 24 hours for background operations).

Put simply: “Bursting lets you use more power than you purchased (within a specific timeframe), and smoothing makes sure this over-use is under control by spreading its impact over time.”. For example, if you have an F64 capacity but a particular query needs the equivalent of 128 CUs for a few seconds, Fabric will allow it – the job will complete faster thanks to bursting beyond 64 CUs. Then, the “excess” usage is smoothed into subsequent minutes (meaning for some time after, the capacity’s available headroom is reduced as it pays back that borrowed compute). This mechanism gives an effect similar to short-term autoscaling: the capacity behaves as if it scaled itself up to handle a bursty load, then returns to normal.

Throttling and limits: Bursting is not infinite – it’s constrained by how much future capacity you can borrow via smoothing. Fabric has a throttling policy that kicks in if bursts go on too long or too high. The system tolerates using up to 10 minutes of future capacity with no throttling (this is like a built-in grace period). If you consume more than 10 minutes worth of CUs in advance, Fabric will start applying gentle throttling: interactive operations get a small 20-second delay on submission when between 10 and 60 minutes of capacity overage is consumed. This is phase 1 throttling – users might notice a slight delay but operations still run. If the capacity has consumed over an hour of future CUs (meaning it’s been running well above its quota for a sustained period), it enters phase 2 where interactive operations are rejected outright (while background jobs can still start). Finally, if over 24 hours of capacity is consumed (an extreme overload), all operations (interactive and background) are rejected until usage recovers. The table below summarizes these stages:

Excess Usage (beyond capacity)	System Behavior	Impact
Up to 10 minutes of future capacity	Overage protection (bursting)	No throttling; operations run normally.
10 – 60 minutes of overuse	Interactive delay	New interactive operations (user queries, etc.) are delayed ~20s in queue. Background jobs still start immediately.
60 minutes – 24 hours of overuse	Interactive rejection	New interactive operations are rejected (fail immediately). Background jobs continue to run/queue.
Over 24 hours of overuse	Full rejection	All new operations are rejected (both interactive and background) until the capacity “catches up”.

Table: Throttling thresholds in Fabric’s capacity model. Fabric bursts up to 10 minutes with no penalty. Beyond that, throttling escalates in stages to protect the system.

For most well-managed capacities, you ideally operate in the safe zone (under 10 minutes overage) most of the time. Occasional dips into the 10-60 minute range are fine (users might not even notice the minor delays). If you ever hit the 60+ minute range, that’s a sign the capacity is under-provisioned for the workload or a particular job is too heavy – it should prompt optimization or scaling.

Autoscaling options: Unlike some cloud services that spin up new instances automatically, Fabric’s approach to autoscale is primarily through bursting (which is automatic but time-limited). However, you do have some manual or semi-automatic options:

Manual scale-up/down: Because F-SKUs are purchased via Azure, you can scale the capacity resource to a different SKU on the fly (e.g., from F64 to F128 for a day, then back down). If you have a reserved base (like an F64 reserved instance), you can temporarily scale up using pay-as-you-go to a larger SKU to handle a surge. For instance, an admin might anticipate heavy year-end processing and raise the capacity for that week. Microsoft will bill the overage at the hourly rate for the higher SKU during that period. This is a proactive autoscale you perform as needed. It’s not automatic, but you could script it or use Azure Automation/Logic Apps to trigger scaling based on metrics (there are solutions shared by the community to do exactly this).
Scale-out via additional capacity: Another approach if facing continual heavy load is to add another capacity and redistribute work. For example, if one capacity is maxed out daily, you could purchase a second capacity and move some workspaces to it (spreading the load). This isn’t “autoscale” per se (since it’s a static split unless you later combine them), but it’s a way to increase total resources. Because Fabric charges by capacity usage, two F64s cost the same as one F128 in pay-go terms, so cost isn’t a downside, and you gain isolation benefits.
Spark autoscaling within capacity: For Spark jobs, Fabric allows configuration of auto-scaling Spark pools (the number of executors can scale between a min and max) which optimizes resource usage for Spark jobs. This feature, however, operates within the capacity’s limits – it won’t exceed the total cores available unless bursting provides headroom. It simply means a Spark job will request more nodes if needed and free them when done, up to what the capacity can supply. There is also a preview feature called Spark Autoscale Billing which, if enabled, can offload Spark jobs to a completely separate serverless pool billed independently. That effectively bypasses the capacity for Spark (useful if you don’t want Spark competing with your capacity at all), but since it’s a preview and separate billing, most admins will primarily consider it if Spark is a huge part of their usage and they want a truly elastic experience.
Surge Protection: Microsoft introduced surge protection (currently in preview) for Fabric capacities, which is a setting that limits the total amount of background compute that can run when the capacity is under strain. If enabled, when interactive activities surge, the system will start rejecting background jobs preemptively so that interactive users aren’t as affected. This doesn’t give more capacity, but it triages usage to favor user-driven queries. It’s a protective throttle that helps the capacity recover faster from a spike. As an admin, if you have critical interactive workloads, you might turn this on to ensure responsiveness (at the cost of some background tasks failing and needing retry).

Clearing overuse: If your capacity does get into a heavily throttled state (e.g., many hours of overuse accumulated), one way to reset is to pause and resume the capacity. Pausing essentially stops the capacity (dropping all running tasks) and when resumed, it starts fresh with no prior overhang – but note, any un-smoothed burst usage gets immediately charged at that point. In effect, pausing is like paying off your debt instantly (since when the capacity is off, you can’t “pay back” with idle time, so you are billed for the overage). This is a drastic action (users will be disrupted by a pause), so it’s not a routine solution, but in extreme cases an admin might do this during off hours to clear a badly throttled capacity. Typically, optimizing the workload or scaling out is preferable to hitting this situation.

Design for bursts: Thanks to bursting, you don’t have to size your capacity for the absolute peak if it’s short-lived. Plan for the daily average or slightly above instead of the worst-case peak. Bursting will handle the occasional spike that is, say, 2-3× your normal usage for a few minutes. For example, if your daily work typically uses ~50 CUs but a big refresh at noon spikes to 150 CUs for 1 minute, an F64 capacity can still handle it by bursting (150/64 = ~2.3x for one minute, which smoothing can cover over the next several minutes). This saves cost because you avoid buying an F128 just for that one minute. The system’s smoothing will amortize that one minute over the next 5-10 minutes of capacity. However, if those spikes start lasting 30 minutes or happening every hour, then you do effectively need a larger capacity or you’ll degrade performance.

In conclusion, Fabric’s bursting and smoothing provide a built-in cushion for peaks, acting as an automatic short-term autoscale. As an admin, you should still keep an eye on how often and how deeply you burst (via metrics), and use true scaling strategies (manual scale-up or adding capacity) if needed for sustained load. Also take advantage of features like Spark pool autoscaling and surge protection to further tailor how your capacity handles variable workloads. The combination of these tools ensures you can maintain performance without over-provisioning for rare peaks, achieving a cost-effective balance.

Governance and Best Practices for Capacity Assignment

Managing capacities is not just about the hardware and metrics – it also involves governance: deciding how capacities are used within your organization, which workspaces go where, and enforcing policies to ensure efficient and secure usage. Here are best practices and guidelines for capacity and tenant admins when assigning and governing capacities.

1. Organize capacities by function, priority, or domain: It often makes sense to allocate different capacities for different purposes. For example, you might have a capacity dedicated to production BI content (high priority reports for executives) and another for self-service and development work. This way, heavy experimentation in the dev capacity cannot interfere with the polished dashboards in prod. Microsoft gives an example of using separate capacities so that executives’ reports live on their own capacity for guaranteed performance. Some common splits are:

By department or business unit: e.g., Finance has a capacity, Marketing has another – helpful if departments have very different usage patterns or need cost accountability.
By workload type: e.g., one capacity for all Power BI reports, another for data engineering pipelines and science projects. This can minimize cross-workload contention.
By environment: e.g., one for Production, one for Test/QA, one for Development. This aligns with software lifecycle management.
By geography: as discussed, capacities by region (EMEA vs Americas, etc.) if data residency or local performance is needed.

Having multiple capacities incurs overhead (you must monitor and manage each), so don’t over-segment without reason. But a thoughtful breakdown can improve both performance isolation and clarity in who “owns” the capacity usage.

2. Control workspace assignments: Not every workspace needs to be on a dedicated capacity. Some content can live in the shared (free) capacity if it doesn’t need premium features. As an admin, you should have a process for requesting capacity assignment. You might require that a workspace meet certain criteria (e.g., it’s for a project that requires larger dataset sizes or will have broad distribution) before assigning it to the premium capacity. This prevents trivial or personal projects from consuming expensive capacity resources. In Fabric, you can restrict the ability to assign a workspace to a capacity by using Capacity Contributor permissions. By default, it might allow the whole organization, but you can switch it to specific users or groups. A best practice is to designate a few power users or a governance board that can add workspaces to the capacity, rather than leaving it open to all.

Also consider using the “Preferred capacity for My workspace” setting carefully. Fabric allows you to route user personal workspaces (My Workspaces) to a capacity. While this could utilize capacity for personal analyses, it can also easily overwhelm a capacity if many users start doing heavy work in their My Workspace. Many organizations leave My Workspaces on shared capacity (which requires those users to have Pro licenses for any Power BI content in them) and only put team or app workspaces on the Fabric capacities.

3. Enforce capacity governance policies: There may be tenant-level settings you want to enforce or loosen per capacity. For instance, perhaps in a special capacity for data science you allow higher memory per dataset or allow custom Visualizations that are otherwise disabled. Use the delegated tenant settings feature to override settings on specific capacities as needed. Another example: you might want to disable certain preview features or enforce specific data export rules in a production capacity for security, while allowing them in a dev capacity.

4. Educate workspace owners: Ensure that those who have their workspace on a capacity know the “dos and don’ts.” They should understand that it’s a shared resource – e.g., a badly written query or an extremely large dataset refresh can impact others. Encourage best practices like scheduling heavy refreshes during off-peak times, enabling incremental refresh for large datasets (to reduce refresh load), optimizing DAX and SQL queries, and so on. Capacity admins can provide guidelines or even help review content that will reside on the capacity.

5. Leverage monitoring for governance: Keep track of which workspaces or projects are consuming the most capacity. If one workspace is monopolizing resources (you can see this in metrics, which identify top items), you might decide to move that workspace to its own capacity or address the inefficiencies. You can even implement an internal chargeback or at least show departments how much capacity they consumed to promote accountability.

6. Plan for lifecycle and scaling: Governance also means planning how to scale or reassign as needs change. If a particular capacity is consistently at high load due to growth of a project, have a strategy to either scale that capacity or redistribute workspaces. For example, you might spin up a new capacity and migrate some workspaces to it (admins can change a workspace’s capacity assignment easily in the portal). Microsoft notes you can “scale out” by moving workspaces to spread workload, which is essentially a governance action as much as a performance one. Also, when projects are retired or become inactive, don’t forget to remove their workspaces from capacity (or even delete them) so they don’t unknowingly consume resources with forgotten scheduled operations.

7. Security considerations: While capacity doesn’t enforce security, you can use capacity assignment as part of a trust boundary in some cases. For instance, if you have a workspace with highly sensitive data, you might decide it should run on a capacity that only that team’s admins control (to reduce even the perception of others possibly affecting it). Also, if needed, capacities can be tied to different encryption keys (Power BI allows BYOK for Premium capacities) – check if Fabric supports BYOK per capacity if that’s a requirement.

8. Documentation and communication: Treat your capacities as critical infrastructure. Document which workspaces are on which capacity, what the capacity sizes are, and any rules associated with them. Communicate to your user community about how to request space on a capacity, what the expectations are (like “if you are on the shared capacity, you get only Pro features; if you need Fabric features, request placement on an F SKU” or vice versa). Clear guidelines will reduce ad-hoc and potentially improper use of the capacities.

In essence, governing capacities is about balancing freedom and control. You want teams to benefit from the power of capacities, but with oversight to ensure no one abuses or unknowingly harms the shared environment. Using multiple capacities for natural boundaries (dept, env, workload) and controlling assignments are key techniques. As a best practice, start somewhat centralized (maybe one capacity for the whole org in Fabric’s early days) and then segment as you identify clear needs to do so (such as a particular group needing isolation or a certain region needing its own). This way you keep things manageable and only introduce complexity when justified.

Cost Optimization Strategies

Managing cost is a major part of capacity administration, since dedicated capacity represents a significant investment. Fortunately, Microsoft Fabric offers several ways to optimize costs while meeting performance needs. Here are strategies to consider:

1. Use Pay-as-you-go wisely (pause when idle): F-SKUs on Azure are billed on a per-second basis (with a 1-minute minimum) whenever the capacity is running. This means if you don’t need the capacity 24/7, you can pause it to stop charges. For example, if your analytics workloads are mostly 9am-5pm on weekdays, you could script the capacity to pause at night and on weekends. You only pay for the hours it’s actually on. An F8 capacity left running 24/7 costs roughly $1,200 per month, but if you paused it outside of an 8-hour workday, the cost could drop to a third of that (plus no charge on weekends). Always assess your usage patterns – some organizations run critical reports around the clock, but many could save by pausing during predictable downtime. The Fabric admin portal allows pause/resume, and Azure Automation or Logic Apps can schedule it. Just ensure no important refresh or user query is expected during the paused window.

2. Right-size the SKU (avoid over-provisioning): It might be tempting to get a very large capacity “just in case,” but unused capacity is money wasted. Thanks to bursting, you can usually size for slightly above your average load, not the absolute peak. Monitor utilization and if you see your capacity is consistently under 30% utilized, that’s a sign you might scale down to a smaller SKU and save costs (unless you’re expecting growth or deliberately keeping headroom). The granular SKU options (F2, F4, F8, etc.) let you fine-tune. For instance, if F64 is too much and F32 occasionally struggles, an F48 would be ideal – while not an official SKU, you could achieve an “F48” by using reserved capacity units (more on that below) to split or by alternating scheduling (though that’s complex). Generally, stick to SKUs but choose the lowest one that meets requirements with maybe some buffer.

3. Reserved capacity (annual commitment) for lower rates: Pay-as-you-go is flexible but at a higher unit price. Microsoft has indicated and demonstrated that reserved instance pricing for F-SKUs brings significant cost savings (on the order of ~40% cheaper for a 1-year commitment). For example, an F8 costs around €1188/month pay-go, but ~€706/month with a 1-year reservation. If you know you will need a capacity continuously for a long period, consider switching to a reserved model to reduce cost. Importantly, when you reserve, you are reserving a certain number of capacity units, not locking into a specific SKU size. So you could reserve 64 CUs (the equivalent of F64) but choose to run two F32 capacities or one F64 – as long as total CUs in use ≤64, it’s covered by your reservation. This allows flexibility in how you deploy those reserved resources (multiple smaller capacities vs one big one). Also, with reservation, you can still scale up beyond your reserved amount and just pay the excess at pay-go rates. For instance, you reserve F8 (8 CUs) but occasionally scale to F16 for a day – you’d pay the 8 extra CUs at pay-go just for that time. This hybrid approach ensures you get savings on your baseline usage and only pay premium for surges.

4. Monitor and optimize workload costs: Cost optimization can also mean making workloads more efficient so they consume fewer CUs. Encourage good practices like using smaller dataset refresh intervals (don’t over-refresh), turning off refresh for datasets not in use, archiving or deleting old large datasets, using incremental refresh, etc. For Spark, make sure jobs are not running with unnecessarily large clusters idle (auto-terminate them when done, which Fabric usually handles). If using the serverless Spark billing preview, weigh its cost (it might be cheaper if your Spark usage is sporadic, versus holding capacity for it).

5. Mix license models for end-users: Not everyone in your organization needs to use the capacity. You can have a hybrid of Premium capacity and Premium Per User. For example, perhaps you buy a small capacity for critical shared content, but for many other smaller projects, you let teams use PPU licenses on the shared (free) capacity. This way you’re not putting everything on the capacity. As mentioned, PPU is cost effective up to a point (if many users need it, capacity becomes cheaper). You might say: content intended for large audiences goes on capacity (so free users can consume it), whereas content for small teams stays with PPU. Such a strategy can yield substantial savings. It also provides a path for scaling: as a particular report or solution becomes widely adopted, you can move it from the PPU world to the capacity.

6. Utilize lower-tier SKUs and scale out: If cost is a concern and ultra-high performance isn’t required, you could opt for multiple smaller capacities instead of one large one. For example, two F32 capacities might be cheaper in some scenarios than one F64 if you can pause them independently or if you got a deal on smaller ones. That said, Microsoft’s pricing is generally linear with CUs, so two F32 should cost roughly the same as one F64 in pay-go. The advantage would be if you can pause one of them for periods when not needed. Be mindful though: capacities below F64 won’t allow free user report viewing, which could force Pro licenses and shift cost elsewhere.

7. Keep an eye on OneLake storage costs: Fabric capacity covers compute. Storage in OneLake is billed separately (at a certain rate per GB per month). Microsoft’s current OneLake storage cost (~$0.022 per GB/month in one region example) is relatively low, but if you are landing terabytes of data, it will add up. It usually won’t overshadow compute costs, but from a governance perspective, try to clean up unused data (e.g., old versioned data, intermediate files) to avoid an ever-growing storage bill. Also, data egress (moving data out of the region) could have costs, but if staying within Fabric likely not an issue.

8. Periodically review usage and adjust: Cost optimization is not a one-time set-and-forget. Each quarter or so, review your capacity’s utilization and cost. Are you paying for a large capacity that’s mostly idle? Scale it down or share it with more workloads (to get more value out of it). Conversely, if you’re consistently hitting the limits and had to enable frequent autoscale (pay-go overages), maybe committing to a higher base SKU could be more economical. Remember, if you went with a reserved instance, you already paid upfront – ensure you are using what you paid for. If you reserved an F64 but only ever use 30 CUs, you might repurpose some of those CUs to another capacity (e.g., split into F32 + F32) so that more projects can utilize the prepaid capacity.

9. Leverage free/trial features: Make full use of the 60-day Fabric trial capacity before purchasing. It’s free compute time – treat it as such to test heavy scenarios and get sizing estimates without incurring cost. Also, if certain features remain free or included (like some amount of AI functions or some small dataset sizes not counting, etc.), be aware and use them.

10. Watch for Microsoft licensing changes or offers: Microsoft’s cloud services pricing can evolve. For instance, the deprecation of P-SKUs might come with incentives or migration discounts to F-SKUs. There could be offers for multi-year commitments. Stay informed via the Fabric blog or your Microsoft rep for any cost-saving opportunities.

In practice, many organizations find that moving to Fabric F-SKUs saved money compared to the old P-SKUs, if they manage the capacity actively (pausing when not needed, etc.). One user noted Fabric capacity is “significantly cheaper than Power BI Premium capacity” if you utilize the flexible billing. But this is only true if you take advantage of the flexibility – otherwise pay-go could actually cost more than an annual P-SKU if left running 24/7 at high rate. Thus, the onus is on the admin to optimize runtime.

By combining these strategies – dynamic scaling, reserved discounts, license mixing, and efficient usage – you can achieve an optimal balance of performance and cost. The result should be that your organization pays for exactly the level of analytics power it needs, and not a penny more, while still delivering a good user experience.

Real-World Use Cases and Scenario-Based Recommendations

To tie everything together, let’s consider a few typical scenarios and how one might approach capacity management in each:

Scenario 1: Small Business or Team Starting with Fabric
A 50-person company with a small data team is adopting Fabric primarily for Power BI reports and a few dataflows.
Approach: Begin with the Fabric Trial (F64) to pilot your content. Likely an F64 provides ample power for 50 users. During the trial, monitor usage – it might show that even an F32 would suffice if usage is light. Since 50 users is below the ~250 threshold, one option after trial is to use Premium Per User (PPU) licenses instead of buying capacity (each power user gets PPU so they have premium features, and content runs on shared capacity). This could be cheaper initially. However, if the plan is to roll out company-wide reports that everyone consumes, a capacity is beneficial so that even free users can view. In that case, consider purchasing a small F SKU on pay-go, like F32 or F64 depending on trial results. Use pay-as-you-go and pause it overnight to save money. With an F32 (which is below Premium threshold), remember that viewers will need Pro licenses – if you want truly all 50 users (including some without Pro) to access, go with at least F64. Given cost, you might decide on PPU for all 50 instead of F64, which could be more economical until the user base or needs grow. Keep governance light but educate the small team on not doing extremely heavy tasks that might require bigger capacity. Likely one capacity is enough; no need to split by departments since the org is small.

Scenario 2: Mid-size Enterprise focusing on Enterprise BI
A 1000-person company has a BI Center of Excellence that will use Fabric primarily for Power BI (reports & datasets), replacing a P1 Premium. Minimal use of Spark or advanced workloads initially.
Approach: They likely need a capacity that allows free user consumption of reports – so F64 or larger. Given they had a P1, F64 is the equivalent. Use F64 reserved for a year to save about 40% cost over monthly, since they know they need it continuously. Monitor usage: if adoption grows (more reports, bigger datasets), they should watch if utilization nears limits. Perhaps they’ll consider scaling to F128 in the future. In terms of governance, set up one primary capacity for Production BI content. Perhaps also spin up a smaller F32 trial or dev capacity for development and testing of reports, so heavy model refreshes in dev don’t impact prod. The dev capacity could even be paused except during working hours to save cost. For user licensing, since content on F64 can be viewed by free users, they can give all consumers just Fabric Free licenses. Only content creators (maybe ~50 BI developers) need Pro licenses. Enforce that only the BI team can assign workspaces to the production capacity (so random workspaces don’t sneak in). Use the metrics app to ensure no one workspace is hogging resources; if a particular department’s content is too heavy, maybe allocate them a dedicated capacity (e.g. buy another F64 for that department if justified).

Scenario 3: Data Science and Engineering Focus
A tech company with 200 data scientists and engineers plans to use Fabric for big data processing, machine learning, and some reporting. They expect heavy Spark usage and big warehouses; less focus on broad report consumption.
Approach: Since their usage is compute-heavy but not necessarily thousands of report viewers, they might prioritize raw power over Premium distribution. Possibly they could start with an F128 or F256, even if many of their users have Pro licenses anyway (so free-viewer capability isn’t the concern, capacity for compute is). They might split capacities by function: one “AI/Engineering” capacity and one “BI Reporting” capacity. The AI one might be large (to handle Spark clusters, etc.), and the BI one can be smaller if report usage is limited to internal teams with Pro. If cost is a concern, they could try an alternative: keep one moderate capacity and use Spark autoscale billing (serverless Spark) for big ML jobs so that those jobs don’t eat capacity – essentially offloading big ML to Azure Databricks or Spark outside of Fabric. But if they want everything in Fabric, an ample capacity with bursting will handle a lot. They should use Spark pool auto-scaling and perhaps set conservative defaults to avoid any single user grabbing too many cores. Monitor concurrency – if Spark jobs queue often, maybe increase capacity or encourage using pipeline scheduling to queue non-urgent jobs. For cost, they might run the capacity 24/7 if pipelines run round the clock. Still, if nights are quiet, pause then. Because these users are technical, requiring them to have Pro or PPU is fine; they may not need to enable free user access at all. If they do produce some dashboards for a wider audience, those could be on a smaller separate capacity (or they give those viewers PPU licenses). Overall, ensure the capacity is in a region close to the data lake for performance, and consider enabling private networking since they likely deal with secure data.

Scenario 4: Large Enterprise, Multiple Departments
A global enterprise with several divisions, all adopting Fabric for different projects – some heavy BI, some data warehousing, some real-time analytics.
Approach: This calls for a multi-capacity strategy. They might purchase a pool of capacity units (e.g., 500 CUs reserved) and then split into multiple capacities: e.g., an F128 for Division A, F128 for Division B, F64 for Division C, etc., up to the 500 CU total. This way each division can manage its own without impacting others, and the company benefits from a bulk reserved discount across all. They should designate a capacity admin for each to manage assignments. They should also be mindful of region – maybe an F128 in EU for the European teams, another in US for American teams. Use naming conventions for capacities (e.g., “Fabric_CAP_EU_Prod”, “Fabric_CAP_US_Marketing”). They might also keep one smaller capacity as a “sandbox” environment where any employee can try Fabric (kind of like a community capacity) – that one might be monitored and reset often. Cost-wise, they will want reserved instances for such scale and possibly 3-year commitments if confident (those might bring even greater discounts in the future). Regular reviews might reveal one division not using their full capacity – they could decide to resize that down and reallocate CUs to another that needs more (taking advantage of the flexibility that reserved CUs are not tied to one capacity shape). The governance here is crucial: a central team should set overall policies (like what content must be where, and ensure compliance and security are uniform), while delegating day-to-day to local admins.

Scenario 5: External Facing Embedded Analytics
A software vendor wants to use Fabric to embed Power BI reports in their SaaS product for their external customers.
Approach: This scenario historically used A-SKUs or EM-SKUs. With Fabric, they have options: they could use an F-SKU which also supports embedding, or stick with A-SKU if they don’t need Fabric features. If they only care about embedding reports and want to minimize cost, an A4 (equivalent to F64) might be slightly cheaper if they don’t need the rest of Fabric (plus A4 can be paused too). However, if they think of using Fabric’s dataflows or other features to prep data, going with an F-SKU might be more future-proof. Assuming they choose an F-SKU, they likely need at least F8 or F16 to start (depending on user load) because EM/A SKUs start at that scale for embedding anyway. They can scale as their customer base grows. They will treat this capacity as dedicated to their application. They should isolate it from internal corporate capacities. Cost optimization here is to scale with demand: e.g., scale up during business hours if that’s when customers use the app, and scale down at night or pause if no one accesses at 2 AM. But since external users might be worldwide, they might run it constantly and possibly consider multi-geo capacities to serve different regions for latency. They must also handle licensing properly: external users viewing embedded content do not need Pro licenses; the capacity covers that. So the capacity cost is directly related to usage the vendor expects (if many concurrent external users, need higher SKU). Monitoring usage patterns (peak concurrent users driving CPU) will guide scaling and cost.

These scenarios highlight that capacity management is flexible – you adapt the strategy to your specific needs and usage patterns. There is no one-size-fits-all, but the principles remain consistent: use data to make decisions, isolate where necessary, and take advantage of Fabric’s elasticity to optimize both performance and cost.

Conclusion

Microsoft Fabric capacities are a powerful enabler for organizational analytics at scale. By understanding the different capacity types, how to license and size them, and how Fabric allocates resources across workloads, administrators can ensure their users get a fast, seamless experience. We covered how to plan capacity size (using tools and trial runs), how to manage mixed workloads on a shared capacity, and how Fabric’s unique bursting and smoothing capabilities help handle peaks without constant overspending. We also delved into monitoring techniques to keep an eye on capacity health and discussed governance practices to allocate capacity resources wisely among teams and projects. Finally, we explored ways to optimize costs – from pausing unused capacity to leveraging reserved pricing and choosing the right licensing mix.

In essence, effective capacity management in Fabric requires a balance of technical tuning and organizational policy. Administrators should collaborate with business users and developers alike: optimizing queries and models (to reduce load), scheduling workloads smartly, and scaling infrastructure when needed. With careful management, a Fabric capacity can serve a wide array of analytics needs while maintaining strong performance and staying within budget. We encourage new capacity admins to start small, iterate, and use the rich monitoring data available – over time, you will develop an intuition for your organization’s usage patterns and how to adjust capacity to match. Microsoft Fabric’s capacities, when well-managed, will provide a robust, flexible foundation for your data-driven enterprise, allowing you to unlock insights without worrying that resources will be the bottleneck. Happy capacity managing!

Sources:

Microsoft Fabric documentation – Concepts and Licenses, Microsoft Learn
Microsoft Fabric documentation – Plan your capacity size, Microsoft Learn
Microsoft Fabric documentation – Evaluate and optimize your capacity, Microsoft Learn
Microsoft Fabric documentation – Capacity throttling policy, Microsoft Learn
Data – Marc blog – Power BI and Fabric capacities: Cost structure, June 2024
Microsoft Fabric documentation – Fabric trial license, Microsoft Learn
Microsoft Fabric documentation – Capacity settings (admin), Microsoft Learn
Dataroots.io – Fabric pricing, billing, and autoscaling, 2023
Medium – Adrian B. – Fabric Capacity Management 101, 2023
Microsoft Fabric documentation – Spark concurrency limits, Microsoft Learn
Microsoft Fabric community – Fabric trial capacity limits, 2023 (trial is 60 days)
Microsoft Fabric documentation – Throttling stages, Microsoft Learn

Download PDF copy – Microsoft Fabric Capacity Management_ A Comprehensive Guide for Administrators.pdf

Why Notebook Snapshots in Microsoft Fabric Are a Debugging Gamechanger—No, Seriously!

If you’ve ever experienced the sheer agony of debugging notebooks—those chaotic, tangled webs of code, markdown, and occasional tears—you’re about to understand exactly why Notebook Snapshots in Microsoft Fabric aren’t just helpful, they’re borderline miraculous. Imagine the emotional rollercoaster of meticulously crafting a beautifully intricate notebook, only to watch it crumble into cryptic errors and obscure stack traces with no clear clue of what went wrong, when, or how. Sound familiar? Welcome to notebook life.

But fear not, weary debugger. Microsoft Fabric is finally here to rescue your productivity—and possibly your sanity—through the absolute genius of Notebook Snapshots.

Let’s Set the Scene: The Notebook Debugging Nightmare

To fully appreciate the brilliance behind Notebook Snapshots, let’s first vividly recall the horrors of debugging notebooks without them.

Step 1: You enthusiastically write and run a series of notebook cells. Everything looks fine—until, mysteriously, it doesn’t.

Step 2: A wild error appears! Frantically, you scroll back up, scratching your head and questioning your life choices. Was it Cell 17, or perhaps Cell 43? Who knows at this point?

Step 3: You begin the tiresome quest of restarting the kernel, selectively re-running cells, attempting to recreate that perfect storm of chaos that birthed the bug. Hours pass, frustration mounts, coffee runs out—disaster ensues.

Sound familiar? Of course it does, we’ve all been there.

Enter Notebook Snapshots: The Hero We Didn’t Know We Needed

Notebook Snapshots in Microsoft Fabric aren’t simply another fancy “nice-to-have” feature; they’re an absolute lifeline for notebook developers. Essentially, Notebook Snapshots capture a complete state of your notebook at a specific point in time—code, outputs, errors, and all. They let you replay and meticulously analyze each step, preserving context like never before.

Think of them as your notebook’s personal rewind button: a time-traveling companion ready to transport you back to that critical moment when everything broke, but your optimism was still intact.

But Why Exactly is This Such a Gamechanger?

Great question—let’s get granular.

1. Precise State Preservation: Say Goodbye to Guesswork

The magic of Notebook Snapshots is in their precision. No more wondering which cell went rogue. Snapshots save the exact state of your notebook’s cells, outputs, variables, and even intermediate data transformations. This precision ensures that you can literally “rewind” and step through execution like you’re binging your favorite Netflix series. Missed something crucial? No worries, just rewind.

Benefit: You know exactly what the state was before disaster struck. Debugging transforms from vague guesswork to precise, surgical analysis. You’re no longer stumbling in the dark—you’re debugging in 4K clarity.

2. Faster Issue Replication: Less Coffee, More Debugging

Remember spending hours trying to reproduce obscure bugs that vanished into thin air the moment someone else was watching? Notebook Snapshots eliminate that drama. They capture the bug in action, making it infinitely easier to replicate, analyze, and ultimately squash.

Benefit: Debugging time shrinks dramatically. Your colleagues are impressed, your boss is delighted, and your coffee machine finally gets a break.

3. Collaboration Boost: Debug Together, Thrive Together

Notebook Snapshots enable teams to share exact notebook states effortlessly. Imagine sending your team a link that perfectly encapsulates your debugging context. No lengthy explanations needed, no screenshots required, and definitely no more awkward Slack messages like, “Ummm… it was working on my machine?”

Benefit: Everyone stays synchronized. Collective debugging becomes simple, fast, and—dare we say it—pleasant.

4. Historical Clarity: The Gift of Hindsight

Snapshots build a rich debugging history. You can examine multiple snapshots over time, comparing exactly how your notebook evolved and where problems emerged. You’re no longer relying on vague memory or frantic notebook archaeology.

Benefit: Clearer, smarter decision-making. You become a debugging detective with an archive of evidence at your fingertips.

5. Confidence Boosting: Fearless Experimentation

Knowing you have snapshots lets you innovate fearlessly. Go ahead—experiment wildly! Change parameters, test edge-cases, break things on purpose (just for fun)—because you can always rewind to a known-good state instantly.

Benefit: Debugging stops being intimidating. It becomes fun, bold, and explorative.

A Practical Example: Notebook Snapshots in Action

Imagine you’re exploring a complex data pipeline in a notebook:

You load and transform data.
You run a model.
Suddenly, disaster: a cryptic Python exception mocks you cruelly.

Normally, you’d have to painstakingly retrace your steps. With Microsoft Fabric Notebook Snapshots, the workflow is much simpler:

Instantly snapshot the notebook at the exact moment the error occurs.
Replay each cell execution leading to the error.
Examine exactly how data changed between steps—no guessing, just facts.
Swiftly isolate the issue, correct the bug, and move on with your life.

Just like that, you’ve gone from notebook-induced stress to complete debugging Zen.

A Bit of Sarcastic Humor for Good Measure

Honestly, if you’re still debugging notebooks without snapshots, it’s a bit like insisting on traveling by horse when teleportation exists. Sure, horses are charmingly nostalgic—but teleportation (aka Notebook Snapshots) is clearly superior, faster, and way less messy.

Or, put differently: debugging notebooks without snapshots in 2025 is like choosing VHS tapes over streaming. Sure, the retro vibes might be fun once—but let’s be honest, who wants to rewind tapes manually when you can simply click and replay?

Wrapping It All Up: Notebooks Just Got a Whole Lot Easier

In short, Notebook Snapshots in Microsoft Fabric aren’t merely a convenience—they fundamentally redefine how we approach notebook debugging. They shift the entire paradigm from guesswork and frustration toward clarity, precision, and confident experimentation.

Notebook developers everywhere can finally rejoice: your debugging nightmares are officially canceled.

Thanks, Microsoft Fabric—you’re genuinely a gamechanger.

This post was written with help from ChatGPT

The core idea: stop guessing, start diagnosing

Why sparkwise exists (and the problems it explicitly targets)

What you get: a feature tour that maps to real-world Spark pain

1) Automated diagnostics (the fast “what’s wrong?” pass)

2) Comprehensive profiling (the “what actually happened?” pass)

3) Advanced performance analysis (built on real metrics)

4) Advanced skew detection (because skew kills Spark)

5) SQL query plan analysis (spotting anti-patterns early)

6) Storage optimization suite (new in v1.4.0+)

7) Interactive configuration assistant (the “what does this do?” superpower)

Quick start: the 3 fastest ways to get value

Install

1) Run a full diagnostic on your current session

2) Ask about a specific Spark/Fabric config

3) Profile your run (and pinpoint bottlenecks)

CLI workflows (especially useful for storage optimization)

A realistic “first hour” workflow I’d recommend

Closing: why this matters for Fabric teams

Share this:

Buck Rogers wasn’t a show I watched. It was a place I went.

Princess Ardala, obviously

Because December is already full of ghosts.

What I’m doing with it

Share this:

What Is the Fabric “Job Doctor”?

Part 1: Getting Spark Telemetry Out of Fabric

1. Configure the Fabric Apache Spark Diagnostic Emitter

2. Shape of the Raw Logs (and Why You’ll Normalize Them)

3. Capturing Query Plans in Fabric (Optional, but Powerful)

4. Capture Spark Config for Each Run

Part 2: Loading and Normalizing Spark Logs in a Fabric Lakehouse

Aggregating Stage Metrics in Fabric

Part 3: Adding a Rule Engine Inside Fabric

Part 4: Bringing in the LLM — Turning Metrics into Diagnosis

1. Prepare the Prompt Payload

2. Call an Azure AI Model from Fabric Spark

Part 5: What the Job Doctor’s Output Looks Like in Fabric

🔎 Issue 1: Skewed Stage 4 (skew ratio 12.3)

📦 Issue 2: Large Shuffle in Stage 2 (~19.7 GB)

💾 Issue 3: Spill to Disk (~1.8 GB) in Stage 3

Part 6: Stitching It All Together in Fabric

1. Telemetry Ingestion (Environment / Emitter)

2. Normalization Job (Spark / Data Pipeline)

3. AI Diagnosis Job (Spark + Azure AI Models)

4. User Experience

Why a Fabric-Specific Job Doctor Is Worth Building

Share this:

The Bright Edge: Why Passion Is Powerful

The Dark Edge: When Passion Starts Cutting You

The Subtle Trap: Passion as Justification

Holding the Sword by the Handle: Healthier Ways to Be Passionate

The Goal Isn’t to Be Less Passionate

Share this:

Introduction: Grading Under the Microscope

The Human Element: Trusted but Inconsistent

Published Standards, Unpredictable Results

A Hobby Tailor-Made for AI

Why the Slow Adoption? (Ulterior Motives?)

Upstart Efforts: Tech Takes on the Titans

PSA: Bigger Than Ever, But Is It Better?

The Case for Change: Consistency, Transparency, Trust

The ChatGPT Experiment: DIY Grading with AI

Embracing the Future: It’s Time for Change

Share this:

1. Setting the Table

2. Horizon #1 (≈ 2025-2030): The Co-Pilot Decade

3. Horizon #2 (≈ 2050): Partners in Intelligence

4. Horizon #3 (≈ 2125): Symbiosis or Overreach?

5. Managing the Transition

6. Closing Thoughts

Share this:

It’s Not Just Archaic—It’s Distracting

“But It’s Just a Synonym for While…”

The Data Doesn’t Lie

When You Use “Whilst” in American Writing, Here’s What Happens:

So Why Are People Still Using It?

Here’s a Radical Idea: Use “While”

Final Verdict

Share this:

Humble Beginnings and Early Promise