Microsoft recently introduced OneLake, a part of Microsoft Fabric designed to accelerate data potential for the era of AI. OneLake provides a unified, intelligent data foundation for all analytic workloads, integrating Power BI, Data Factory, and the next generation of Synapse. This gives customers a high-performing, easy-to-manage modern analytics solution.
OneLake: The OneDrive for All Your Data
OneLake provides a single data lake for your entire organization. Every Fabric tenant has exactly one OneLake: never two, never zero. There is no infrastructure to set up or manage. The concept of a tenant is a unique benefit of a SaaS service: it allows Microsoft to automatically provide a single management and governance boundary for the entire organization, ultimately under the control of a tenant admin.
Breaking down Data Silos with OneLake
OneLake aims to provide a data lake as a service, without you needing to build it yourself. Workspaces let different business groups work independently, without going through a central gatekeeper, while still contributing to the same data lake. Each workspace can have its own administrator, access control, region, and capacity for billing.
OneLake: Spanning the Globe
OneLake addresses data residency by spanning the globe. Different workspaces can reside in different regions, which means that any data stored in those workspaces also resides in those regions. Under the covers, OneLake is built on Azure Data Lake Storage Gen2 and uses multiple storage accounts in different regions, but it virtualizes them into one logical lake.
OneLake: Open Data Lake
OneLake is not just a Fabric data lake or a Microsoft data lake; it is an open data lake. In addition to being built on ADLS Gen2, OneLake supports the same ADLS Gen2 APIs and SDKs, making it compatible with existing ADLS applications, including Azure Databricks and Azure HDInsight.
OneLake: One Copy
One Copy aims to get the most value possible out of a single copy of data. It allows data to be virtualized into a single data product without data movement, data duplication, or a change in the ownership of the data.
OneLake: One Security
One Security is a feature in active development that aims to let you secure the data once and use it anywhere. One Security will bring a shared, universal security model that you define in OneLake, with the security definitions living alongside the data itself. This is an important detail: security lives with the data rather than downstream in the serving or presentation layers.
OneLake Data Hub
The OneLake Data Hub is the central location within Fabric to discover, manage, and reuse data. It serves all users from data engineer to business user. Data can easily be discovered by its domain, for example, Finance, HR, or Sales, so users find what actually matters to them.
In conclusion, OneLake is a game-changer in the world of data management and analytics. It provides a unified, intelligent data foundation that breaks down data silos, enabling organizations to harness the full potential of their data in the era of AI.
This blog post was created with help from ChatGPT Pro.
Microsoft Fabric is a powerful tool for data engineers, enabling them to build out a lakehouse architecture for their organizational data. In this blog post, we will walk you through the key experiences that Microsoft Fabric offers data engineers.
Creating a Lakehouse
A lakehouse is a new experience that combines the power of a data lake and a data warehouse. It serves as a central repository for all Fabric data. To create a lakehouse, you start by creating a new lakehouse artifact and giving it a name. Once created, you land in the empty Lakehouse Explorer.
Importing Data into the Lakehouse
There are several ways to bring data into the lakehouse. You can upload files and folders from your local machine, use dataflows (a low-code tool with hundreds of connectors), or leverage the pipeline copy activity to bring in petabytes of data at scale. Tabular data landed in the lakehouse is stored as Delta tables, which are created automatically with no additional effort. You can easily explore the tables, see their schema, and even view the underlying files.
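For the code-first path, here is a minimal PySpark sketch of landing a file as a Delta table from a Fabric notebook. It assumes a default lakehouse is attached to the notebook, and the file path and table name are hypothetical placeholders.

```python
# Minimal sketch: land a raw CSV as a Delta table from a Fabric notebook.
# Assumes a default lakehouse is attached; path and table name are placeholders.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("Files/raw/campaigns.csv")
)

# Saving in Delta format makes the table appear in the Lakehouse Explorer.
df.write.format("delta").mode("overwrite").saveAsTable("campaigns")
```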
Adding Unstructured Data
In addition to structured data, you might want to add some unstructured customer reviews to accompany your campaign data. If this data already exists in storage, you can simply point to it with no data movement necessary. This is done by adding a new shortcut, which allows you to create a virtual table and virtual files inside your lakehouse. Shortcuts enable you to select from a variety of sources, including lakehouses and warehouses in Fabric, but also external storage like ADLS Gen2 and even Amazon S3.
Leveraging the Data
Once all your data is ready in the lakehouse, there are many ways to use it. As a data engineer or data scientist, you can open the lakehouse in a notebook and leverage Spark to continue transforming the data or build a machine learning model. As a SQL professional, you can navigate to the SQL endpoint of the lakehouse, where you can write SQL queries and create views and functions, all on top of the same Delta tables. As a business analyst, you can navigate to the built-in modeling view and start developing your BI data model directly in the same SQL endpoint experience.
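For instance, the Spark route might look like the following sketch, which queries the same Delta table from a notebook session; the table and column names are hypothetical placeholders.

```python
# Query the Delta table with Spark SQL from a notebook session.
# Table and column names are hypothetical placeholders.
top_campaigns = spark.sql("""
    SELECT campaign_id, SUM(revenue) AS total_revenue
    FROM campaigns
    GROUP BY campaign_id
    ORDER BY total_revenue DESC
    LIMIT 10
""")
top_campaigns.show()
```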
Configuring your Spark Environment
As an administrator, you can configure the Spark environment for your data engineers. This is done in the capacity admin portal, where you can access the Spark compute settings for data engineers and data scientists. You can set a default runtime and default Spark properties, and also turn on the ability for workspace admins to configure their own custom Spark pools.
Collaborative Data Development
Microsoft Fabric also provides a rich developer experience, enabling users to collaborate easily, work with their lakehouse data, and leverage the power of Spark. You can view your colleagues’ code updates in real time, install ML libraries for your project, and use the built-in charting capabilities to explore your data. The notebook has a built-in resource folder which makes it easy to store scripts or other code files you might need for the project.
In conclusion, Microsoft Fabric provides a frictionless experience for data engineers building out their enterprise data lakehouse and can easily democratize this data for all users in an organization. It’s a powerful tool that combines the power of a data lake and a data warehouse, providing a comprehensive solution for data engineering tasks.
This blog post was created with help from ChatGPT Pro.
Spark Compute is a key component of Microsoft Fabric, the end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Spark Compute enables data engineering and data science scenarios on a fully managed Spark compute platform that delivers unparalleled speed and efficiency.
What is Spark Compute?
Spark Compute is a way of telling Spark what kind of resources you need for your data analysis tasks. You can give your Spark pool a name, and choose how many and how big the nodes (the machines that do the work) are. You can also tell Spark how to adjust the number of nodes depending on how much work you have.
Spark Compute operates on OneLake, the data lake service that powers Microsoft Fabric. OneLake provides a single place to store and access all your data, whether it is structured, semi-structured, or unstructured. OneLake also supports data from other sources, such as Amazon S3 and (soon) Google Cloud Platform.
Spark Compute supports both batch and streaming scenarios, and integrates with various tools and frameworks, such as Azure OpenAI Service, Azure Machine Learning, Databricks, Delta Lake, and more. You can use Spark Compute to perform data ingestion, transformation, exploration, analysis, machine learning, and AI tasks on your data.
How to use Spark Compute?
There are two ways to use Spark Compute in Microsoft Fabric: starter pools and custom pools.
Starter pools
Starter pools are a fast and easy way to use Spark on the Microsoft Fabric platform, with sessions available within seconds. You can start running Spark sessions right away instead of waiting for Spark to set up the nodes for you. This helps you do more with your data and get insights quicker.
Starter pools have Spark clusters that are always on and ready for your requests. They use medium nodes that dynamically scale up based on your Spark job needs. Starter pools also have default settings that let you install libraries quickly without slowing down the session start time.
You only pay for starter pools when you are using Spark sessions to run queries. You don’t pay for the time when Spark is keeping the nodes ready for you.
Custom pools
A custom pool is a way of creating a tailored Spark pool according to your specific data engineering and data science requirements. You can customize various aspects of your custom pool, such as:
Node size: You can choose from different node sizes that offer different combinations of CPU cores, memory, and storage.
Node count: You can specify the minimum and maximum number of nodes you want in your custom pool.
Autoscale: You can enable autoscale to let Spark automatically adjust the number of nodes based on the workload demand.
Dynamic allocation: You can enable dynamic allocation to let Spark dynamically allocate executors (the processes that run tasks) based on the workload demand.
Libraries: You can install libraries from various sources, such as Maven, PyPI, CRAN, or your workspace.
Properties: You can configure custom properties for your custom pool, such as spark.executor.memory or spark.sql.shuffle.partitions (see the sketch after this list).
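As a small sketch of what property tuning can look like at the session level (pool-level defaults are configured in the Fabric UI), the following assumes a running Spark session in a Fabric notebook. Note that some properties can only take effect before the session starts.

```python
# Inspect the current shuffle partition count for this session.
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Override it for this session only. Runtime-mutable SQL properties like
# this one can be changed on the fly, while properties such as
# spark.executor.memory must be set before the session starts.
spark.conf.set("spark.sql.shuffle.partitions", "64")
```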
Creating a custom pool is free; you only pay when you run a Spark job on the pool. If you don’t use your custom pool for 2 minutes after your job is done, Spark will automatically delete it. This is called the “time to live” property, and you can change it if you want.
If you are a workspace admin, you can also create default custom pools for your workspace, and make them the default option for other users. This way, you can save time and avoid setting up a new custom pool every time you run a notebook or a Spark job.
Custom pools take about 3 minutes to start, because Spark has to get the nodes from Azure.
Conclusion
Spark Compute is a powerful and flexible way of using Spark on Microsoft Fabric. It enables you to perform various data engineering and data science tasks on your data stored in OneLake or other sources. It also offers different options for creating and managing your Spark pools according to your needs and preferences.
Have questions about Microsoft Fabric? Here’s a quick FAQ to help you out:
Q: What is Microsoft Fabric? A: Microsoft Fabric is an end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Fabric integrates technologies like Azure Data Factory, Azure Synapse Analytics, and Power BI into a single unified product, empowering data and business professionals alike to unlock the potential of their data and lay the foundation for the era of AI.
Q: What are the benefits of using Microsoft Fabric? A: Some of the benefits of using Microsoft Fabric are:
It simplifies analytics by providing a single product with a unified experience and architecture, covering all the capabilities a developer needs to extract insights from data and present them to the business user.
It enables faster innovation by helping every person in your organization act on insights from within Microsoft 365 apps, such as Microsoft Excel and Microsoft Teams.
It reduces costs by eliminating data sprawl and creating custom views for everyone.
It supports open and scalable solutions that give data stewards additional control with built-in security, governance, and compliance.
It accelerates analysis by developing AI models on a single foundation without data movement, reducing the time data scientists need to deliver value.
Q: How can I get started with Microsoft Fabric? A: You can get started with Microsoft Fabric by signing up for a free trial here: https://www.microsoft.com/microsoft-fabric/try-for-free. You will get a fixed Fabric trial capacity for each business user, which may be used for any feature or capability.
Q: What are the main components of Microsoft Fabric? A: The main components of Microsoft Fabric are:
Unified data foundation: A data lake-centric hub that helps data engineers connect and curate data from different sources, eliminating sprawl and creating custom views for everyone.
Role-tailored tools: A set of tools that cater to different roles in the analytics process, such as data engineering, data warehousing, data science, real-time analytics, and business intelligence.
AI-powered capabilities: A set of capabilities that leverage generative AI and language model services, such as Azure OpenAI Service, to enable customers to use and create everyday AI experiences that are reinventing how employees spend their time.
Open, governed foundation: A foundation that supports open standards and formats, such as Apache Spark, SQL, Python, R, and Parquet, and provides robust data security, governance, and compliance features.
Cost management: A feature that helps customers optimize their spending on Fabric by providing visibility into their usage and costs across different services and resources.
Q: How does Microsoft Fabric integrate with other Microsoft products? A: Microsoft Fabric integrates seamlessly with other Microsoft products, such as:
Microsoft 365: Users can access insights from Fabric within Microsoft 365 apps, such as Excel and Teams, using natural language queries or pre-built templates.
Azure OpenAI Service: Users can leverage generative AI and language model services from Azure OpenAI Service to create everyday AI experiences within Fabric.
Azure Data Explorer: Users can ingest, store, analyze, and visualize massive amounts of streaming data from various sources using Azure Data Explorer within Fabric.
Azure IoT Hub: Users can connect millions of devices and stream real-time data to Fabric using Azure IoT Hub.
Q: How does Microsoft Fabric compare with other analytics platforms? A: Microsoft Fabric differs from other analytics platforms in several ways:
It is an end-to-end analytics product that addresses every aspect of an organization’s analytics needs with a single product and a unified experience.
It is a SaaS product that is automatically integrated and optimized, and users can sign up within seconds and get real business value within minutes.
It is an AI-powered platform that leverages generative AI and language model services to enable customers to use and create everyday AI experiences.
It is an open and scalable platform that supports open standards and formats, and provides robust data security, governance, and compliance features.
Q: Who are the target users of Microsoft Fabric? A: Microsoft Fabric is designed for enterprises that want to transform their data into a competitive advantage. It caters to different roles in the analytics process, such as:
Data engineers: They can use Fabric to connect and curate data from different sources, create custom views for everyone, and manage powerful AI models without data movement.
Data warehousing professionals: They can use Fabric to build scalable data warehouses using SQL or Apache Spark, perform complex queries across structured and unstructured data sources, and optimize performance using intelligent caching.
Data scientists: They can use Fabric to develop AI models using Python or R on a single foundation without data movement, leverage generative AI and language model services from Azure OpenAI Service, and deploy models as web services or APIs.
Data analysts: They can use Fabric to explore and analyze data using SQL or Apache Spark notebooks or Power BI Desktop within Fabric, create rich visualizations using Power BI Embedded within Fabric or Power BI Online outside of Fabric.
Business users: They can use Fabric to access insights from within Microsoft 365 apps using natural language queries or pre-built templates, or use Power BI Online outside of Fabric to consume reports or dashboards created by analysts.
Data science is the process of extracting insights from data using various methods and techniques, such as statistics, machine learning, and artificial intelligence. Data science can help organizations solve complex problems, optimize processes, and create new opportunities.
However, data science is not an easy task. It involves multiple steps and challenges, such as:
Finding and accessing relevant data sources
Exploring and understanding the data
Cleaning and transforming the data
Experimenting and building machine learning models
Deploying and operationalizing the models
Communicating and presenting the results
To perform these steps effectively, data scientists need a powerful and flexible platform that can support their end-to-end workflow and enable them to collaborate with other roles, such as data engineers, analysts, and business users.
This is where Microsoft Fabric comes in.
Microsoft Fabric is an end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Fabric integrates technologies like Azure Data Factory, Azure Synapse Analytics, and Power BI into a single unified product, empowering data and business professionals alike to unlock the potential of their data and lay the foundation for the era of AI.
In this blog post, I will focus on how Microsoft Fabric offers a rich and comprehensive Data Science experience that can help data scientists complete their tasks faster and more easily.
The Data Science experience in Microsoft Fabric
The Data Science experience in Microsoft Fabric consists of multiple natively built features that enable collaboration, data acquisition, sharing, and consumption in a seamless way. In this section, I will describe some of these features and how they can help data scientists in each step of their workflow.
Data discovery and pre-processing
The first step in any data science project is to find and access relevant data sources. Microsoft Fabric users can interact with data in OneLake using the Lakehouse item. A Lakehouse easily attaches to a notebook, letting you browse and interact with data, and you can read data from a Lakehouse directly into a pandas DataFrame.
This makes seamless data reads from OneLake possible during exploration. A powerful set of tools is also available for data ingestion and orchestration through data integration pipelines, a natively integrated part of Microsoft Fabric. Easy-to-build data pipelines can access and transform the data into a format that machine learning can consume.
An important part of the machine learning process is to understand data through exploration and visualization. Depending on the data storage location, Microsoft Fabric offers a set of different tools to explore and prepare the data for analytics and machine learning.
For example, users can use SQL or Apache Spark notebooks to query and analyze data using familiar languages like SQL, Python, R, or Scala. They can also use Data Wrangler to perform common data cleansing and transformation tasks using a graphical interface.
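As a hedged sketch of that exploration step, the snippet below reads lakehouse data into pandas from a Fabric notebook. The mount path follows the default-lakehouse convention, and the file and table names are hypothetical placeholders.

```python
import pandas as pd

# Read a file through the default lakehouse mount point
# (file name is a hypothetical placeholder).
reviews = pd.read_csv("/lakehouse/default/Files/reviews.csv")
print(reviews.describe())

# Or read a Delta table through Spark and convert it to pandas.
campaigns = spark.read.table("campaigns").toPandas()
```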
Experimentation and modeling
The next step in the data science workflow is to experiment with different algorithms and techniques to build machine learning models that can address the problem at hand. Microsoft Fabric supports various ways to develop and train machine learning models using Python or R on a single foundation without data movement.
For example, users can use the Azure Machine Learning SDK within notebooks to access features such as automated machine learning, hyperparameter tuning, model explainability, and model management. They can also leverage generative AI and language model services from Azure OpenAI Service to create everyday AI experiences within Fabric.
Microsoft Fabric also provides an experiment item that allows users to create experiments that track the metrics and outputs of their machine learning runs. Users can compare different runs within an experiment, or across experiments, using interactive charts and tables.
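A minimal tracking sketch, assuming the experiment item exposes MLflow-style APIs from the notebook; the experiment, parameter, and metric names are placeholders.

```python
import mlflow

# Point runs at a named experiment (placeholder name).
mlflow.set_experiment("campaign-response-model")

with mlflow.start_run():
    # Log whatever parameters and metrics matter for the comparison.
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("auc", 0.87)
```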
Enrichment and operationalization
The final step in the data science workflow is to deploy and operationalize the machine learning models so that they can be consumed by other applications or users. Microsoft Fabric makes this step easy by providing various options to deploy models as web services or APIs.
For example, users can use the Azure Machine Learning SDK within notebooks to register their models in an Azure Machine Learning workspace and deploy them as web services on Azure Container Instances or Azure Kubernetes Service.
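A hedged sketch with the Azure Machine Learning SDK (v1): the workspace details, file names, and endpoint names are hypothetical placeholders, and the scoring script and environment file are assumed to exist alongside the notebook.

```python
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

# Workspace details are hypothetical placeholders.
ws = Workspace.get(
    name="my-aml-workspace",
    subscription_id="<subscription-id>",
    resource_group="my-resource-group",
)

# Register a trained model file (placeholder path and name).
model = Model.register(workspace=ws, model_path="model.pkl", model_name="campaign-model")

# score.py and env.yml are assumed to exist alongside this notebook.
inference_config = InferenceConfig(
    entry_script="score.py",
    environment=Environment.from_conda_specification("inference-env", "env.yml"),
)

# Deploy as a web service on Azure Container Instances.
service = Model.deploy(
    ws, "campaign-endpoint", [model], inference_config,
    AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1),
)
service.wait_for_deployment(show_output=True)
```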
Insights and communication
The ultimate goal of any data science project is to communicate and present the results and insights to stakeholders or customers. Microsoft Fabric enables this by integrating with Power BI, the leading business intelligence tool from Microsoft.
Users can create rich visualizations using Power BI Embedded within Fabric or Power BI Online outside of Fabric. They can also consume reports or dashboards created by analysts using Power BI Online outside of Fabric. Moreover, they can access insights from Fabric within Microsoft 365 apps using natural language queries or pre-built templates.
Conclusion
In this blog post, I have shown how Microsoft Fabric offers a comprehensive Data Science experience that can help data scientists complete their end-to-end workflow faster and more easily. Microsoft Fabric is an end-to-end analytics product that addresses every aspect of an organization’s analytics needs with a single product and a unified experience. It is an AI-powered platform that leverages generative AI and language model services to enable customers to use and create everyday AI experiences, and an open, scalable platform that supports open standards and formats while providing robust data security, governance, and compliance features.
In the world of data analytics, the choice between a data warehouse and a lakehouse can be a critical decision. Both have their strengths and are suited to different types of workloads. Microsoft Fabric, a comprehensive analytics solution, offers both options. This blog post will help you understand the differences between a lakehouse and a warehouse in Microsoft Fabric and guide you in making the right choice for your needs.
What is a Lakehouse in Microsoft Fabric?
A lakehouse in Microsoft Fabric is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location. It is a flexible and scalable solution that allows organizations to handle large volumes of data using a variety of tools and frameworks to process and analyze that data. It integrates with other data management and analytics tools to provide a comprehensive solution for data engineering and analytics.
The Lakehouse creates a serving layer by auto-generating a SQL endpoint and a default dataset during creation. This see-through functionality allows users to work directly on top of the Delta tables in the lake, providing a frictionless and performant experience all the way from data ingestion to reporting.
An important distinction from a full warehouse is that the SQL endpoint is a read-only experience and doesn’t support the full T-SQL surface area of a transactional data warehouse. It is also important to note that only tables in Delta format are available through the SQL endpoint.
Lakehouse vs Warehouse: A Decision Guide
When deciding between a lakehouse and a warehouse in Microsoft Fabric, there are several factors to consider:
Data Volume: Both lakehouses and warehouses can handle unlimited data volumes.
Type of Data: Lakehouses can handle unstructured, semi-structured, and structured data, while warehouses are best suited to structured data.
Developer Persona: Lakehouses are best suited to data engineers and data scientists, while warehouses are more suited to data warehouse developers and SQL engineers.
Developer Skill Set: Lakehouses require knowledge of Spark (Scala, PySpark, Spark SQL, R), while warehouses primarily require SQL skills.
Data Organization: Lakehouses organize data by folders and files, databases and tables, while warehouses use databases, schemas, and tables.
Read Operations: Both lakehouses and warehouses support Spark and T-SQL read operations.
Write Operations: Lakehouses use Spark (Scala, PySpark, Spark SQL, R) for write operations, while warehouses use T-SQL.
Conclusion
The choice between a lakehouse and a warehouse in Microsoft Fabric depends on your specific needs and circumstances. If you’re dealing with large volumes of unstructured or semi-structured data and have developers skilled in Spark, a lakehouse may be the best choice. On the other hand, if you’re primarily dealing with structured data and your developers are more comfortable with SQL, a warehouse might be more suitable.
Remember, with the flexibility offered by Fabric, you can implement either lakehouse or data warehouse architectures, or combine the two to get the best of both with a simple implementation.
This blog post was created with help from ChatGPT Pro.
Data engineering plays a crucial role in the modern data-driven world. It involves designing, building, and maintaining infrastructures and systems that enable organizations to collect, store, process, and analyze large volumes of data. Microsoft Fabric, a comprehensive analytics solution, offers a robust platform for data engineering. This blog post will provide a detailed overview of data engineering in Microsoft Fabric.
What is Data Engineering in Microsoft Fabric?
Data engineering in Microsoft Fabric enables users to design, build, and maintain infrastructures and systems that allow their organizations to collect, store, process, and analyze large volumes of data. Microsoft Fabric provides various data engineering capabilities to ensure that your data is easily accessible, well-organized, and of high-quality.
From the data engineering homepage, users can perform a variety of tasks:
Create and manage your data using a lakehouse
Design pipelines to copy data into your lakehouse
Use Spark Job definitions to submit batch/streaming jobs to Spark clusters
Use notebooks to write code for data ingestion, preparation, and transformation (see the sketch after this list)
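As a small sketch of the notebook path, here is a hypothetical preparation step in PySpark; the table names and columns are placeholders, and a default lakehouse is assumed to be attached.

```python
# A hypothetical cleanup step: deduplicate and filter raw data,
# then write a curated Delta table (names are placeholders).
raw = spark.read.table("campaigns_raw")

curated = (
    raw.dropDuplicates(["campaign_id"])
       .filter("revenue IS NOT NULL")
)

curated.write.format("delta").mode("overwrite").saveAsTable("campaigns_curated")
```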
Lakehouse Architecture
Lakehouses are data architectures that allow organizations to store and manage structured and unstructured data in a single location. They use various tools and frameworks to process and analyze that data. This can include SQL-based queries and analytics, as well as machine learning and other advanced analytics techniques.
Microsoft Fabric: An All-in-One Analytics Solution
Microsoft Fabric is an all-in-one analytics solution for enterprises that covers everything from data movement to data science, real-time analytics, and business intelligence. It offers a comprehensive suite of services, including data lake, data engineering, and data integration, all in one place.
Traditionally, organizations have been building modern data warehouses for their transactional and structured data analytics needs and data lakehouses for big data (semi-structured and unstructured) analytics needs. These two systems ran in parallel, creating silos, data duplication, and increased total cost of ownership.
Fabric, with its unification of data stores and standardization on the Delta Lake format, allows you to eliminate silos, remove data duplication, and drastically reduce total cost of ownership. With the flexibility offered by Fabric, you can implement either lakehouse or data warehouse architectures, or combine the two to get the best of both with a simple implementation.
Data Engineering Capabilities in Microsoft Fabric
Fabric makes it quick and easy to connect to Azure Data Services, as well as other cloud-based platforms and on-premises data sources, for streamlined data ingestion. You can quickly build insights for your organization using more than 200 native connectors, which are integrated into Fabric pipelines and complemented by user-friendly drag-and-drop data transformation with dataflows.
Fabric standardizes on the Delta Lake format, which means all the Fabric engines can access and manipulate the same dataset stored in OneLake without duplicating data. This storage system provides the flexibility to build lakehouses using a medallion architecture or a data mesh, depending on your organizational requirements. You can choose between a low-code or no-code experience for data transformation, using pipelines and dataflows, or a code-first experience with notebooks and Spark.
Power BI can consume data from the Lakehouse for reporting and visualization. Each Lakehouse also has a built-in TDS/SQL endpoint for easy connectivity and querying of data in the Lakehouse tables from other reporting tools.
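As a hedged example of that connectivity, the sketch below queries a Lakehouse SQL endpoint over TDS with pyodbc. The server name, database, and table are hypothetical placeholders; the real connection string should be copied from the SQL endpoint's settings in Fabric.

```python
import pyodbc

# Server, database, and table names are hypothetical placeholders;
# copy the real connection string from the SQL endpoint settings in Fabric.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-endpoint.datawarehouse.fabric.microsoft.com;"
    "Database=MarketingLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
)

cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM dbo.campaigns")
for row in cursor.fetchall():
    print(row)
```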
Conclusion
Microsoft Fabric is a powerful tool for data engineering, providing a comprehensive suite of services and capabilities for data collection, storage, processing, and analysis. Whether you’re looking to implement a lakehouse or data warehouse architecture, or a combination of both, Fabric offers the flexibility and functionality to meet your data engineering needs.
This blog post was created with help from ChatGPT Pro.