Microsoft published a new end-to-end pattern last week. Train a model inside Fabric. Score it against a governed semantic model. Push predictions straight into Power BI. No data exports. No credential juggling.
The blog post walks through a churn-prediction scenario. Semantic Link pulls data from a governed Power BI semantic model. MLflow tracks experiments and registers models. The PREDICT function runs batch inference in Spark. Real-time endpoints serve predictions through Dataflow Gen2. Everything lives in one workspace, one security context, one OneLake.
It reads well. It demos well.
But demo code is not production code. The gap between “it runs in my notebook” and “it runs every Tuesday at 4 AM without paging anyone” is exactly where Fabric Spark teams bleed time.
This is the checklist for crossing that gap.
Prerequisites that actually matter
The official blog assumes a Fabric-enabled workspace and a published semantic model. That is the starting line. Production is a different race.
Capacity planning comes first. Fabric Spark clusters consume capacity units. A batch scoring job running on an F64 during peak BI refresh hours competes for the same CUs your report viewers need. Run scoring in off-peak windows, or provision a separate capacity for data science workloads. Either way, know your CU ceiling before your first experiment. Discovering your scoring job throttles the CFO’s dashboard refresh is not a conversation you want to have.
Workspace isolation is not optional. Dev, test, prod. Semantic models promoted through deployment pipelines. ML experiments pinned to dev. Registered models promoted to prod only after validation passes. If your team trains models in the same workspace where finance runs their quarterly close dashboard, you are one accidental publish away from explaining why the revenue numbers just changed.
MLflow model signatures must be populated from day one. The PREDICT function requires them. No signature, no batch scoring. This constraint is easy to forget during prototyping and expensive to fix later. Make it a rule: every mlflow.sklearn.log_model call includes an infer_signature output. No exceptions. Write a pre-commit hook if you have to.
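A minimal sketch of what that rule looks like in practice (model name, training data, and the LightGBM flavor are placeholders; adapt it to your own training code):
import mlflow
from mlflow.models import infer_signature
from lightgbm import LGBMClassifier

# X_train / y_train: your prepared training data (placeholders here)
model = LGBMClassifier().fit(X_train, y_train)

# Capture the input/output schema that PREDICT will later rely on
signature = infer_signature(X_train, model.predict(X_train))

with mlflow.start_run():
    mlflow.lightgbm.log_model(
        model,
        artifact_path="model",
        signature=signature,
        registered_model_name="churn-model",  # placeholder name
    )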
Semantic Link: the part most teams underestimate
Semantic Link connects your Power BI semantic model to your Spark notebooks. Call fabric.read_table() and you get governed data. Same measures and definitions your business users see in their reports. The data in your model’s training set matches what shows up in Power BI.
This matters more than it sounds.
Every analytics team that has been around long enough has a story about metric inconsistency. “Active customer” means one thing in the DAX model, another thing in the SQL pipeline, and a third thing in the data scientist’s Python notebook. The numbers diverge. Somebody notices. A week of forensic reconciliation follows.
Semantic Link kills that problem at the root. But only if you use it deliberately.
Start with fabric.list_measures(). Audit what DAX measures exist. Understand which ones your model depends on. Then pull data with fabric.read_table() rather than querying lakehouse tables directly. When you need to engineer features beyond what the semantic model provides, document every derivation in a version-controlled notebook. Written down and committed. Not living in someone’s memory or buried in a thread.
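A minimal sketch with Semantic Link (the sempy package); the semantic model and table names are placeholders:
import sempy.fabric as fabric

# Audit the DAX measures defined in the governed semantic model
measures = fabric.list_measures("Customer Churn Model")
display(measures)

# Pull a governed table rather than reading lakehouse files directly
customers = fabric.read_table("Customer Churn Model", "Customers")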
Training guardrails worth building
The Fabric blog shows a clean LightGBM training flow with MLflow autologging. That is the happy path. Production needs the unhappy path covered too.
Validate data before training. Check row counts against expected baselines. Check for null spikes in key columns. Check that the class distribution has not shifted beyond your predefined threshold. A model trained on corrupted or stale data produces confident garbage. Confident garbage is worse than no model at all, because people act on it.
Tag every experiment run. MLflow in Fabric supports custom tags. Use them aggressively. Tag each run with the semantic model version it pulled from, the notebook commit hash, and the data snapshot date. Three months from now, when a stakeholder asks why the model flagged 200 customers as high churn risk and zero of them actually left, you need to reconstruct exactly what happened. Without tags, you are guessing.
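A sketch of that tagging habit (the tag values are placeholders; in practice, pull them from your pipeline parameters):
import mlflow

with mlflow.start_run():
    mlflow.set_tags({
        "semantic_model_version": "churn-sm-v12",
        "notebook_commit": "abc1234",
        "data_snapshot_date": "2026-01-31",
    })
    # ... training, logging, and registration happen inside the same run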
Build a champion-challenger gate. Before any new model version reaches production, it must beat the current model on a holdout set from the most recent data. Not any holdout set. The most recent one. Automate this comparison in a validation notebook that runs as a pipeline step before model registration. If the challenger fails to clear the margin you defined upfront, the pipeline halts. No override button. No “let’s just push it and see.” The gate exists to prevent optimism from substituting for evidence.
Batch scoring: the PREDICT function in production
Fabric’s PREDICT function is straightforward. Pass a registered MLflow model and a Spark DataFrame. Get predictions back. It supports scikit-learn, LightGBM, XGBoost, CatBoost, ONNX, PyTorch, TensorFlow, Keras, Spark, Statsmodels, and Prophet.
The production requirements are few but absolute.
Write predictions to a delta table in OneLake. Not to a temporary DataFrame that dies with the session. Partition that table by scoring date. Add a column for the model version that generated each row. This is your audit trail. When someone asks “why did customer 4471 show as high risk last Tuesday?”, you pull the partition, check the model version, and have an answer in minutes. Without that structure, the same question costs you a day.
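One way to wire that up, as a sketch: the MLFlowTransformer wrapper drives PREDICT from a Spark DataFrame, and the model name, version, feature list, and table names below are all placeholders.
from pyspark.sql import functions as F
from synapse.ml.predict import MLFlowTransformer

feature_cols = ["tenure", "monthly_charges", "support_tickets"]  # placeholder features

scorer = MLFlowTransformer(
    inputCols=feature_cols,
    outputCol="churn_prediction",
    modelName="churn-model",  # assumed registered model name
    modelVersion=3,           # assumed version
)

# customers_df: the governed feature DataFrame assembled earlier (assumption)
scored = (
    scorer.transform(customers_df)
    .withColumn("score_date", F.current_date())
    .withColumn("model_version", F.lit(3))
)

(
    scored.write.format("delta")
    .mode("append")
    .partitionBy("score_date")
    .saveAsTable("predictions.churn_scores")
)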
Chain your scoring job to run after your semantic model refresh. Sequence matters. If the model scores data from the prior refresh cycle, your predictions are one step behind reality. Use Fabric pipelines to enforce the dependency explicitly. Refresh completes, scoring starts.
Real-time endpoints: know exactly what you are signing up for
Fabric now offers ML model endpoints in preview. Activate one from the model registry. Fabric spins up managed containers and gives you a REST API. Dataflow Gen2 can call the endpoint during data ingestion, enriching rows with predictions in flight.
The capability is real. The constraints are also real.
Real-time endpoints support a limited set of model flavors: Keras, LightGBM, scikit-learn, XGBoost, and (since January 2026) AutoML-trained models. PyTorch, TensorFlow, and ONNX are not supported for real-time serving. If your production model uses one of those frameworks, batch scoring is your only path.
The auto-sleep feature deserves attention. Endpoints scale capacity to zero after five minutes without traffic. The first request after sleep incurs a cold-start delay while containers spin back up. For use cases that need consistent sub-second latency, you have two options: disable auto-sleep and accept the continuous capacity cost, or send periodic synthetic requests to keep the endpoint warm.
The word “preview” is load-bearing here. Preview means the API can change between updates. Preview means SLAs are limited. Preview means you need a batch-scoring fallback in place before you route any production workflow through a real-time endpoint. Build the fallback first. Test it. Then add the real-time path as an optimization on top.
The rollback plan you need to write before you ship
Most teams build forward. They write the training pipeline, the scoring job, the endpoint, the Power BI report that consumes predictions. Then they ship.
Nobody writes the backward path. Until something goes wrong.
Your rollback plan has three parts.
First, keep at least two prior model versions in the registry. If the current version starts producing bad predictions, you roll back by updating the model alias. One API call. The scoring pipeline picks up the previous version on its next run.
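A sketch of that one call using the standard MLflow client API (alias and version are placeholders; adjust to however your scoring pipeline resolves the model):
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Point the alias the scoring job resolves (e.g. models:/churn-model@champion) back at the prior version
client.set_registered_model_alias("churn-model", "champion", version=7)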
Second, partition prediction tables by date and model version. Rolling back a model means nothing if downstream reports still display the bad predictions. With partitioned tables, you can filter or drop the scoring run from the misbehaving version and revert to the prior run’s output.
Third, a kill switch for real-time endpoints. One API call to deactivate the endpoint. Traffic falls back to the latest batch-scored delta table. Your Power BI report keeps working, just without real-time enrichment, while you figure out what went wrong.
Test this plan. Not on paper. Run the rollback end to end in your dev environment. Time it. If reverting to a stable state takes longer than fifteen minutes, your plan is too complicated. Simplify it until the timer clears.
Ship it
The architecture Microsoft described is sound. Semantic Link for governed data access. MLflow for experiment tracking and model registration. PREDICT for batch scoring to OneLake. Real-time endpoints for low-latency enrichment. Power BI consuming prediction tables through Direct Lake or import.
But architecture alone does not keep a system running at 4 AM. The capacity plan does. The workspace isolation does. The data validation gate, the champion-challenger check, the scoring sequence, the endpoint fallback, the rollback drill. Those are what separate a demo from a service.
Do the checklist. Test the failure modes. Then ship.
This post was written with help from anthropic/claude-opus-4-6
Spark tuning often starts with the usual suspects (shuffle volume, skew, join strategy, caching)… but sometimes the biggest win is simply executing the same logical plan on a faster engine.
Microsoft Fabric’s Native Execution Engine (NEE) does exactly that: it keeps Spark’s APIs and control plane, but runs a large portion of Spark SQL / DataFrame execution on a vectorized C++ engine.
What NEE is (and why it’s fast)
NEE is a vectorized native engine that integrates into Fabric Spark and can accelerate many SQL/DataFrame operators without you rewriting your code.
You still write Spark SQL / DataFrames.
Spark still handles distributed execution and scheduling.
For supported operators, compute is offloaded to a native engine (reducing JVM overhead and using columnar/vectorized execution).
Fabric documentation calls out NEE as being based on Apache Gluten (the Spark-to-native glue layer) and Velox (the native execution library).
For Spark Job Definitions, add the same Spark property (spark.native.enabled = true).
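For reference, a sketch of setting it at session submission time from a notebook (same property; whether your runtime accepts it at this level may vary):
%%configure -f
{
    "conf": {
        "spark.native.enabled": "true"
    }
}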
3) Disable/enable per-query when you hit unsupported features
If a specific query uses an unsupported operator/expression and you want to force JVM Spark for that query:
SET spark.native.enabled=FALSE;
-- run the query
SET spark.native.enabled=TRUE;
How to confirm NEE is actually being used
Two low-friction checks:
Spark UI / History Server: look for plan nodes ending with Transformer or nodes like *NativeFileScan / VeloxColumnarToRowExec.
df.explain(): the same Transformer / NativeFileScan / Velox… hints should appear in the plan.
Fabric also exposes a dedicated view (“Gluten SQL / DataFrame”) to help spot which queries ran on the native engine vs. fell back.
Fallback is a feature (but you should know the common triggers)
NEE includes an automatic fallback mechanism: if the plan contains unsupported features, Spark will run that portion on the JVM engine.
A few notable limitations called out in Fabric documentation:
UDFs aren’t supported (fallback)
Structured streaming isn’t supported (fallback)
File formats like CSV/JSON/XML aren’t accelerated
ANSI mode isn’t supported
There are also some behavioral differences worth remembering (rounding/casting edge cases) if you have strict numeric expectations.
A pragmatic “NEE-first” optimization workflow
Turn NEE on for the environment (or your job) and rerun the workload.
If it’s still slow, open the plan and answer: is the slow part running on the native engine, or did it fall back?
If it fell back, make the smallest possible change to keep the query on the native path (e.g., avoid UDFs, prefer built-in expressions, standardize on Parquet/Delta; see the sketch after this list).
Once the plan stays mostly native, go back to classic Spark tuning: reduce shuffle volume, fix skew, sane partitioning, and confirm broadcast joins.
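For example, a hypothetical cleanup step rewritten to avoid a UDF (the column name is made up):
# A Python UDF forces that part of the plan back onto the JVM engine:
# from pyspark.sql import functions as F, types as T
# normalize = F.udf(lambda s: s.strip().upper(), T.StringType())
# df = df.withColumn("country_code", normalize("country_code"))

# The same logic with built-in expressions can stay on the native path
from pyspark.sql import functions as F
df = df.withColumn("country_code", F.upper(F.trim(F.col("country_code"))))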
Here’s a counterintuitive claim: the most important announcement for Fabric Spark teams in early 2026 has nothing to do with Spark.
It’s a SQL database.
Specifically, it’s the rapid adoption of SQL database in Microsoft Fabric—a fully managed, SaaS-native transactional database that went GA in November 2025 and has been quietly reshaping how production data flows into lakehouse architectures ever since. If you’re a data engineer running Spark workloads in Fabric, this changes more than you think.
The ETL Pipeline You Can Delete
Most Spark data engineers have a familiar pain point: getting operational data from transactional systems into the lakehouse. You build ingestion pipelines. You schedule nightly batch loads. You wrestle with CDC (change data capture) configurations, watermark columns, and retry logic. You maintain all of it, forever.
SQL database in Fabric eliminates that entire layer.
When data lands in a Fabric SQL database, it’s automatically replicated to OneLake as Delta tables in near real-time. No pipelines. No Spark ingestion jobs. No orchestration. The data just appears, already in the open Delta format your notebooks and Spark jobs expect.
This isn’t a minor convenience—it’s an architectural shift. Every ingestion pipeline you don’t write is a pipeline you don’t debug at 2 AM.
What This Actually Looks Like in Practice
Let’s say you’re building an analytics layer on top of an operational SaaS application. Today, your architecture probably looks something like this:
Application writes to Azure SQL or Cosmos DB
ADF or Spark job pulls data on a schedule
Data lands in a lakehouse as Delta tables
Downstream Spark jobs transform and aggregate
With SQL database in Fabric, steps 2 and 3 vanish. Your application writes directly to the Fabric SQL database, and the mirrored Delta tables are immediately available for Spark processing. Here’s what your downstream notebook looks like now:
# Read operational data directly - no ingestion pipeline needed
# The SQL database auto-mirrors to OneLake as Delta tables
orders_df = spark.read.format("delta").load(
    "abfss://your-workspace@onelake.dfs.fabric.microsoft.com/your-sqldb.SQLDatabase/dbo.Orders"
)

# Your transformation logic stays the same
from pyspark.sql import functions as F

daily_revenue = (
    orders_df
    .filter(F.col("order_date") >= F.date_sub(F.current_date(), 7))
    .groupBy("product_category")
    .agg(
        F.sum("total_amount").alias("revenue"),
        F.countDistinct("customer_id").alias("unique_customers")
    )
    .orderBy(F.desc("revenue"))
)

daily_revenue.write.format("delta").mode("overwrite").saveAsTable("gold.weekly_revenue_by_category")
The Spark code doesn’t change. What changes is everything upstream of it.
The Migration Risk Nobody’s Talking About
Here’s where it gets interesting—and where Malcolm Gladwell would lean forward in his chair. The biggest risk of SQL database in Fabric isn’t technical. It’s organizational.
Teams that have invested heavily in ingestion infrastructure will face a classic innovator’s dilemma: the new path is simpler, but the old path already works. The temptation is to keep running your existing ADF pipelines alongside the new auto-mirroring capability, creating a hybrid architecture that’s worse than either approach alone.
My recommendation: don’t hybrid. Pick a workload, migrate it end-to-end, and measure. Here’s a concrete rollout checklist:
Identify a candidate workload — Look for Spark jobs whose primary purpose is pulling data from a SQL source into Delta tables. These are your highest-value migration targets.
Provision a Fabric SQL database — It takes seconds. You provide a name; Fabric handles the rest. Autoscaling and auto-pause are built in.
Redirect your application writes — Point your operational application to the new Fabric SQL database. The engine is the same SQL Database Engine as Azure SQL, so T-SQL compatibility is high.
Validate the Delta mirror — Confirm that your data is appearing in OneLake. Check schema fidelity, latency, and row counts:
# In your Spark notebook, validate the mirrored data
spark.sql("""
    SELECT
        COUNT(*) AS row_count,
        MAX(modified_date) AS latest_record,
        MIN(modified_date) AS earliest_record
    FROM your_sqldb.dbo.Orders
""").show()
Decommission the ingestion pipeline — Once validated, turn off the ADF or Spark ingestion job. Don’t just disable it—delete it. Zombie pipelines are how technical debt accumulates.
Update your monitoring — Your existing data quality checks should still work since the Delta tables have the same schema. But update your alerting to watch for mirror latency instead of pipeline run failures.
The AI Angle Matters for Spark Teams Too
There’s a second dimension to this announcement that Spark engineers should pay attention to: the native vector data type in SQL database supports semantic search and RAG patterns directly in the transactional layer.
Why does that matter for Spark teams? Because it means your embedding pipelines can write vectors back to the same database your application reads from—closing the loop between batch ML processing in Spark and real-time serving in SQL. Instead of maintaining a separate vector store (Pinecone, Qdrant, etc.), you use the same SQL database that’s already mirrored into your lakehouse.
This is the kind of architectural simplification that compounds over time. Fewer systems means fewer failure modes, fewer credentials to manage, and fewer things to explain to your successor.
The Rollout Checklist
This week: Inventory your existing ingestion pipelines. How many just move data from SQL sources to Delta?
This sprint: Provision a Fabric SQL database and test the auto-mirror with a non-critical workload.
This quarter: Migrate your highest-volume ingestion pipeline and measure CU savings.
Track: Mirror latency, CU consumption before/after, and pipeline maintenance hours eliminated.
SQL database in Fabric went GA in November 2025 with enterprise features including row-level security, customer-managed keys, and private endpoints. For the full list of GA capabilities, see the official announcement. To understand how this fits into the broader Microsoft database + Fabric integration strategy, read Microsoft Databases and Microsoft Fabric: Your unified and AI-powered data estate. For Spark-specific Delta Lake concepts, the Delta Lake documentation remains the authoritative reference.
The best thing about this announcement isn’t any single feature. It’s that it makes your Spark architecture simpler by removing the parts that shouldn’t have been there in the first place.
This post was written with help from Claude Opus 4.6
If Spark performance work is surgery, monitoring is your live telemetry.
Microsoft Fabric gives you multiple monitoring entry points for Spark workloads: Monitor hub for cross-item visibility, item Recent runs for focused context, and application detail pages for deep investigation. This post is a practical playbook for using those together.
Why this matters
When a notebook or Spark job definition slows down, “run it again” is the most expensive way to debug. Real-time monitoring helps you:
spot bottlenecks while jobs are still running
isolate failures quickly
compare behavior across submitters and workspaces
1) Start at the Monitoring hub for cross-workspace triage
Use Monitoring in the Fabric navigation pane as your control tower.
Filter by item type (Notebook, Spark job definition, Pipeline)
Narrow by start time and workspace
Sort by duration or status to surface outliers
For broad triage, this is faster than jumping directly into individual notebooks.
2) Pivot to Spark application details for root-cause analysis
Once you identify a problematic run, open the Spark application detail page and work through tabs in order:
Jobs: status, stages, tasks, duration, and processed/read/written data
Resources: executor allocation and utilization in near real time
Logs: inspect Livy, Prelaunch, and Driver logs; download when needed
Item snapshots: confirm exactly what code/parameters/settings were used at execution time
This sequence prevents false fixes where you tune the wrong layer.
3) Use notebook contextual monitoring while developing
For iterative tuning, notebook contextual monitoring keeps authoring, execution, and debugging in one place.
Run a target cell/workload
Watch job/stage/task progress and executor behavior
Jump to Spark UI or detail monitoring for deeper traces
Adjust code or config and rerun
4) A lightweight real-time runbook
Confirm scope in the Monitoring hub (single run or systemic pattern)
Open application details for the failing/slower run
Check Jobs for stage/task imbalance and long-running segments
Check Resources for executor pressure
Check Logs for explicit failure signals
Verify snapshots so you debug the exact submitted artifact
Common mistakes to avoid
Debugging from memory instead of snapshots
Looking only at notebook cell output and skipping Logs/Resources
Treating one anomalous run as a global trend without Monitor hub filtering
If your Lakehouse tables are getting slower (or more expensive) over time, it’s often not “Spark is slow.” It’s usually table layout drift: too many small files, suboptimal clustering, and old files piling up.
In Fabric Lakehouse, the three table-maintenance levers you’ll reach for most are:
OPTIMIZE: compacts many small files into fewer, larger files (and can apply clustering)
Z-ORDER: co-locates related values to improve data skipping for common filters
VACUUM: deletes old files that are no longer referenced by the Delta transaction log (after a retention window)
Practical note: in Fabric, run these as Spark SQL in a notebook or Spark job definition (or use the Lakehouse maintenance UI). Don’t try to run them in the SQL Analytics Endpoint.
1) Start with the symptom: “small files” vs “bad clustering”
Before you reach for maintenance, quickly sanity-check what you’re fighting:
Many small files → queries spend time opening/reading lots of tiny Parquet files.
Poor clustering for your most common predicates (date, tenantId, customerId, region, etc.) → queries scan more data than they need.
Heavy UPDATE/DELETE/MERGE patterns → lots of new files + tombstones + time travel files.
If you only have small files, OPTIMIZE is usually your first win.
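A quick way to sanity-check the small-files symptom is Delta's DESCRIBE DETAIL, which reports file count and total size (table name is a placeholder):
DESCRIBE DETAIL my_table;
-- Compare numFiles with sizeInBytes: thousands of files for a few GB of data is a compaction candidate.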
2) OPTIMIZE: bin-packing for fewer, bigger files
Basic compaction
OPTIMIZE my_table;
Target a subset (example: recent partitions)
OPTIMIZE my_table WHERE date >= date_sub(current_date(), 7);
A useful mental model: OPTIMIZE is rewriting file layout (not changing table results). It’s maintenance, not transformation.
3) Z-ORDER: make your filters cheaper
Z-Ordering is for the case where you frequently query:
WHERE tenantId = ...
WHERE customerId = ...
WHERE deviceId = ... AND eventTime BETWEEN ...
Example:
OPTIMIZE my_table ZORDER BY (tenantId, eventDate);
Pick 1–3 columns that dominate your interactive workloads. If you try to z-order on everything, you’ll mostly burn compute for little benefit.
4) VACUUM: clean up old, unreferenced files (carefully)
VACUUM is about storage hygiene. Delta keeps old files around to support time travel and concurrent readers. VACUUM deletes files that are no longer referenced and older than the configured retention threshold.
VACUUM my_table;
Two practical rules:
Don’t VACUUM aggressively unless you understand the impact on time travel / rollback.
Treat the retention window as a governance decision (what rollback window do you want?) not just a cost optimization.
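If you want that window to be explicit rather than implicit, state it in the command (168 hours here is just an example, i.e. a 7-day rollback window):
VACUUM my_table RETAIN 168 HOURS;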
5) Fabric-specific gotchas (the ones that actually bite)
Where you can run these commands
These are Spark SQL maintenance commands. In Fabric, that means notebooks / Spark job definitions (or the Lakehouse maintenance UI), not the SQL Analytics Endpoint.
V-Order and OPTIMIZE
Fabric also has V-Order, which is a Parquet layout optimization aimed at faster reads across Fabric engines. If you’re primarily optimizing for downstream read performance (Power BI/SQL/Spark), it’s worth understanding whether V-Order is enabled for your workspace and table writes.
If your Fabric tenant has grown past “a handful of workspaces,” the problem isn’t just storage or compute—it’s finding the right items, understanding what they are, and making governance actionable.
That’s the motivation behind the OneLake catalog: a central hub to discover and manage Fabric content, with dedicated experiences for discovery (Explore), governance posture (Govern), and security administration (Secure).
This post is a practical walk-through of what’s available today, with extra focus on what Fabric admins get in the Govern experience.
What is the OneLake catalog?
Microsoft describes the OneLake catalog as a centralized place to find, explore, and use Fabric items—and to govern the data you own.
You open it from the Fabric navigation pane by selecting the OneLake icon.
Explore tab: tenant-wide discovery without losing context
The Explore tab is the “inventory + details” experience:
An items list of Fabric content you can access (and in some cases, content you can request access to).
An in-context details pane so you can inspect an item without navigating away from your filtered list.
Filters and selectors to narrow scope (for example: workspace, item-type categories, endorsement, and tags).
A key pattern here is fast triage: filter down to a domain/workspace, then click through items to answer:
Who owns this?
Where does it live?
When was it refreshed?
Is it endorsed/certified?
Does it have sensitivity labeling?
Tip for data engineers
If your tenant uses domains, scoping the catalog to a domain/subdomain is often the quickest way to keep the item list meaningful—especially when teams create similar notebooks/pipelines across many workspaces.
Govern tab: governance posture + recommended actions
The Govern tab is where the catalog becomes more than “a directory.” It combines:
Insights (high-level indicators you can drill into)
Recommended actions (cards that turn those insights into concrete next steps)
The Govern tab behaves differently depending on who you are:
Fabric admins see insights based on tenant metadata (items, workspaces, capacities, domains).
Data owners see insights scoped to items they own (using the My items concept).
The Fabric blog also calls out a preview experience that extends the OneLake catalog governance view for Fabric admins, providing consolidated indicators and deeper drill-down reporting.
What admins see on the Govern tab
From the Fabric admin perspective, the Govern experience is designed to answer:
What does our data estate look like (inventory, distribution, usage)?
Where are we under-labeled or non-compliant (sensitivity coverage, policy posture)?
What content is hard to trust or reuse (freshness, endorsement/description/tag coverage, sharing patterns)?
When admins choose View more, Learn documentation describes an expanded report with three areas:
Manage your data estate (inventory, capacities/domains, feature usage)
Protect, secure & comply (sensitivity label coverage and data loss prevention policy posture)
Discover, trust, and reuse (freshness, curation signals such as endorsement/description coverage, sharing)
A detail worth knowing: refresh cadence differs for admins
Per Microsoft Learn, admin insights and actions are based on Admin Monitoring Storage data and refresh automatically every day, so there can be a lag between changes you make and what the Govern insights reflect.
Secure tab: centralized security role management
The OneLake catalog Secure tab is a security administration surface that centralizes:
Workspace roles and permissions (for auditing access)
OneLake security roles across workspaces and item types
From the Secure tab, admins can create, edit, or delete OneLake security roles from a single location.
A practical workflow to adopt (teams + admins)
Here’s a lightweight approach that scales better than “ask around on Teams”:
Explore: Use domain/workspace scoping + filters to find candidate items.
Inspect: Use the in-context details pane to sanity-check ownership, endorsement, sensitivity, and freshness.
Govern: Use the recommended actions cards to drive a small number of measurable improvements:
increase sensitivity label coverage
improve endorsement/certification where appropriate
standardize descriptions/tags for key assets
Secure: Audit role sprawl and standardize how OneLake security roles are managed across items.
Considerations and limitations to keep in mind
A few constraints called out in Learn documentation (useful when you’re setting expectations):
The Govern tab doesn’t support cross-tenant scenarios or guest users.
The Govern tab isn’t available when Private Link is activated.
Govern insights for admins can be up to a day behind due to daily refresh of admin monitoring storage.
Spark performance work is mostly execution work: understanding where the DAG splits into stages, where shuffles happen, and why a handful of tasks can dominate runtime.
This post is a quick, practical refresher on the Spark execution model — with Fabric-specific pointers on where to observe jobs, stages, and tasks.
1) Applications, jobs, stages, tasks
In Spark, your code runs as a Spark application. When you run an action (for example, count(), collect(), or writing a table), Spark submits a job. Each job is broken into stages, and each stage runs a set of tasks (often one task per partition).
A useful mental model:
Tasks are the unit of parallel work.
Stages group tasks that can run together without needing data from another stage.
Stage boundaries often show up where a shuffle is required (wide dependencies like joins and aggregations).
2) Lazy evaluation: why “nothing happens” until an action
Most DataFrame / Spark SQL transformations are lazy. Spark builds a plan and only executes when an action forces it.
Example (PySpark):
from pyspark.sql.functions import col

df = spark.read.table("fact_sales")

# Transformations (lazy)
filtered = df.filter(col("sale_date") >= "2026-01-01")

# Action (executes)
print(filtered.count())
This matters in Fabric notebooks because a single cell can trigger multiple jobs (for example, one job to materialize a cache and another to write output).
3) Shuffles: the moment your DAG turns expensive
A shuffle is when data must be redistributed across executors (typically by key). Shuffles introduce:
network transfer
disk I/O (shuffle files)
spill risk (memory pressure)
skew/stragglers (a few hot partitions dominate)
If you’re diagnosing a slow pipeline, assume a shuffle is the culprit until proven otherwise.
4) What to check in Fabric: jobs, stages, tasks
Fabric gives you multiple ways to see execution progress:
Notebook contextual monitoring: a progress indicator for notebook cells, with stage/task progress.
Spark monitoring / detail monitoring: drill into a Spark application and see jobs, stages, tasks, and duration breakdowns.
When looking at a slow run, focus on:
stages with large shuffle read/write
long-tail tasks (stragglers)
spill metrics (memory → disk)
skew indicators (a few tasks far slower than the median)
5) A repeatable debugging workflow (that scales)
Start with the plan: df.explain(True) for DataFrame/Spark SQL
Look for Exchange operators (shuffle) and join strategies (broadcast vs shuffle join)
Run once, then open monitoring: identify the longest stage(s)
Confirm whether it’s CPU-bound, shuffle-bound, or spill-bound
Apply the common fixes in order: avoid the shuffle (broadcast small dims)
Reduce shuffle volume (filter early, select only needed columns)
Fix partitioning (repartition by join keys; avoid extreme partition counts)
Turn on AQE (spark.sql.adaptive.enabled=true) to let Spark coalesce shuffle partitions and mitigate skew
Quick checklist
Do I know which stage is dominating runtime?
Is there an Exchange / shuffle boundary causing it?
Are a few tasks straggling (skew), or are all tasks uniformly slow?
Shuffles are where Spark jobs go to get expensive: a wide join or aggregation forces data to move across the network, materialize shuffle files, and often spill when memory pressure spikes.
In Microsoft Fabric Spark workloads, the fastest optimization is usually the boring one: avoid the shuffle when you can, and when you can’t, make it smaller and better balanced.
This post lays out a practical, repeatable approach you can apply in Fabric notebooks and Spark job definitions.
1) Start with the simplest win: avoid the shuffle
If one side of your join is genuinely small (think lookup/dimension tables), use a broadcast join so Spark ships the small table to executors and avoids a full shuffle.
In Fabric’s Spark best practices, Microsoft explicitly calls out broadcast joins for small lookup tables as a way to avoid shuffles entirely.
Example (PySpark):
from pyspark.sql.functions import broadcast
fact = spark.read.table("fact_sales")
dim = spark.read.table("dim_product")
# If dim_product is small enough, broadcast it
joined = fact.join(broadcast(dim), on="product_id", how="left")
If you can’t broadcast safely, move to the next lever.
2) Make the shuffle less painful: tune shuffle parallelism
Spark controls the number of shuffle partitions for joins and aggregations with spark.sql.shuffle.partitions (default: 200 in Spark SQL).
Too few partitions → huge partitions → long tasks, spills, and stragglers.
Too many partitions → tiny tasks → scheduling overhead and excess shuffle metadata.
A decent heuristic is to start with something proportional to total executor cores and then iterate using the Spark UI (watch stage task durations, shuffle read/write sizes, and spill metrics).
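A minimal sketch of that starting point (the conf keys are standard Spark; the defaults and the multiplier are assumptions you should tune against the Spark UI):
# Rough starting point: a small multiple of total executor cores
executor_cores = int(spark.conf.get("spark.executor.cores", "4"))
executor_instances = int(spark.conf.get("spark.executor.instances", "8"))
total_cores = executor_cores * executor_instances

spark.conf.set("spark.sql.shuffle.partitions", str(total_cores * 2))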
3) Let Spark fix itself (when it can): enable AQE
Adaptive Query Execution (AQE) uses runtime statistics to optimize a query as it runs.
Fabric’s Spark best practices recommend enabling AQE to dynamically optimize shuffle partitions and handle skewed data automatically.
AQE is particularly helpful when:
Your input data distribution changes day-to-day
A static spark.sql.shuffle.partitions value is right for some workloads but wrong for others
You hit skew where a small number of partitions do most of the work
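Enabling it is a session-level setting (these are standard Spark configs; AQE is already on by default in Spark 3.2+ runtimes, so this mostly matters if someone turned it off):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")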
Spark tuning has a way of chewing up time: you start with something that “should be fine,” performance is off, costs creep up, and suddenly you’re deep in configs, Spark UI, and tribal knowledge trying to figure out what actually matters.
That’s why I’m excited to highlight sparkwise, an open-source Python package created by Santhosh Kumar Ravindran, one of my direct reports here at Microsoft. Santhosh built sparkwise to make Spark optimization in Microsoft Fabric less like folklore and more like a repeatable workflow: automated diagnostics, session profiling, and actionable recommendations to help teams drive better price-performance without turning every run into an investigation.
If you’ve ever thought, “I know something’s wrong, but I can’t quickly prove what to change,” sparkwise is aimed squarely at that gap. (PyPI)
As of January 5, 2026, the latest release is sparkwise 1.4.2 on PyPI. (PyPI)
The core idea: stop guessing, start diagnosing
Spark tuning often fails for two reasons:
Too many knobs (Spark, Delta, Fabric-specific settings, runtime behavior).
Not enough feedback (it’s hard to translate symptoms into the few changes that actually matter).
sparkwise attacks both.
It positions itself as an “automated Data Engineering specialist for Apache Spark on Microsoft Fabric,” offering:
Intelligent diagnostics
Configuration recommendations
Comprehensive session profiling
…so you can get to the best price/performance outcome without turning every notebook run into a science project. (PyPI)
Why sparkwise exists (and the problems it explicitly targets)
From the project description, sparkwise focuses on the stuff that reliably burns time and money in real Fabric Spark work:
Cost optimization: detect configurations that waste capacity and extend runtime (PyPI)
Performance optimization: validate and enable Fabric-specific acceleration paths like Native Engine and resource profiles (PyPI)
Faster iteration: detect Starter Pool blockers that force slower cold starts (3–5 minutes is called out directly) (PyPI)
Learning & clarity: interactive Q&A across 133 Spark/Delta/Fabric configurations (PyPI)
Workload understanding: profiling across sessions, executors, jobs, and resources (PyPI)
Decision support: priority-ranked recommendations with impact analysis (PyPI)
If you’ve ever thought “I know something is off, but I can’t prove which change matters,” this is aimed squarely at you.
What you get: a feature tour that maps to real-world Spark pain
sparkwise’s feature set is broad, but it’s not random. It clusters nicely into a few “jobs to be done.”
1) Automated diagnostics (the fast “what’s wrong?” pass)
The diagnostics layer checks a bunch of high-impact areas, including:
Native Execution Engine: verifies Velox usage and detects fallbacks to row-based processing (PyPI)
Spark compute: analyzes Starter vs Custom Pool usage and flags immutable configs (PyPI)
Data skew detection: identifies imbalanced task distributions (PyPI)
Together, those checks give you a clean “ops loop” for keeping your Fabric Spark jobs and Delta tables healthy.
A realistic “first hour” workflow I’d recommend
If you’re trying sparkwise on a real Fabric notebook today, here’s a practical order of operations:
Run diagnose.analyze() first. Use it as your “triage” to catch the high-impact misconfigs (Native Engine fallback, AQE off, Starter Pool blockers). (PyPI)
Use ask.config() for any red/yellow item you don’t fully understand. The point is speed: read the explanation in context and decide. (PyPI)
Profile the session. If the job is still slow/expensive after obvious fixes, profile and look for the real culprit: skew, shuffle pressure, poor parallelism, memory pressure. (PyPI)
If the job smells like skew, use advanced skew detection, especially for joins and wide aggregations. (PyPI)
If your tables are growing, run storage analysis early. Small files and weak partitioning quietly tax everything downstream. (PyPI)
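As a rough sketch of what that first hour can look like in a notebook (the import path and the argument to ask.config() are assumptions based on the calls named above; check the package docs for the exact API):
%pip install sparkwise

# Assumed import style; adjust to the actual package layout
from sparkwise import diagnose, ask

diagnose.analyze()                              # triage pass over the current session
ask.config("spark.sql.shuffle.partitions")      # explain one configuration in context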
That flow is how you turn “tuning” from an art project into a checklist.
Closing: why this matters for Fabric teams
I’m amplifying sparkwise because it’s the kind of contribution that scales beyond the person who wrote it. Santhosh took hard-earned, real-world Fabric Spark tuning experience and turned it into something other engineers can use immediately — a practical way to spot waste, unblock faster iteration, and make smarter performance tradeoffs.
If your team runs Fabric Spark workloads regularly, treat sparkwise like a lightweight tuning partner:
install it,
run the diagnostics,
act on one recommendation,
measure the improvement,
repeat.
And if you end up with feedback or feature ideas, even better — that’s how tools like this get sharper and more broadly useful.
Microsoft Fabric makes it incredibly easy to spin up Spark workloads: notebooks, Lakehouse pipelines, dataflows, SQL + Spark hybrid architectures—the whole buffet.
What’s still hard? Knowing why a given Spark job is slow, expensive, or flaky.
A Lakehouse pipeline starts timing out.
A notebook that used to finish in 5 minutes is now taking 25.
Costs spike because one model training job is shuffling half the lake.
You open the Spark UI, click around a few stages, stare at shuffle graphs, and say the traditional words of Spark debugging:
“Huh.”
This is where an AI assistant should exist.
In this post, we’ll walk through how to build exactly that for Fabric Spark: a Job Doctor that:
Reads Spark telemetry from your Fabric environment
Detects issues like skew, large shuffles, spill, and bad configuration
Uses a large language model (LLM) to explain what went wrong
Produces copy-pasteable fixes in Fabric notebooks / pipelines
Runs inside Fabric using Lakehouses, notebooks, and Azure AI models
This is not a fake product announcement. This is a blueprint you can actually build.
What Is the Fabric “Job Doctor”?
At a high level, the Job Doctor is:
A Fabric-native analytics + AI layer that continuously reads Spark job history, detects common performance anti-patterns, and generates human-readable, prescriptive recommendations.
The raw telemetry arrives as JSON lines: each record is essentially an EventLog-style payload that looks like the Spark listener event.
To build a Job Doctor, you’ll:
Read the JSON lines into Fabric Spark
Explode / parse the properties payload
Aggregate per-task metrics into per-stage metrics for each application
We’ll skip the exact parsing details (they depend on how you set up the emitter and which events/metrics you enable) and assume that after a normalization job, you have a table with one row per (applicationId, stageId, taskId).
That’s what the next sections use.
3. Capturing Query Plans in Fabric (Optional, but Powerful)
Spark query plans are gold when you’re trying to answer why a stage created a huge shuffle or why a broadcast join didn’t happen.
There isn’t yet a first-class “export query plan as JSON” API in PySpark, but in Fabric notebooks you can use a (semi-internal) trick that works today:
import json
df = ... # some DataFrame you care about
# Advanced / internal: works today but isn't a public, stable API
plan_json = json.loads(df._jdf.queryExecution().toJSON())
You can also log the human-readable plan:
df.explain(mode="formatted") # documented mode, prints a detailed plan
To persist the JSON plan for the Job Doctor, tie it to the Spark application ID:
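A minimal sketch (assumes the plan_json captured above and the job_doctor.query_plans table used later in this post):
from pyspark.sql import Row

app_id = spark.sparkContext.applicationId

spark.createDataFrame(
    [Row(applicationId=app_id, plan_json=json.dumps(plan_json))]
).write.mode("append").saveAsTable("job_doctor.query_plans")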
Each raw diagnostic record carries a properties payload (nested JSON with stage/task/metric detail).
The normalization step (which you can run as a scheduled pipeline) should:
Filter down to metrics/events relevant for performance (e.g. task / stage metrics)
Extract stageId, taskId, executorRunTime, shuffleReadBytes, etc., into top-level columns
Persist the result as job_doctor.task_metrics (or similar)
For the rest of this post, we’ll assume you’ve already done that and have a table with columns:
applicationId
stageId
taskId
executorRunTime
shuffleReadBytes
shuffleWriteBytes
memoryBytesSpilled
diskBytesSpilled
Aggregating Stage Metrics in Fabric
Now we want to collapse per-task metrics into per-stage metrics per application.
In a Fabric notebook:
from pyspark.sql import functions as F
task_metrics = spark.table("job_doctor.task_metrics")
stage_metrics = (
task_metrics
.groupBy("applicationId", "stageId")
.agg(
F.countDistinct("taskId").alias("num_tasks"),
F.sum("executorRunTime").alias("total_task_runtime_ms"),
# Depending on Spark version, you may need percentile_approx instead
F.expr("percentile(executorRunTime, 0.95)").alias("p95_task_runtime_ms"),
F.max("executorRunTime").alias("max_task_runtime_ms"),
F.sum("shuffleReadBytes").alias("shuffle_read_bytes"),
F.sum("shuffleWriteBytes").alias("shuffle_write_bytes"),
F.sum("memoryBytesSpilled").alias("memory_spill_bytes"),
F.sum("diskBytesSpilled").alias("disk_spill_bytes"),
)
.withColumn(
"skew_ratio",
F.col("max_task_runtime_ms") /
F.when(F.col("p95_task_runtime_ms") == 0, 1).otherwise(F.col("p95_task_runtime_ms"))
)
.withColumn("shuffle_read_mb", F.col("shuffle_read_bytes") / (1024**2))
.withColumn("shuffle_write_mb", F.col("shuffle_write_bytes") / (1024**2))
.withColumn(
"spill_mb",
(F.col("memory_spill_bytes") + F.col("disk_spill_bytes")) / (1024**2)
)
)
stage_metrics.write.mode("overwrite").saveAsTable("job_doctor.stage_metrics")
This gives you a Fabric Lakehouse table with:
skew_ratio
shuffle_read_mb
shuffle_write_mb
spill_mb
p95_task_runtime_ms
num_tasks, total_task_runtime_ms, etc.
You can run this notebook:
On a schedule via a Data Pipeline
Or as a Data Engineering job configured in the workspace
Part 3: Adding a Rule Engine Inside Fabric
Now that the metrics are in a Lakehouse table, let’s add a simple rule engine in Python.
This will run in a Fabric notebook (or job) and write out issues per stage.
from pyspark.sql import Row, functions as F
stage_metrics = spark.table("job_doctor.stage_metrics")
# For simplicity, we'll collect to the driver here.
# This is fine if you don't have thousands of stages.
# For very large workloads, you'd instead do this via a UDF / mapInPandas / explode.
stage_rows = stage_metrics.collect()
Define some basic rules:
def detect_issues(stage_row):
issues = []
# 1. Skew detection
if stage_row.skew_ratio and stage_row.skew_ratio > 5:
issues.append({
"issue_id": "SKEWED_STAGE",
"severity": "High",
"details": f"Skew ratio {stage_row.skew_ratio:.1f}"
})
# 2. Large shuffle
total_shuffle_mb = (stage_row.shuffle_read_mb or 0) + (stage_row.shuffle_write_mb or 0)
if total_shuffle_mb > 10_000: # > 10 GB
issues.append({
"issue_id": "LARGE_SHUFFLE",
"severity": "High",
"details": f"Total shuffle {total_shuffle_mb:.1f} MB"
})
# 3. Excessive spill
if (stage_row.spill_mb or 0) > 1_000: # > 1 GB
issues.append({
"issue_id": "EXCESSIVE_SPILL",
"severity": "Medium",
"details": f"Spill {stage_row.spill_mb:.1f} MB"
})
return issues
Apply the rules and persist the output:
issue_rows = []
for r in stage_rows:
for issue in detect_issues(r):
issue_rows.append(Row(
applicationId=r.applicationId,
stageId=r.stageId,
issue_id=issue["issue_id"],
severity=issue["severity"],
details=issue["details"]
))
# Note: createDataFrame cannot infer a schema from an empty list, so this assumes at least one issue was found
issues_df = spark.createDataFrame(issue_rows)
issues_df.write.mode("overwrite").saveAsTable("job_doctor.stage_issues")
Now you have a table of Spark issues detected per run inside your Lakehouse.
Later, the LLM will use these as structured hints.
Part 4: Bringing in the LLM — Turning Metrics into Diagnosis
So far, everything has been pure Spark in Fabric.
Now we want a model (e.g., Azure AI “Models as a Service” endpoint or Azure OpenAI) to turn:
job_doctor.stage_metrics
job_doctor.stage_issues
job_doctor.spark_conf
job_doctor.query_plans
into an actual diagnosis sheet a human can act on.
In Fabric, this is simplest from a Spark notebook using a Python HTTP client.
Below, I’ll show the pattern using an Azure AI serverless model endpoint (the one that uses model: "gpt-4.1" in the body).
1. Prepare the Prompt Payload
First, fetch the data for a single Spark application:
import json
from pyspark.sql import functions as F
app_id = "app-20240501123456-0001" # however you pick which run to diagnose
stages_df = spark.table("job_doctor.stage_metrics").where(F.col("applicationId") == app_id)
issues_df = spark.table("job_doctor.stage_issues").where(F.col("applicationId") == app_id)
conf_df = spark.table("job_doctor.spark_conf").where(F.col("applicationId") == app_id)
plans_df = spark.table("job_doctor.query_plans").where(F.col("applicationId") == app_id)
stages_json = stages_df.toPandas().to_dict(orient="records")
issues_json = issues_df.toPandas().to_dict(orient="records")
conf_json = conf_df.toPandas().to_dict(orient="records")
plans_json = plans_df.toPandas().to_dict(orient="records") # likely 0 or 1 row
Then build a compact but informative prompt:
prompt = f"""
You are an expert in optimizing Apache Spark jobs running on Microsoft Fabric.
Here is summarized telemetry for one Spark application (applicationId={app_id}):
Stage metrics (JSON):
{json.dumps(stages_json, indent=2)}
Detected issues (JSON):
{json.dumps(issues_json, indent=2)}
Spark configuration (key/value list):
{json.dumps(conf_json, indent=2)}
Query plans (optional, may be empty):
{json.dumps(plans_json, indent=2)}
Your tasks:
1. Identify the top 3–5 performance issues for this run.
2. For each, explain the root cause in plain language.
3. Provide concrete fixes tailored for Fabric Spark, including:
- spark.conf settings (for notebooks/jobs)
- suggestions for pipeline settings where relevant
- SQL/DataFrame code snippets
4. Estimate likely performance impact (e.g., "30–50% reduction in runtime").
5. Call out any risky or unsafe changes that should be tested carefully.
Return your answer as markdown.
"""
2. Call an Azure AI Model from Fabric Spark
For the serverless “Models as a Service” endpoint, the pattern looks like this:
import os
import requests
# Example: using Azure AI Models as a Service
# AZURE_AI_ENDPOINT might look like: https://models.inference.ai.azure.com
AZURE_AI_ENDPOINT = os.environ["AZURE_AI_ENDPOINT"]
AZURE_AI_KEY = os.environ["AZURE_AI_KEY"]
MODEL = "gpt-4.1" # or whatever model you've enabled
headers = {
"Content-Type": "application/json",
"api-key": AZURE_AI_KEY,
}
body = {
"model": MODEL,
"messages": [
{"role": "system", "content": "You are a helpful assistant for optimizing Spark jobs on Microsoft Fabric."},
{"role": "user", "content": prompt},
],
}
# NOTE: the exact path and auth header can vary by endpoint type; check your endpoint's documentation
resp = requests.post(
f"{AZURE_AI_ENDPOINT}/openai/chat/completions",
headers=headers,
json=body,
)
resp.raise_for_status()
diagnosis = resp.json()["choices"][0]["message"]["content"]
If you instead use a provisioned Azure OpenAI resource, the URL shape is slightly different (you call /openai/deployments/<deploymentName>/chat/completions and omit the model field), but the rest of the logic is identical.
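For reference, a sketch of that variant (deployment name and api-version are placeholders; it reuses the same messages payload built above):
AOAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]  # e.g. https://<resource>.openai.azure.com
DEPLOYMENT = "gpt-4o"                                # your deployment name, not a model name

resp = requests.post(
    f"{AOAI_ENDPOINT}/openai/deployments/{DEPLOYMENT}/chat/completions?api-version=2024-06-01",
    headers={"api-key": os.environ["AZURE_OPENAI_KEY"], "Content-Type": "application/json"},
    json={"messages": body["messages"]},             # no "model" field for Azure OpenAI
)
# Parse the response exactly as above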
At this point, diagnosis is markdown you can:
Render inline in the notebook with displayHTML
Save into a Lakehouse table
Feed into a Fabric semantic model for reporting
Part 5: What the Job Doctor’s Output Looks Like in Fabric
A good Job Doctor output for Fabric Spark might look like this (simplified):
🔎 Issue 1: Skewed Stage 4 (skew ratio 12.3)
What I see
Stage 4 has a skew ratio of 12.3 (max task runtime vs. p95).
This stage also reads ~18.2 GB via shuffle, which amplifies the imbalance.
Likely root cause
A join or aggregation keyed on a column where a few values dominate (e.g. a “default” ID, nulls, or a small set of hot keys). One partition ends up doing far more work than the others.
Fabric-specific fixes
In your notebook or job settings, enable Adaptive Query Execution and skew join handling:
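For example (standard Spark settings, applied at the session level):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")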
If the query is in SQL (Lakehouse SQL endpoint), enable AQE at the session/job level through Spark configuration.
If one side of the join is a small dimension table, add a broadcast hint:
SELECT /*+ BROADCAST(dim) */ f.*
FROM fact f
JOIN dim
ON f.key = dim.key;
Estimated impact: 30–50% reduction in total job runtime, depending on how skewed the key distribution is.
📦 Issue 2: Large Shuffle in Stage 2 (~19.7 GB)
What I see
Stage 2 reads ~19.7 GB via shuffle.
Shuffle partitions are set to 200 (Spark default).
Likely root cause
A join or aggregation is shuffling nearly the full dataset, but parallelism is low given the data volume. That leads to heavy tasks and increased risk of spill.
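Fabric-specific fixes
In a notebook or job, raise shuffle parallelism for this workload; a sketch at the session level (the value is illustrative, so size it to your data volume and cores):
spark.conf.set("spark.sql.shuffle.partitions", "400")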
For pipelines, set this at the Spark activity level under Spark configuration, or through your Fabric environment’s resource profile if you want a new default.
Also consider partitioning by the join key earlier in the pipeline:
df = df.repartition("customer_id")
Estimated impact: More stable runtimes and reduced likelihood of spill; wall-clock improvements if your underlying capacity has enough cores.
💾 Issue 3: Spill to Disk (~1.8 GB) in Stage 3
What I see
Stage 3 spills ~1.8 GB to disk.
This correlates with under-parallelism or memory pressure.
Fabric-specific fixes
Adjust cluster sizing via Fabric capacity / resource profiles (enough cores + memory per core).
Increase spark.sql.shuffle.partitions as above.
Avoid wide transformations producing huge intermediate rows early in the job; materialize smaller, more selective intermediates first.
You can persist the diagnosis text into a table:
from pyspark.sql import Row
spark.createDataFrame(
[Row(applicationId=app_id, diagnosis_markdown=diagnosis)]
).write.mode("append").saveAsTable("job_doctor.diagnoses")
Then you can build a Power BI report in Fabric bound to:
job_doctor.diagnoses
job_doctor.stage_metrics
job_doctor.stage_issues
to create a “Spark Job Health” dashboard where:
Rows = recent Spark runs
Columns = severity, duration, shuffle size, spill, etc.
A click opens the AI-generated diagnosis for that run
All inside the same workspace.
Part 6: Stitching It All Together in Fabric
Let’s recap the full Fabric-native architecture.
1. Telemetry Ingestion (Environment / Emitter)
Configure a Fabric environment for your Spark workloads.
Add a Fabric Apache Spark diagnostic emitter to send logs/metrics to:
Azure Storage (for Lakehouse shortcuts), or
Log Analytics / Event Hubs if you prefer KQL or streaming paths.
(Optional) From notebooks/pipelines, capture:
Spark configs → job_doctor.spark_conf
Query plans → job_doctor.query_plans
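A minimal sketch for the config snapshot (the table name matches the one used earlier; run it inside the session you want to diagnose):
from pyspark.sql import Row

app_id = spark.sparkContext.applicationId
conf_rows = [
    Row(applicationId=app_id, key=k, value=v)
    for k, v in spark.sparkContext.getConf().getAll()
]
spark.createDataFrame(conf_rows).write.mode("append").saveAsTable("job_doctor.spark_conf")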
2. Normalization Job (Spark / Data Pipeline)
Read raw diagnostics from Storage via a Lakehouse shortcut.
Parse and flatten the records into per-task metrics.
3. Diagnosis Job (Notebook + Azure AI)
For each new (or most expensive / slowest) application:
Pull stage metrics, issues, configs, and query plans from Lakehouse.
Construct a structured prompt.
Call your Azure AI / Azure OpenAI endpoint from a Fabric Spark notebook.
Store the markdown diagnosis in job_doctor.diagnoses.
4. User Experience
Fabric Notebook
A “Run Job Doctor” cell or button that takes applicationId, calls the model, and displays the markdown inline.
Data Pipeline / Job
Scheduled daily to scan all runs from yesterday and generate diagnoses automatically.
Power BI Report in Fabric
“Spark Job Health” dashboard showing:
Top slowest/most expensive jobs
Detected issues (skew, large shuffle, spill, config problems)
AI recommendations, side-by-side with raw metrics
Everything lives in one Fabric workspace, using:
Lakehouses for data
Spark notebooks / pipelines for processing
Azure AI models for reasoning
Power BI for visualization
Why a Fabric-Specific Job Doctor Is Worth Building
Spark is Spark, but in Fabric the story is different:
Spark jobs are tied closely to Lakehouses, Pipelines, Dataflows, and Power BI.
You already have a single control plane for capacity, governance, cost, and monitoring.
Logs, metrics, and reports can live right next to the workloads they describe.
That makes Fabric an ideal home for a Job Doctor:
No extra infrastructure to stand up
No random side services to glue together
The telemetry you need is already flowing; you just have to catch and shape it
AI can sit directly on top of your Lakehouse + monitoring data
With some Spark, a few Lakehouse tables, and an LLM, you can give every data engineer and analyst in your organization a “Spark performance expert” that’s always on call.
I’ve included a sample notebook you can use to get started on your Job Doctor today!
This post was created with help from (and was suggested to me by) ChatGPT Pro using the 5.1 Thinking Model