Fabric Spark’s Native Execution Engine: What Speeds Up, What Falls Back, and What to Watch

The Production Migration Checklist for Fabric's Native Execution Engine

You have been running Spark on the JVM for years. It works. Your pipelines finish before the SLA alarm fires, your data scientists get their DataFrames, and you have learned to live with the garbage collector the way one learns to coexist with a roommate who occasionally rearranges all the furniture at 3 AM.

Then Microsoft shipped the Native Execution Engine for Fabric Spark, and the pitch is seductive: swap the JVM’s row-at-a-time processing for a vectorized C++ execution layer built on Meta’s Velox and Apache Gluten, get up to 6x faster query performance on compute-heavy workloads, change zero lines of code, pay nothing extra. Microsoft’s TPC-DS benchmarks at 1 TB scale show roughly 4x improvement over vanilla open-source Spark. Internal Fabric workloads have hit 6x.

Those are real numbers. But “flip the switch and go faster” is a marketing sentence, not an engineering plan. What follows is the checklist your team needs to move production Spark workloads onto the Native Execution Engine without discovering exciting new failure modes at 2 AM on a Tuesday.

Prerequisite Zero: Understand What You Are Opting Into

The Native Execution Engine does not replace Spark. It replaces Spark’s JVM-based physical execution operators — the actual computation — with native C++ equivalents for supported operations. Everything above the physical plan remains untouched: SQL parsing, logical optimization, cost-based rewrites, adaptive query execution, predicate pushdown, column pruning. None of that moves.

Here is the handoff in concrete terms. Spark produces its optimized physical plan as it always has. Apache Gluten intercepts that plan, identifies which operators have native C++ implementations in Velox, and swaps those nodes out. Velox executes them using columnar batches and SIMD instructions, processing 8, 16, or 32 values per CPU instruction instead of iterating row by row through JVM objects.

For operators Velox does not yet support, the engine falls back to standard Spark execution. The transition at the native/JVM boundary requires columnar-to-row and row-to-columnar conversions. These conversions cost real time. A workload that triggers frequent fallbacks can run slower with the engine enabled than without it.

That last sentence matters more than the benchmark numbers. The Native Execution Engine is a selective replacement of physical operators, not a uniform accelerator. Your performance outcome depends on how much of your workload stays in native territory.

Step 1: Confirm You Are on Runtime 1.3

The engine requires Fabric Runtime 1.3 (Apache Spark 3.5, Delta Lake 3.2). Runtime 1.2 support has been discontinued — and here is the dangerous part — silently. If you are still on 1.2, native acceleration is disabled without warning. You will not get an error. You will get no speedup. You will blame the engine rather than your runtime version. Check this first.

Action items:
– Open each Fabric workspace running production Spark workloads
– Navigate to Workspace Settings → Data Engineering/Science → Spark Settings
– Confirm Runtime 1.3 is selected
– If you are on Runtime 1.2, plan the runtime upgrade as a separate migration with its own validation cycle. Spark 3.4 to 3.5 brings behavioral changes unrelated to the native engine, and you do not want to debug two migrations at once

Step 2: Audit Your Workloads

Not every job benefits equally. The engine does its best work on compute-intensive analytical queries — aggregations, joins, filters, projections, complex expressions — over Parquet and Delta data. It adds less to I/O-bound workloads or jobs dominated by Python UDFs that run outside the Spark execution engine entirely.

Build a four-tier inventory:

  • Tier 1 — High-value candidates: Long-running batch ETL with heavy aggregations and joins over Delta tables. These are your biggest CU consumers and your biggest potential beneficiaries. Think: the nightly pipeline that computes vendor aggregates across three years of transaction data, currently consuming 45 minutes of a large cluster.
  • Tier 2 — Likely beneficiaries: Interactive notebooks running analytical queries. Data science feature engineering pipelines that stack transformations before model training.
  • Tier 3 — Uncertain: Workloads using exotic operators, deeply nested struct types, or heavy UDF logic. These need individual testing because you cannot predict fallback behavior from the code alone.
  • Tier 4 — Skip for now: Streaming workloads, jobs dominated by external API calls, or workloads where Python UDF processing accounts for most of the wall-clock time.

Migrate Tier 1 first. You need evidence that the engine delivers measurable wins on your actual workloads before you spend political capital rolling it out everywhere.

Step 3: Create a Non-Production Test Environment

Do not enable the engine on production and hope. Create a dedicated Fabric environment:

  1. In the Fabric portal, create a new Environment item
  2. Navigate to the Acceleration tab
  3. Check Enable native execution engine
  4. Save and Publish

Attach this environment to a non-production workspace. Run your Tier 1 workloads against it using production-scale data. This matters: performance characteristics at 10 GB do not predict behavior at 10 TB, because operator fallback patterns depend on data distributions, not just query structure.

For quick per-notebook testing without a full environment, drop this in your first cell:

%%configure
{
  "conf": {
    "spark.native.enabled": "true"
  }
}


This takes effect immediately — no session restart required — which makes A/B comparisons trivial.
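Because the flag flips without a restart, the A/B comparison can live in a tiny harness. Here is a minimal sketch in plain Python, where `run_job` and `set_flag` are stand-ins for your actual workload action and for `spark.conf.set` (both names are illustrative):

```python
import time

def ab_compare(run_job, set_flag, flag="spark.native.enabled", runs=3):
    """Time run_job() with the flag off and on; return (off_avg, on_avg) seconds.

    run_job:  zero-arg callable executing the workload (e.g. an action
              like df.count() in a real notebook).
    set_flag: callable(key, value) standing in for spark.conf.set.
    """
    results = {}
    for value in ("false", "true"):
        set_flag(flag, value)
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            run_job()
            timings.append(time.perf_counter() - start)
        results[value] = sum(timings) / runs
    return results["false"], results["true"]

# Hypothetical usage with a dummy workload and a dict instead of spark.conf:
state = {}
off_avg, on_avg = ab_compare(lambda: None, lambda k, v: state.update({k: v}))
```

Run each configuration more than once; the first run pays warm-up costs that would otherwise bias the comparison.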

Step 4: Measure Baselines

You cannot prove improvement without a baseline. For each Tier 1 workload, capture:

  • Wall-clock duration from the Spark UI (total job time, not stage time — stage time ignores scheduling and shuffle overhead)
  • CU consumption from Fabric monitoring (this is what your budget cares about)
  • Spark Advisor warnings in the current state, so you can distinguish new warnings from pre-existing noise after enabling native execution
  • Row counts and checksums on output tables — correctness verification requires a pre-migration snapshot

Store these in a Delta table. You will reference them for weeks.
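Baselines only pay off if the comparison is mechanical. A hypothetical helper (field names are illustrative, not a Fabric API) that computes the two deltas you will actually report:

```python
def compare_to_baseline(baseline, candidate):
    """Return speedup and CU savings for one workload.

    baseline / candidate: dicts with 'duration_s' (wall clock, seconds)
    and 'cu' (capacity units consumed), as captured in Step 4.
    """
    speedup = baseline["duration_s"] / candidate["duration_s"]
    cu_savings_pct = 100.0 * (baseline["cu"] - candidate["cu"]) / baseline["cu"]
    return {"speedup": round(speedup, 2), "cu_savings_pct": round(cu_savings_pct, 1)}

# Example: a 45-minute job drops to 18 minutes, CU drops from 1200 to 520
delta = compare_to_baseline(
    {"duration_s": 2700, "cu": 1200},
    {"duration_s": 1080, "cu": 520},
)
# delta == {"speedup": 2.5, "cu_savings_pct": 56.7}
```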

Step 5: Run Native and Watch for Fallbacks

Enable the engine on your test environment and run each Tier 1 workload. Then check two things.

Performance delta: Compare wall-clock time and CU consumption against your baselines. On a genuinely compute-heavy workload, you should see at least 1.5x improvement. If you do not, something is triggering fallbacks and you are paying the columnar-to-row conversion tax without getting the native execution benefit.

Fallback alerts: The Spark Advisor now reports real-time warnings during notebook execution when operators fall back from native to JVM execution. Each alert names the specific operator that could not run natively.

The most common fallback trigger, and the most easily fixed: .show(). This call invokes collectLimit and toprettystring, neither of which has a native implementation. Replace .show() with .collect() or .toPandas() in production code. In a notebook cell you run manually for debugging, it does not matter — but inside a scheduled pipeline, every fallback is a boundary crossing.

Other triggers to watch: unsupported expression types, complex nested struct operations, and certain window function variants. For each one, ask three questions:

  1. Can I rewrite the query to avoid it? Sometimes this is a one-line change. Sometimes it means restructuring a transformation.
  2. Is the fallback on a critical path? A fallback in a logging cell is noise. A fallback inside your core join-and-aggregate chain is a problem.
  3. Is the net performance still positive? If the workload runs 3x faster overall despite two fallback warnings on minor operations, accept the win and move on.

Step 6: Validate Data Correctness

Faster means nothing if the answers change. For each migrated workload:

  • Compare output row counts between native and non-native runs on identical input data
  • Run hash comparisons on key output columns
  • For financial or compliance-sensitive pipelines, do a full row-level diff on a representative partition

The Native Execution Engine preserves Spark semantics, but floating-point arithmetic at boundary conditions, null handling in edge cases, and row ordering in non-deterministic operations all deserve explicit verification on your actual data. Do not skip this step because the TPC-DS numbers looked good. TPC-DS does not have your data shapes.
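Because row ordering is non-deterministic, the checksum comparison has to be order-independent. One way to express that check in plain Python (in PySpark you would compute per-row hashes with built-in functions and aggregate them; this sketch shows the combining logic):

```python
import hashlib

def table_fingerprint(rows):
    """Order-independent fingerprint of a result set.

    Hash each row, then combine with addition modulo 2**256 so the result
    ignores row order but still changes if any row is added, dropped, or
    duplicated a different number of times.
    """
    acc = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        acc = (acc + int.from_bytes(digest, "big")) % (1 << 256)
    return acc

native_run = [(1, "a"), (2, "b"), (3, "c")]
jvm_run = [(3, "c"), (1, "a"), (2, "b")]  # same rows, shuffled order
assert table_fingerprint(native_run) == table_fingerprint(jvm_run)
```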

Step 7: Plan Your Rollback

The best operational property of the Native Execution Engine: it can be disabled per cell, per notebook, per environment, instantly. No restarts. No redeployments.

In PySpark:

spark.conf.set('spark.native.enabled', 'false')


In Spark SQL:

SET spark.native.enabled=FALSE;


Your rollback plan is one line of configuration. But that line only helps if your on-call engineers know it exists. Document it. Add it to your runbook. Add it to the incident response template. The worst production regression is one where the fix takes ten seconds but nobody knows about it for two hours.

Step 8: Roll Out Incrementally

With validation complete, enable the engine in production using one of three strategies, ordered from most cautious to broadest:

Option A — Per-job enablement: Add spark.native.enabled=true to individual Spark Job Definitions or notebook configure blocks. You control exactly which workloads get native execution.

Option B — Environment-level: Navigate to your production Environment → Acceleration tab → enable. All notebooks and Spark Job Definitions using this environment inherit the setting.

Option C — Workspace default: Set your native-enabled environment as the workspace default via Workspace Settings → Data Engineering/Science → Environment. Everything in the workspace picks it up.

Start with Option A on your validated Tier 1 workloads. After a week of stable production runs, graduate to Option B. Option C is for teams that have fully validated their workspace and want blanket coverage.

Step 9: Monitor the First Week

Post-migration monitoring matters because production data is not test data. In the first week:

  • Watch CU consumption trends in Fabric monitoring. Compute-heavy workloads should show measurable drops.
  • Check the Spark Advisor for fallback warnings that did not appear during testing. Different data distributions or code paths in production can trigger different operators.
  • Set alerts on job duration. A sudden increase means a new fallback or regression appeared.
  • Pay attention to any jobs that were borderline in testing. Production-scale data volume can push a workload from “mostly native” to “mostly fallback” if it exercises operators that were uncommon in test data.
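The duration alert in the list above reduces to one comparison per run against the baselines you stored in Step 4. A sketch (the tolerance and the messages are illustrative policy, not a platform feature):

```python
def check_duration(workload, duration_s, baselines, tolerance=1.3):
    """Flag a run whose wall clock exceeds its baseline by more than tolerance x.

    baselines: dict mapping workload name -> baseline duration in seconds,
    loaded from the Delta table written in Step 4.
    """
    baseline = baselines.get(workload)
    if baseline is None:
        return f"WARN: no baseline recorded for {workload}"
    if duration_s > baseline * tolerance:
        return (f"ALERT: {workload} took {duration_s:.0f}s, "
                f"baseline {baseline:.0f}s; check Spark Advisor for new fallbacks")
    return None

# 1100s against a 1080s baseline is inside the 1.3x band: no alert
assert check_duration("nightly_vendor_agg", 1100, {"nightly_vendor_agg": 1080}) is None
```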

Step 10: Optimize for Maximum Native Coverage

Once stable, push further:

  • Replace all .show() calls with .collect() or .display() in scheduled notebook workflows
  • Refactor deeply nested struct operations into flat columnar operations where the query logic allows it
  • Consult the Apache Gluten documentation for the current supported operator list and avoid unsupported expressions in hot paths
  • Keep data in Parquet or Delta format — the engine processes these natively, and other formats require conversion that erases the acceleration
  • For write-heavy workloads, leverage the GA-release native Delta write acceleration, which extends native execution into the output path rather than just the read and transform stages

What Does Not Change

Several things remain identical and need no migration planning:

  • Spark APIs: Your PySpark, Scala, and SQL code is unchanged. The engine operates below the API surface.
  • Delta Lake semantics: ACID transactions, time travel, schema enforcement — all handled by the same Delta Lake 3.2 layer on Runtime 1.3.
  • Cost model: No additional CU charges. Your jobs finish faster, so you consume fewer CUs for the same work. The pricing advantage is indirect but real.
  • Fault tolerance: Spark still manages task retries, stage recovery, and speculative execution. The native engine handles computation; Spark handles resilience.

The Bottom Line

The Native Execution Engine is GA. It runs on the standard Fabric runtime. The performance gains are backed by reproducible benchmarks — up to 4x on TPC-DS at 1 TB, with real-world analytical workloads frequently reaching 6x. It costs nothing to enable and one line of configuration to revert.

But there is a gap between “we turned it on and things got faster” and “we know exactly which workloads got faster, by how much, what fell back, and what to do when something breaks.” The checklist above bridges that gap.

Runtime 1.3. Audit. Test environment. Baselines. Fallbacks. Correctness. Rollback. Incremental rollout. Monitor. Optimize.

Ten steps. Zero heroics. Measurably faster Spark.

This post was written with help from anthropic/claude-opus-4-6

Open Mirroring + OneLake: Spark patterns that keep latency from eating your weekends


Dev is clean. Prod is chaos. In dev, your mirrored table has a cute little dataset and Spark tears through it. In prod, that same notebook starts wheezing like it ran a marathon in wet jeans.

If that sounds familiar, good. You’re not cursed. You’re running into architecture debt that Open Mirroring does not solve for you.

Open Mirroring in Microsoft Fabric does exactly what it says on the tin: it replicates data from external systems into OneLake as Delta tables, and schema changes in the source can flow through. That’s huge. It cuts out a pile of ingestion plumbing.

But mirroring only lands data. It does not guarantee your Spark reads will be fast, stable, or predictable. That’s your job.

This post is the practical playbook: what breaks, why it breaks, and the patterns that keep your Spark jobs from turning into slow-motion disasters.

first principle: mirrored is a landing zone, not a serving layer

Treat mirrored tables like an airport runway. Planes touch down there. People do not set up a picnic on the tarmac.

When teams read mirrored tables directly in hot-path jobs, they inherit whatever file layout the connector produced. Sometimes that layout is fine. Sometimes it is a junk drawer.

Spark is sensitive to this. Reading many tiny files creates scheduling and metadata overhead. Reading a few huge files kills parallelism. Either way, the cluster burns time doing the wrong work.

The fix is boring and absolutely worth it: add a curated read layer.

  1. Let Open Mirroring write into a dedicated mirror lakehouse.
  2. Run a post-mirror notebook that reshapes data for Spark (partitioning, compaction, cleanup).
  3. Have production notebooks read curated tables only.

One extra hop. Much better nights of sleep.

what actually causes the latency cliff

Two things usually punch you in the face at scale:

  • File layout drift
  • Schema drift

Let’s tackle them in order.

1) file layout drift (the silent killer)

Spark scheduling is roughly file-driven for Parquet/Delta scans. That means file shape becomes execution shape. If your table has wildly uneven files, your job speed is set by the stragglers.

Think of ten checkout lanes where nine customers have one item and one customer has a full garage sale cart. Everyone waits on that last lane.

Start by measuring file distribution, not just row counts.

from pyspark.sql import functions as F

# NOTE: inputFiles() returns a Python list of file paths
df = spark.read.format("delta").load("Tables/raw_mirrored_orders")
paths = df.inputFiles()

# Use Hadoop FS to get file sizes in bytes
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)

sizes = []
for p in paths:
    size = fs.getFileStatus(jvm.org.apache.hadoop.fs.Path(p)).getLen()
    sizes.append((p, size))

size_df = spark.createDataFrame(sizes, ["path", "size_bytes"])

size_df.select(
    F.count("*").alias("file_count"),
    F.round(F.avg("size_bytes")/1024/1024, 2).alias("avg_mb"),
    F.round(F.expr("percentile_approx(size_bytes, 0.5)")/1024/1024, 2).alias("p50_mb"),
    F.round(F.expr("percentile_approx(size_bytes, 0.9)")/1024/1024, 2).alias("p90_mb"),
    F.round(F.max("size_bytes")/1024/1024, 2).alias("max_mb")
).show(truncate=False)

You want a tight-ish band, not chaos. A common rule of thumb is targeting roughly 128 MB to 512 MB Parquet files for balanced throughput and parallelism. Rule of thumb, not religion. Your workload decides final tuning.
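That rule of thumb turns into simple arithmetic when you plan a compaction: pick a target file size, derive the file count, and from that a maxRecordsPerFile setting. A sketch, with illustrative inputs:

```python
def compaction_plan(total_bytes, total_rows, target_file_mb=256):
    """Derive a target file count and rows-per-file for a compaction rewrite."""
    target_bytes = target_file_mb * 1024 * 1024
    file_count = max(1, -(-total_bytes // target_bytes))  # ceiling division
    rows_per_file = max(1, total_rows // file_count)       # feeds maxRecordsPerFile
    return file_count, rows_per_file

# A 200 GB table with 2 billion rows, targeting ~256 MB files
files, rows = compaction_plan(200 * 1024**3, 2_000_000_000)
# files == 800, rows == 2_500_000
```

The rows-per-file number is only a starting point; actual file sizes depend on compression ratios, so re-measure after the rewrite.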

Then enforce a sane shape in curated tables:

raw = spark.read.format("delta").load("Tables/raw_mirrored_orders")

(raw.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")         # choose columns your queries actually filter on
    .option("maxRecordsPerFile", 500000)
    .save("Tables/curated_orders"))

spark.sql("OPTIMIZE delta.`Tables/curated_orders`")

If your queries filter by date and region, but you partition by customer_id because it “felt right,” you built a latency trap with your own hands.

2) schema drift (the 3 a.m. pager)

Open Mirroring can propagate source schema changes. That’s useful and dangerous.

Useful because your lake stays aligned. Dangerous because downstream logic often assumes a fixed shape.

A nullable column addition is usually fine. A type shift on a key metric column can quietly corrupt aggregations or explode at runtime.

Do not “notice this later.” Gate on it.

from pyspark.sql.types import StructType
import json

# Store baseline schema as JSON in Files/schemas/orders_baseline.json
with open("/lakehouse/default/Files/schemas/orders_baseline.json", "r") as f:
    baseline = StructType.fromJson(json.load(f))

current = spark.read.format("delta").load("Tables/raw_mirrored_orders").schema

base = {f.name: str(f.dataType) for f in baseline.fields}
curr = {f.name: str(f.dataType) for f in current.fields}

type_changes = [
    f"{name}: {base[name]} -> {curr[name]}"
    for name in curr
    if name in base and base[name] != curr[name]
]

new_cols = [name for name in curr if name not in base]

if type_changes:
    raise ValueError(f"Schema type changes detected: {type_changes}")

# Optional policy: allow new nullable columns but log them
if new_cols:
    print(f"New columns detected: {new_cols}")

Policy matters more than code here. Decide in advance what is auto-accepted versus what blocks the pipeline. Write it down. Enforce it every run.

lag is real, even when everything is healthy

Mirroring pipelines are replication systems, not teleportation devices. There is always some delay between source commit and mirrored availability. Sometimes tiny. Sometimes not.

If your job blindly processes “last hour” windows without checking mirror freshness, you’ll create holes and call them “data quality issues” three weeks later.

Add a freshness gate before transformations. The metadata source is connector-specific, but the pattern is universal:

from datetime import datetime, timedelta, timezone

# Example only: use the metadata table/view exposed by your mirroring setup
last_mirror_ts = spark.sql("""
  SELECT max(replication_commit_ts) as ts
  FROM mirror_metadata.orders_status
""").collect()[0]["ts"]

required_freshness = datetime.now(timezone.utc) - timedelta(minutes=15)

if last_mirror_ts is None or last_mirror_ts < required_freshness:
    raise RuntimeError(
        f"Mirror not fresh enough. Last commit: {last_mirror_ts}, required after: {required_freshness}"
    )

No freshness, no run. That one line saves you from publishing confident nonsense.

the production checklist (use this before go-live)

Before promoting any mirrored-data Spark pipeline, run this checklist in the same capacity and schedule window as production:

  • File shape check
    – Measure file count and distribution (p50, p90, max).
    – If distribution is ugly, compact and rewrite in curated.

  • Partition sanity check
    – Confirm partitions match real filter predicates.
    – Use df.explain(True) and verify PartitionFilters is not empty for common queries.

  • Schema gate check
    – Compare current schema to baseline.
    – Fail on type changes unless explicitly approved.

  • Freshness gate check
    – Validate mirrored data is fresh enough for your downstream SLA.
    – Fail fast if not.

  • Throughput reality check
    – Time representative full and filtered scans from curated tables.
    – If runtime misses SLA, fix layout first, then tune compute.
If you only do one thing from this list, do the curated layer. Direct reads from mirrored tables are the root of most performance horror stories.

architecture that holds up when volume gets ugly

Keep it simple:

  1. Mirror layer
    Open Mirroring lands source data in OneLake Delta tables. This is your raw replica.

  2. Curation job
    A scheduled Spark notebook validates schema, reshapes partitions, and compacts files.

  3. Curated layer
    Downstream Spark notebooks and SQL consumers read here, not from mirror tables.

  4. Freshness gate
    Every downstream run checks replication freshness before processing.

That’s it. No heroics. No mystery knobs. Just a clean boundary between “data landed” and “data is ready to serve.”

Open Mirroring is genuinely powerful, but it is not magic. If you treat mirrored tables as production-ready serving tables, latency will eventually kneecap you. If you treat them as a landing zone and curate aggressively, Spark behaves, stakeholders stay calm, and your weekends stay yours.

This post was written with help from anthropic/claude-opus-4-6

What “Execute Power Query Programmatically” Means for Fabric Spark Teams


Somewhere in a Fabric workspace right now, two teams are maintaining the same transformation twice.

The BI team owns it in Power Query. The Spark team rewrote it in PySpark so a notebook could run it on demand. Both versions work. Both versions drift. Both versions break at different times.

That was normal.

Microsoft’s new Execute Query API (preview) is the first real shot at ending that duplication. It lets you execute Power Query (M) through a public REST API from notebooks, pipelines, or any HTTP client, then stream results back in Apache Arrow format.

For Spark teams, this isn’t a minor feature. It changes where transformation logic can live.

What actually shipped

At a technical level, the API is simple:

  • Endpoint: POST /v1/workspaces/{workspaceId}/dataflows/{dataflowId}/executeQuery
  • Input: a queryName, with optional customMashupDocument (full M script)
  • Output: Arrow stream (application/vnd.apache.arrow.stream)
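Putting that input shape into code, a hypothetical helper that builds the request body (the field names come from the endpoint description above; the query name used in the example is purely illustrative):

```python
def build_execute_query_body(query_name, custom_mashup=None):
    """Body for POST .../dataflows/{dataflowId}/executeQuery.

    query_name:    name of a query defined in the Dataflow Gen2 artifact.
    custom_mashup: optional full M script (customMashupDocument); note that
                   inline M may not contain native database queries.
    """
    body = {"queryName": query_name}
    if custom_mashup is not None:
        body["customMashupDocument"] = custom_mashup
    return body

body = build_execute_query_body("DimCustomer")
# body == {"queryName": "DimCustomer"}
```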

The execution context comes from a Dataflow Gen2 artifact in your workspace. Its configured connections determine what data sources the query can access and which credentials are used.

That single detail matters more than it looks. You’re not just “calling M from Spark.” You’re running M under dataflow-governed connectivity and permissions.

Why Spark engineers should care

Before this API, Spark teams usually had two options:

  • Rewrite M logic in PySpark
  • Or wait for a dataflow refresh and consume the output later

Neither is great. Rewrites create long-term maintenance debt. Refresh handoffs add latency and orchestration fragility.

Now you can execute the transformation inline and keep moving.

A minimal call path looks like this:

import requests
import pyarrow as pa

# url:     the executeQuery endpoint for your workspace and dataflow
# headers: bearer token plus Content-Type: application/json
response = requests.post(url, headers=headers, json=request_body, stream=True)
response.raise_for_status()

# The body is an Arrow IPC stream; read it straight into pandas
with pa.ipc.open_stream(response.raw) as reader:
    pandas_df = reader.read_pandas()

spark_df = spark.createDataFrame(pandas_df)

No CSV hop. No JSON schema drift. No custom parsing layer.

The non-negotiable constraints

This feature is useful, but it is not magic. There are hard boundaries.

  1. 90-second timeout
    – Query evaluations must complete within 90 seconds.
    – This is ideal for fast lookups, enrichment, and reference joins—not heavy batch reshaping.

  2. Read-only execution
    – The API executes queries only. It doesn’t support write actions.
    – If your notebook flow assumes “query + write” in one API step, redesign it.

  3. Native query rule for custom mashups
    – customMashupDocument does not allow native database queries.
    – But if a query defined inside the dataflow itself uses native queries, that query can be executed.
    – This distinction will trip people if they treat inline M and stored dataflow queries as equivalent.

  4. Performance depends on folding and query complexity
    – Bad folding or expensive transformations can burn your 90-second window quickly.
    – You need folding-aware query reviews before production rollout.

Practical rollout plan for Spark teams

If you lead a Fabric Spark team, do this in order.

1) Inventory duplication first

Build a short list of transformations currently duplicated between M and PySpark. Start with transformations that are stable, reused often, and mostly read-oriented.

2) Stand up a dedicated execution dataflow

Create one Dataflow Gen2 artifact specifically for API-backed execution contexts.

  • Keep connections explicit and reviewed
  • Restrict who can modify those connections
  • Treat the artifact like infrastructure, not ad hoc workspace clutter

3) Wrap Execute Query behind one notebook utility

Don’t let every notebook hand-roll HTTP logic. Create one shared helper that handles:

  • token acquisition
  • request construction
  • Arrow stream parsing
  • error handling
  • timeout/response logging

If the API returns 202 (long-running operation), honor Location and Retry-After instead of guessing polling behavior.
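Honoring Location and Retry-After is mechanical once it lives in the shared helper. A sketch with the HTTP call injected so the retry logic stays testable; the status handling is illustrative, so check the API reference for the exact long-running-operation contract:

```python
import time

def poll_long_running(first_response, get, max_wait_s=90):
    """Follow a 202 long-running-operation response until completion.

    first_response: object exposing .status_code and .headers, treated
                    like a requests.Response.
    get:            callable(url) performing the follow-up GET.
    """
    resp = first_response
    waited = 0.0
    while resp.status_code == 202:
        url = resp.headers["Location"]
        delay = float(resp.headers.get("Retry-After", "2"))
        if waited + delay > max_wait_s:
            raise TimeoutError(f"operation still running after {waited:.0f}s")
        time.sleep(delay)
        waited += delay
        resp = get(url)
    return resp
```

Injecting `get` also means the helper can log every poll in one place, which is exactly the timeout/response logging the bullet list above asks for.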

4) Add governance checks before scale

Because execution runs under dataflow connection scope, validate:

  • who can execute
  • what connections they indirectly inherit
  • which data sources become reachable through that path

If your governance model assumes notebook identity is the only control plane, this API changes that assumption.

5) Monitor capacity from day one

Microsoft surfaces this usage in Capacity Metrics as “Dataflows Gen2 Run Query API”, billed on the same meter family as Dataflow Gen2 refresh operations. Watch this early so you don’t discover new spend after adoption is already wide.

Where this fits (and where it doesn’t)

Use it when you need:

  • shared transformation logic between BI and engineering
  • fast, read-oriented query execution from Spark/pipelines/apps
  • connector and gateway reach already configured in dataflows

Avoid it when you need:

  • long-running transformations
  • write-heavy jobs
  • mission-critical production paths with zero preview risk tolerance

The REST API docs still mark this as preview and “not recommended for production use.” Treat that warning as real, not ceremonial.

The organizational shift hiding behind the API

The technical win is straightforward: fewer rewrites, faster integration, cleaner data handoffs.

The harder change is social.

When Spark notebooks can directly execute M, ownership lines between BI and data engineering need to be explicit. Who owns business logic? Who owns runtime reliability? Who approves connection scope?

Teams that answer those questions early will move fast.

Teams that don’t will just reinvent the same duplication problem with a new endpoint.


Source notes

This post was written with help from anthropic/claude-opus-4-6.

What the February 2026 Fabric Influencers Spotlight means for your Spark team


Microsoft published its February 2026 Fabric Influencers Spotlight last week. Twelve community posts. MVPs and Super Users. Most people skim the list. Maybe bookmark a link. Move on.

Don’t.

Three of those posts carry signals that should change how your Spark data-engineering team operates in production. Not next quarter. Now.

Signal 1: Get your production code out of notebooks

Matthias Falland’s Fabric Friday episode makes the case plainly: notebooks are great for development but risky in production. That framing resonates with a lot of production teams—and for good reason.

Here’s the nuance. Microsoft has said there’s no inherent difference in performance or monitoring capabilities between Spark Job Definitions and notebooks. Both produce Spark logs. Both run on the same compute. The gap isn’t in what the platform offers. It’s in what each artifact encourages.

Notebooks encourage improvisation. Someone edits a cell at 2 AM. Cell state carries between runs. An error gets swallowed inside an output cell and nobody notices until downstream tables go stale. That’s not a platform limitation. That’s a human-factors problem. And production environments are where human-factors problems become outages.

Spark Job Definitions push you toward cleaner habits. One file per job. No cell state. Explicit parameters. Better modularity. The execution boundary is sharper, and sharper boundaries make failures easier to diagnose.

If your team runs notebooks on a schedule through pipelines, here’s the migration:

  • Audit every notebook that runs on a schedule or gets triggered by a pipeline. Count them. You’ll be surprised.
  • Extract the transformation logic into standalone Python or Scala files. One file per job. No magic. No “run all cells.”
  • Create Spark Job Definitions for each. Map your existing notebook parameters to SJD parameters. They work the same way—just without the cell baggage.
  • Wire them into your pipeline activities. Replace the notebook activity with an SJD activity. The orchestration stays identical.
  • Keep the notebooks for development and ad-hoc exploration. That’s where they shine.

A team of three can typically convert a dozen notebooks in a week. The hard part isn’t the migration. It’s the decision to start.

Signal 2: Direct Lake changes how you write to your lakehouse

Pallavi Routaray’s post on Direct Lake architecture is the most consequential piece in the whole spotlight. Easy to miss because the title sounds like a Power BI topic.

It’s not. It’s a Spark topic.

Direct Lake mode reads Parquet files directly from OneLake. No import copy. No DirectQuery overhead. But it only works well if your Spark jobs write data in a way that Direct Lake can consume efficiently. Get the file layout wrong and your semantic model falls back to DirectQuery silently. Performance craters. Your BI team blames you. Nobody knows why.

Here’s the production checklist:

  • Enable V-Order optimization on your Delta tables. V-Order sorts and compresses Parquet files for Direct Lake's columnar read path. Here's the catch: V-Order is disabled by default in new Fabric workspaces, a default chosen to favor write-heavy data engineering workloads. If your workspace was created recently, you need to enable it explicitly. Check your workspace settings, or set it at the table property level. Don't assume it's on.
  • Control your file sizes. Microsoft’s guidance is clear: keep the number of Parquet files small and use large row groups. If your Spark jobs produce thousands of tiny files, Direct Lake will hit its file-count limits and fall back. Run OPTIMIZE on your Delta tables after write operations. Compact aggressively.
  • Partition deliberately. Over-partitioning creates too many small files. Under-partitioning creates files that are too large for efficient column pruning. Partition by the grain your BI team actually filters on. Ask them. Don’t guess.
  • Watch for schema drift. Direct Lake models bind to specific columns at creation time. If your Spark job adds or renames a column, the semantic model breaks. Coordinate schema changes explicitly. No silent ALTER TABLE commands on Friday afternoons.
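
The first two checklist items, enabling V-Order and compacting files, fit in a single notebook cell. A hedged sketch: the table property name follows Microsoft’s Fabric documentation, and fact_sales is an illustrative table name, so verify both against your runtime before relying on this.

```python
def direct_lake_prep_statements(table: str) -> list[str]:
    """Spark SQL to enable V-Order on a Delta table and compact its files.

    The property name 'delta.parquet.vorder.enabled' follows Fabric's
    documentation; confirm it against your runtime version before use.
    """
    return [
        f"ALTER TABLE {table} "
        "SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')",
        f"OPTIMIZE {table}",
    ]

# In a Fabric notebook, execute each statement with spark.sql(...):
#   for stmt in direct_lake_prep_statements("fact_sales"):
#       spark.sql(stmt)
```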

The big risk here: most Spark teams don’t know their output feeds a Direct Lake model. The BI team built it after the fact. Start by mapping which of your Delta tables have Direct Lake semantic models sitting on top. If you don’t know, find out today.

Signal 3: CI/CD for Fabric just got real

Kevin Chant’s post covers the fabric-cicd tool reaching general availability for configuration-based deployments with Azure DevOps. This is verified and it matters more than it sounds.

Until now, deploying Fabric artifacts across environments—dev, test, prod—was either manual or held together with custom scripts that broke every time the API changed. The fabric-cicd tool gives you a supported, versioned path.

For Spark teams:

  • Your Spark Job Definitions, lakehouse configurations, and pipeline definitions can live in source control and deploy through a proper pipeline. No more “I’ll just update it in the portal.”
  • Configuration differences between environments—connection strings, capacity settings, lakehouse names—get handled through configuration files. Not by editing items in the portal after deployment.
  • You can roll back. You can diff. You can review before promoting to production. The basic hygiene that every other engineering discipline has had for decades.

Here’s the migration path:

  • Install fabric-cicd from the latest release. Follow Chant’s posts for the Azure DevOps YAML pipeline specifics.
  • Export your existing workspace items to a Git repository. Fabric’s Git integration handles this natively.
  • Build your environment-specific configuration files. One per environment. Map the items that differ: capacity, lakehouse, connections.
  • Set up your Azure DevOps pipeline to run fabric-cicd on merge to main. Start with dry-run mode until you trust it.
  • Remove portal-level edit access for production workspaces. This is the hard step. It’s also the one that prevents the next outage.

The deeper pattern

These three signals connect. Falland tells you to move your Spark code into artifacts built for production discipline. Routaray tells you how to write your output so downstream models don’t silently degrade. Chant tells you how to deploy the whole thing reliably across environments.

That’s a production pipeline. End to end. Code that runs cleanly, writes data correctly, and deploys safely.

The February spotlight also includes Open Mirroring hands-on guidance from Inturi Suparna Babu and a Fabric Data Agent walkthrough from Shubham Rai. Both are worth a read if you’re evaluating data replication strategies or AI-assisted query patterns over your lakehouse. But for Spark teams running production workloads, the three signals above are where the action is.

Your rollout checklist for March

  1. Inventory all scheduled notebooks. Tag them by risk: frequency, data volume, downstream dependencies.
  2. Convert the highest-risk notebook to a Spark Job Definition this week. Validate it runs identically.
  3. Audit Delta table write patterns for any table that feeds a Direct Lake model. Check that V-Order is enabled. Run OPTIMIZE to compact files.
  4. Install fabric-cicd. Connect your workspace to Git. Build your first environment config.
  5. Pick one pipeline to deploy through CI/CD end-to-end. Prove it works before scaling.

Five items. All concrete. All doable in March.

The community did the research. Your job is to act on it.

This post was written with help from anthropic/claude-opus-4-6

Keeping Spark, OneLake, and Mirroring Reliable in Microsoft Fabric

The alert fired at 2:14 AM on a Tuesday. A downstream Power BI report had gone stale — the Direct Lake dataset hadn’t refreshed in six hours. The on-call engineer opened the Fabric monitoring hub and found a cascade: three Spark notebooks had completed without triggering downstream freshness checks, a mirrored database was five hours behind, and the OneLake shortcut connecting them was returning intermittent 403 errors. It went undetected until a VP’s morning dashboard showed yesterday’s numbers.

That scenario is stressful, but it’s also solvable. These issues are usually about observability gaps between services, not broken fundamentals. If you’re running Spark workloads against OneLake with mirroring in Microsoft Fabric, you’ll likely encounter some version of this under real load. The key is having an operational playbook before it happens.

What follows is that playbook — assembled from documented production incidents, community post-mortems, and repeatable operating patterns from teams running this architecture at scale.

How Spark, OneLake, and mirroring connect (and where they don’t)

The dependency chain matters because issues can cascade through it in non-obvious ways.

Your Spark notebooks write Delta tables to OneLake lakehouses. Those tables might feed Direct Lake datasets in Power BI. Separately, Mirroring can replicate data from external sources — Azure SQL Database, Cosmos DB, Snowflake, and others — into OneLake as Delta tables. Shortcuts bridge lakehouses or reference external storage.

What makes this operationally nuanced: each layer has its own retry logic, auth tokens, and completion semantics. A Spark job can succeed from its own perspective (exit code 0, no exceptions) while the data it wrote is temporarily unavailable to downstream consumers because of a metadata sync delay. Mirroring can pause during source throttling and may not raise an immediate alert unless you monitor freshness directly. Shortcuts can go stale when target workspace permissions change.

You can end up with green pipelines and incomplete data. The gap between “the job ran” and “the data arrived correctly” is where most reliability work lives.

Detection signals you actually need

The first mistake teams make is relying on Spark job status alone. A job that completes successfully but writes zero rows, silently absorbs schema drift, or writes to the wrong partition is still a data quality issue.

Here’s what to watch instead:

Row count deltas. After every notebook run, compare the target table’s row count against expected intake. It doesn’t need to be exact — a threshold works. If the Delta table grew by less than 10% of its average daily volume, fire a warning. Three lines of Spark SQL at the end of your notebook. Five minutes to implement. It prevents empty-table surprises at 9 AM.
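
The threshold check can be sketched as a pure helper. The 10% floor and all names are illustrative; in a notebook, the two inputs would come from spark.sql("SELECT COUNT(*) ...") calls before and after the write.

```python
def row_count_alert(today_rows: int, avg_daily_rows: float,
                    min_ratio: float = 0.10) -> bool:
    """Return True if today's growth looks suspiciously small.

    today_rows: rows added by this run (count after minus count before).
    avg_daily_rows: rolling average daily intake for this table.
    min_ratio: warn when growth falls below this fraction of the average.
    """
    if avg_daily_rows <= 0:
        # No baseline yet: only flag a completely empty write.
        return today_rows == 0
    return today_rows < min_ratio * avg_daily_rows
```

Fire a warning (or fail the run outright, if the table is critical) whenever this returns True.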

OneLake file freshness. The _delta_log folder in your lakehouse tables contains JSON commit files with timestamps. If the most recent commit is older than your pipeline cadence plus a reasonable buffer, investigate. A lightweight monitoring notebook that scans these timestamps across key tables takes about twenty minutes to build.
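
A minimal version of that scan, using file modification times as a cheap proxy for the timestamps inside the commit JSON. Names and paths are illustrative; in Fabric, the table roots would be OneLake paths under the lakehouse’s Tables mount.

```python
import os
import time


def stale_tables(table_paths: dict[str, str], max_age_seconds: float) -> list[str]:
    """Return tables whose newest _delta_log commit is older than max_age_seconds.

    table_paths maps a table name to its root directory. Uses filesystem
    mtime as a proxy; the commit JSON also carries its own timestamp field.
    """
    stale = []
    now = time.time()
    for name, root in table_paths.items():
        log_dir = os.path.join(root, "_delta_log")
        commits = [
            os.path.join(log_dir, f)
            for f in os.listdir(log_dir)
            if f.endswith(".json")
        ]
        newest = max((os.path.getmtime(p) for p in commits), default=0.0)
        if now - newest > max_age_seconds:
            stale.append(name)
    return stale
```

Set max_age_seconds to your pipeline cadence plus a buffer, and alert on a non-empty result.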

Mirroring lag via canary rows. The monitoring hub shows mirroring status, but the granularity is coarse. For external databases, set up a canary: a table in your source that gets a timestamp updated every five minutes. Check that timestamp on the OneLake side. If the gap exceeds your SLA, you know mirroring is stalled before your users do.
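
The comparison itself is one small function; the real work is wiring up the two reads. A sketch with illustrative names:

```python
from datetime import datetime, timedelta


def mirroring_lag_exceeded(source_heartbeat: datetime,
                           mirrored_heartbeat: datetime,
                           sla: timedelta) -> bool:
    """Compare the canary row's timestamp on both sides of mirroring.

    source_heartbeat: the canary timestamp read from the source database.
    mirrored_heartbeat: the same row's timestamp read from the OneLake replica.
    Returns True when the replica is further behind than the SLA allows.
    """
    return (source_heartbeat - mirrored_heartbeat) > sla
```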

Shortcut health checks. Nothing probes a shortcut continuously, so they can degrade quietly. A daily job that reads a single row from each shortcut target and validates the response catches broken permissions, expired SAS tokens, and misconfigured workspace references before they cause real damage.
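
A sketch of such a daily job. The actual reads are injected as callables so the check logic stays testable; in a notebook, each callable might be something like lambda: spark.read.format("delta").load(path).limit(1).count(). All names are illustrative.

```python
from typing import Callable


def check_shortcuts(readers: dict[str, Callable[[], int]]) -> dict[str, str]:
    """Probe each shortcut and report failures.

    readers maps a shortcut name to a zero-argument callable that reads one
    row through it and returns the row count. Returns {name: error} for
    every probe that raised an exception or read nothing.
    """
    failures = {}
    for name, read_one in readers.items():
        try:
            if read_one() < 1:
                failures[name] = "empty result"
        except Exception as exc:  # permission errors surface here
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures
```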

Failure mode 1: the Spark write that succeeds but isn’t queryable yet

You’ll see this in Fabric notebook logs as a clean run. The Spark job processed data, performed transformations, called df.write.format("delta").mode("overwrite").save(). Exit code 0. But the data isn’t queryable from the SQL analytics endpoint, and Direct Lake still shows stale numbers.

What happened: the SQL analytics endpoint runs a separate metadata sync process that detects changes committed to lakehouse Delta tables. According to Microsoft’s documentation, under normal conditions this lag is less than one minute. But it can occasionally fall behind — sometimes significantly. The Fabric community has documented sync delays stretching to hours, particularly during periods of high platform load or when tables have large numbers of partition files.

This is the gap that catches teams off guard. The Delta commit landed in storage, but the SQL endpoint hasn’t picked it up yet.

Triage sequence:

  1. Open the lakehouse in Fabric and check the table directly via the lakehouse explorer. If the data appears there but not in the SQL endpoint, you’ve confirmed a metadata sync lag.
  2. Check Fabric capacity metrics. If your capacity is throttled (visible in the admin portal under capacity management), metadata sync can be deprioritized. Burst workloads earlier in the day can surface as sync delays later.
  3. Force a manual sync. In the SQL analytics endpoint, select “Sync” from the table context menu. You can also trigger this programmatically — Microsoft released a Refresh SQL Analytics Endpoint Metadata REST API (preview as of mid-2025), and it’s also available through the semantic-link-labs Python package.

Remediation: Add a post-write validation step to your notebooks. After writing the Delta table, wait 30 seconds, then query the SQL analytics endpoint for the max timestamp or row count. If it doesn’t match what you wrote, log a warning and retry the sync. If after three retries it still diverges, fail the pipeline explicitly so your alerting catches it. Don’t let a successful Spark job mask a downstream data gap.
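
That remediation loop might look like the following sketch. The two callables stand in for a SQL-endpoint count query and a sync trigger (for example, via the refresh REST API); every name here is illustrative.

```python
import time
from typing import Callable


def verify_endpoint_sync(expected_rows: int,
                         query_endpoint_count: Callable[[], int],
                         trigger_sync: Callable[[], None],
                         retries: int = 3,
                         wait_seconds: float = 30.0) -> None:
    """Fail the pipeline explicitly if the SQL endpoint never catches up.

    query_endpoint_count: reads the row count via the SQL analytics endpoint.
    trigger_sync: forces a metadata sync (e.g. the refresh REST API).
    Raises RuntimeError after `retries` unsuccessful attempts.
    """
    for _ in range(retries):
        time.sleep(wait_seconds)
        if query_endpoint_count() == expected_rows:
            return
        trigger_sync()
    raise RuntimeError(
        f"SQL endpoint still diverges from Delta table after {retries} sync attempts"
    )
```

Raising instead of logging is the point: a successful Spark job should not be allowed to mask a downstream data gap.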

Failure mode 2: mirroring goes quiet

Mirroring is genuinely useful for getting external data into OneLake without building custom pipelines. But one common reliability pattern is that replication can stall when the source system throttles or times out, and the monitoring hub may still show “Running” while data freshness drifts.

This pattern is often observed with Azure SQL Database sources during heavy read periods. The mirroring process opens change tracking connections that compete with production queries. When the source database gets busy, it can throttle the mirroring connection, and Fabric retry logic may back off for extended periods without immediately surfacing a hard error.

Triage sequence:

  1. Check mirroring status in the monitoring hub, but prioritize the “Last synced” timestamp over the status icon. “Running” with a last-sync time of four hours ago still indicates a problem.
  2. Check the source database’s connection metrics. If you’re mirroring from Azure SQL, look at DTU consumption and connection counts around the time mirroring lag increased. There’s often a correlation with a batch job or reporting burst.
  3. Inspect table-level mirroring status. Individual tables can fall behind while others sync normally. The monitoring hub aggregates this, which can hide partial lag.

Remediation: The canary-row pattern is your early warning system. For prevention, stagger heavy source-database workloads away from mirroring windows. If your Azure SQL is Standard tier, increasing DTU capacity or moving to Hyperscale gives mirroring more room. On the Fabric side, stopping and restarting mirroring resets the connection and forces a re-sync when retry backoff has become too aggressive.

Failure mode 3: shortcut permissions drift

Shortcuts are the connective tissue of OneLake — references across lakehouses, workspaces, and external storage without copying data. They deliver huge flexibility, but they benefit from explicit permission and token hygiene.

A common failure pattern: a shortcut that worked for months suddenly returns 403 errors or empty results. Spark notebooks that read from the shortcut either fail with ADLS errors or complete with zero rows if downstream checks aren’t strict.

Root causes, ranked by observed frequency in the field:

  1. A workspace admin changed role assignments, and the identity the shortcut was created under lost access. Usually accidental.
  2. For ADLS Gen2 shortcuts: the SAS token expired, or storage account firewall rules changed.
  3. Cross-tenant shortcuts relying on Entra ID B2B guest access. If guest policy changes on either tenant, shortcuts can break without a prominent Fabric notification.

Triage sequence:

  1. Open the shortcut definition in the lakehouse — Fabric shows a warning icon on broken shortcuts, but only in the lakehouse explorer.
  2. Test the shortcut target independently. Can you access the target lakehouse or storage account directly with the same identity? If not, it’s a permissions issue.
  3. For ADLS shortcuts, check storage account access logs in Azure Monitor. Look for 403 responses from Fabric service IP ranges.

Remediation: Use service principals with dedicated Fabric permissions rather than user identities for shortcuts. Set up a token rotation calendar with 30-day overlap between old and new tokens so you’re never caught by a hard expiration. Then keep a daily shortcut health-check job that reads one row from each shortcut target and validates expected row counts.

Failure mode 4: capacity throttling disguised as five different problems

This one is tricky because it can look like unrelated issues at once. Spark jobs slow down. Metadata syncs lag. Mirroring falls behind. SQL endpoint queries time out. Power BI reports go stale. Troubleshoot each symptom in isolation and you’ll end up looping.

The common thread: your Fabric capacity hit its compute limits and started throttling. Fabric uses a bursting and smoothing model — you can temporarily exceed your purchased capacity units, but that overuse gets smoothed across future time windows. The system recovers by throttling subsequent operations. A heavy Spark job at 10 AM can degrade Power BI performance at 3 PM unless capacity planning accounts for that delayed impact.

Triage sequence:

  1. Open the capacity admin portal and look at the CU consumption graph. Sustained usage above 100% followed by throttling bands is your signal.
  2. Identify top CU consumers. Spark notebooks and materialization operations (Direct Lake refreshes, semantic model processing) tend to be the heaviest. Capacity metrics break this down by workload type.
  3. Check the throttling policy and current throttling state. Fabric throttles interactive workloads first when background usage exceeds limits — meaning end users feel pain from batch jobs they never see.

Remediation: Separate workloads by time window. Push heavy Spark processing to off-peak hours. If you can’t shift the schedule, split workloads across multiple capacities — batch on one, interactive analytics on another. Set CU consumption alerts at 80% of capacity so you get warning before throttling starts.

For bursty Spark demand, also evaluate Spark Autoscale Billing. In the current Fabric model, Autoscale Billing is opt-in per capacity and runs Spark on pay-as-you-go serverless compute, so Spark jobs don’t consume your fixed Fabric CU pool. That makes it a strong option for ad-hoc spikes or unpredictable processing windows where manual SKU up/down management is too slow.

If your workload is predictable, pre-scaling SKU windows (for example, F32 to F64 before a known processing block) can still be effective — just manage cost guardrails and rollback timing tightly.

Assembling the runbook

A playbook works only if it’s accessible and actionable when the alert fires at 2 AM. Here’s how to structure it:

Tier 1 — automated checks (every pipeline cycle):
– Post-write row count validation in every Spark notebook
– Canary row freshness for every mirrored source
– _delta_log timestamp scan across key tables

Tier 2 — daily health checks (scheduled monitoring job):
– Shortcut validation: read one row from every shortcut target
– Capacity CU trending: alert if 7-day rolling average exceeds 70%
– Mirroring table-level lag report (not just aggregate status)

Tier 3 — incident response (when alerts fire):
– Start with capacity metrics. If throttling is active, it’s often the shared root cause behind multi-symptom incidents.
– Check mirroring “Last synced” timestamps. Don’t rely on status icons alone.
– For Spark write issues, verify SQL endpoint sync state independently from the Delta table itself.
– For shortcut errors, test target identity access directly outside of Fabric.

Fabric gives you powerful primitives: Spark at scale, OneLake as a unified data layer, and mirroring that removes a lot of custom ingestion plumbing. With cross-service monitoring and a practical runbook, these patterns become manageable operational events instead of recurring surprises.

This post was written with help from anthropic/claude-opus-4-6

What the new ODBC Driver for Fabric Data Engineering means for your Spark team

The most consequential changes in enterprise data engineering sometimes arrive as a connection string.

On February 19, 2026, Microsoft released the ODBC Driver for Microsoft Fabric Data Engineering in public preview. It’s easy to skim past — connector announcements don’t usually change much. But this one quietly solves a problem that has been frustrating production Spark teams since Fabric launched: how do you run Spark SQL from a normal application, without notebooks, without Spark Job Definitions, without ever opening a browser?

ODBC is how. And the fact that Microsoft reached back to a 34-year-old standard to do it tells you something interesting about where Fabric is heading.

What actually shipped

Let me get specific. The driver is version 1.0.0, ODBC 3.x compliant, and runs on Windows 10/11 and Windows Server 2016+. Under the hood, it talks to Fabric’s Livy APIs. Every query you send through the ODBC interface spins up (or reuses) a Spark session on Fabric’s compute.

That distinction matters. The driver doesn’t bypass Spark. It wraps it. Your SQL statement travels through the ODBC layer, hits the Livy API, and executes as Spark SQL against your Lakehouse. This is not the same as connecting to the SQL Analytics Endpoint, which routes through a different engine entirely.

The session reuse feature deserves attention. If you’ve ever waited 30 to 45 seconds for a Fabric notebook to initialize, you’re familiar with Spark cold-start delays. The driver can hold onto an existing Spark session between queries rather than paying that startup tax every time. Set ReuseSession=true in your connection string, and consecutive queries from the same connection skip the initialization penalty.

Authentication covers five Entra ID flows: Azure CLI for local development, interactive browser for ad-hoc work, client credentials and certificates for service principals, and raw access token support. If your production pipelines already authenticate to other Fabric resources with a service principal, the same credentials work here.

What it doesn’t do: it’s Windows-only in this preview. No Linux, no macOS. It speaks Spark SQL only, not PySpark or the DataFrame API. And it’s a preview — Microsoft can change connection string parameters, error codes, and behavior before GA.

Three groups that should pay close attention

The driver’s value depends entirely on who you are and what you’re trying to connect.

Group one: .NET teams. Before this driver, getting a C# application to run Spark SQL against a Fabric Lakehouse meant either calling the Livy REST API directly (manual session management, custom error handling, lots of boilerplate) or routing through the SQL Analytics Endpoint (different engine, different performance profile, different limitations). Now it’s a connection string and System.Data.Odbc. That’s the kind of simplification that actually changes what people build.

Group two: BI tool users. Excel, legacy reporting platforms, anything that speaks ODBC — they can now connect directly to Spark compute on Fabric. This matters because Spark handles complex types like arrays, maps, and structs natively, plus it processes large analytical workloads differently than the SQL endpoint. If your Lakehouse tables use nested schemas, this driver exposes them directly rather than flattening them.

Group three: platform engineers. If you run Azure DevOps pipelines, GitHub Actions, or custom orchestrators that need to validate data or execute Spark SQL as part of a deployment, the ODBC driver with service principal auth gives you a programmatic, credential-managed path with no UI interaction required. This is what “infrastructure as code” looks like for Spark connectivity.

Trade-offs to plan for

Every feature comes with trade-offs, and it’s worth understanding these before you roll out broadly.

Every ODBC connection that creates a new Spark session consumes Fabric capacity. Imagine ten analysts each open an ODBC connection from their BI tool. That’s ten concurrent Spark sessions, all burning CU seconds. The session reuse feature helps within a single connection, but it doesn’t pool sessions across users. On a shared capacity, CU consumption can add up faster than you’d expect.

Then there’s the timeout problem. Fabric’s Livy sessions have a default idle timeout. If an analyst runs a query, spends eight minutes reading the results, and runs another, the session may have timed out. The next query pays the full cold-start penalty again. For interactive workflows, it’s worth planning for this — users will see variable response times, and understanding why helps set the right expectations.
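
One way to set those expectations in code is to track idle time and warn when the next query will likely pay the cold start. A sketch; the 20-minute default is an assumption for illustration, not Fabric’s documented timeout.

```python
import time


class SessionTracker:
    """Track when the last query ran so callers can predict a cold start.

    idle_timeout_seconds should match your workspace's Livy idle timeout;
    the 20-minute default here is illustrative.
    """

    def __init__(self, idle_timeout_seconds: float = 20 * 60):
        self.idle_timeout = idle_timeout_seconds
        self.last_query_at: float | None = None

    def mark_query(self) -> None:
        """Call after each successful query."""
        self.last_query_at = time.monotonic()

    def likely_cold(self) -> bool:
        """True when the backing Spark session has probably timed out."""
        if self.last_query_at is None:
            return True  # never queried: the first call pays the cold start
        return time.monotonic() - self.last_query_at > self.idle_timeout
```

An application can use likely_cold() to show users a “warming up the engine” message instead of leaving variable latency unexplained.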

The Windows-only constraint creates a real deployment asymmetry. Many data engineering teams develop on macOS or Linux. They can use the JDBC driver locally (which is cross-platform) but can’t use the ODBC driver until they deploy to a Windows CI/CD agent or server. That means some behaviors will only surface in the deployment environment, so factor in extra validation time for Windows-hosted stages.

A rollout checklist for Spark team leads

If you’re evaluating this driver for production, here’s a concrete sequence:

  1. Map your current connectivity. Catalog every application and tool querying your Lakehouse today. Note which ones use the SQL Analytics Endpoint, which call Livy directly, and which use the JDBC driver. The ODBC driver fills gaps — it doesn’t need to replace things that already work.

  2. Benchmark session reuse under your actual patterns. Set ReuseSession=true and run your typical query workload. Measure the difference between first-query latency (cold start) and subsequent-query latency (warm session). If your workload involves long idle gaps between queries, session reuse won’t save you much, and you’ll need to decide whether to accept the latency or build a keep-alive mechanism.

  3. Model the capacity cost before rolling out broadly. For each application or tool that would use the driver, estimate concurrent Spark sessions. Multiply by CU cost per session-hour. Compare this against routing the same queries through the SQL Analytics Endpoint. For simple aggregations on well-structured tables, the SQL endpoint is often cheaper. Reserve the ODBC-to-Spark path for workloads that genuinely need Spark’s capabilities.

  4. Use service principal auth from day one. Azure CLI auth is fine for a proof of concept. In production, configure a dedicated service principal with minimum permissions on your workspace. Store credentials in Azure Key Vault. Personal tokens in pipelines are something you’ll want to migrate away from early.

  5. Abstract the connection layer. Because this is a preview, put the ODBC connection behind an interface in your application code. If you need to fall back to direct Livy API calls or swap in the JDBC driver, you should be able to do that without touching business logic.

  6. Set up session monitoring and alerts. Use the Fabric capacity metrics app or monitoring APIs to track active Spark sessions. Alert if the session count crosses a threshold tied to your CU budget. This catches runaway connections before they become a capacity incident.

  7. Pin the driver version. Download 1.0.0, deploy it to your target machines, and only upgrade after testing the new version against your workloads. Auto-updating preview drivers in production is a risk worth avoiding.
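
The capacity model from step 3 is simple arithmetic once you have estimated the inputs. A sketch in which every number is supplied by you, not by any Fabric pricing API:

```python
def estimated_cu_hours(concurrent_sessions: int,
                       hours_active_per_day: float,
                       cu_per_session_hour: float) -> float:
    """Rough daily CU cost of routing a tool's users through the driver.

    cu_per_session_hour is whatever your capacity's Spark pricing works
    out to per session-hour; all three inputs are your own estimates.
    """
    return concurrent_sessions * hours_active_per_day * cu_per_session_hour


# Example: ten analysts, 6 active hours/day, an assumed 8 CU per
# session-hour -> 480.0 CU-hours/day to weigh against routing the same
# queries through the SQL Analytics Endpoint.
```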

Where this fits in Fabric’s arc

There’s a pattern worth noticing. First Microsoft shipped notebooks. Then Spark Job Definitions. Then the JDBC driver for Java. Now the ODBC driver for everything else. Each release pushes Spark compute further from the Fabric browser UI and closer to the tools and workflows teams already use.

The direction is unmistakable: Microsoft wants Fabric’s Lakehouse queryable from anywhere, through whatever protocol your application already speaks. Two years ago, Spark in Fabric meant opening a browser and writing notebook cells. Today it means passing a connection string to pyodbc or System.Data.Odbc and running SQL from whatever runtime you prefer.

For Spark teams already running in Fabric, the ODBC driver is a pragmatic addition that fills a real connectivity gap. For teams evaluating Fabric, it lowers the integration barrier with existing .NET, Python, and BI toolchains. And for the platform engineers who spend their days wiring systems together, it replaces custom Livy API wrappers with a standard interface that every operating system and language already knows how to talk to.

Sometimes the most interesting changes arrive in the most unremarkable packaging.

This post was written with help from anthropic/claude-opus-4-6

fabric-cicd Is Now Officially Supported — Here’s Your Production Deployment Checklist

Three days ago, Microsoft promoted fabric-cicd from community project to officially supported tool. That Python library your team has been running in a “we’re still figuring out our deployment process” sort of way now carries Microsoft’s name and their support commitment.

That shift matters in three concrete places. First, your compliance team can stop asking “is this thing even supported?” Second, you can open Microsoft support tickets when it breaks. Third, the roadmap is no longer a volunteer effort. Features will land faster. Bugs will get fixed on a schedule.

But here’s where most teams stall. They read the announcement, nod approvingly, and then do absolutely nothing different. The notebook still gets deployed by clicking sync in the browser. The lakehouse GUID is still hardcoded. The “production” workspace is still one bad merge away from serving yesterday’s dev code to the entire analytics team.

An announcement without an execution plan is just news. Let’s build the plan.

What Fabric-CICD Does (and Where It Stops)

Understand the boundaries before you reorganize your deployment story. fabric-cicd is a Python library. You give it a Git repository, a target workspace ID, and a list of item types. It reads the item definitions from the repo, resolves dependencies between them, applies parameter substitutions, and pushes everything to the workspace. It can also remove orphan items that exist in the workspace but no longer appear in your repo.

It supports 25 item types: Notebooks, SparkJobDefinitions, Environments, Lakehouses, DataPipelines, SemanticModels, Warehouses, and 18 others. Every deployment is a full deployment. No commit diffs, no incremental updates. The entire in-scope state gets pushed every time.

Where it stops: it won’t manage your Spark compute sizing, it won’t migrate lakehouse data between environments, and it won’t coordinate multi-workspace transactions atomically. Those gaps are yours to fill. That’s not a weakness. A tool that owns its scope and does it well beats one that covers everything and nails nothing.

Prerequisite Zero: Get Your Git House in Order

This is the part that takes longer than anyone budgets for.

fabric-cicd reads from a Git repository. If your Fabric workspace isn’t connected to one, the tool has nothing to deploy. And plenty of Spark teams are still running workspaces where notebooks were born in the browser, edited in the browser, and will die in the browser without ever touching version control.

Connect your workspace to Azure DevOps or GitHub through Fabric’s Git Integration. Every notebook, every Spark job definition, every environment configuration goes into source control. All of it.

If your repo currently contains items named notebook_v2_final_FINAL_USE_THIS_ONE — and honestly, most of us have been there — now’s the time to clean that up before automating. Automating a disorganized repo just moves the disorganization faster. Getting the foundation right first saves real time down the road.

Your target state when this prerequisite is done: a main branch that mirrors production, feature branches for development work, and a merge strategy the whole team agrees on. fabric-cicd reads from a directory on disk. What it reads needs to be coherent.

The Parameter File: The Single Most Important Artifact

The parameter.yml file is where fabric-cicd learns the difference between your dev environment and production. Without it, you’re deploying identical configurations everywhere, which means your production notebooks will happily point at your dev lakehouse.

For Spark teams, four categories of parameter entries matter:

Default Lakehouse IDs. Every notebook binds to a lakehouse by GUID. In dev, that GUID points to your sandbox with test data. In production, it points to the lakehouse with three months of curated, retention-managed data. The parameter file swaps those GUIDs at deploy time. Miss one, and your production job reads from a lakehouse that got wiped last Tuesday.

Default Lakehouse Workspace IDs. If your production lakehouse lives in a separate workspace from dev (and it should), this mapping covers that scope. Lakehouse GUIDs alone aren’t enough when workspaces differ between environments.

Connection strings. Any notebook that pulls from an external data source needs environment-specific connection details. Hardcoded connection strings are how you end up running your production Spark cluster against a dev SQL database. That kind of mismatch can get expensive quickly — and it’s entirely preventable with proper parameterization.

Notebook parameter cells. Fabric lets you define parameter cells in notebooks. Every value that changes between environments belongs there, referenced by parameter.yml. Not in a comment. Not in a variable halfway down the notebook. In the parameter cell, where the tooling can find it.
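
Put together, a parameter file covering the lakehouse-GUID category might look like the sketch below. The find_replace shape follows fabric-cicd’s documented format, but the GUIDs are placeholders and the environment names are whatever your pipeline passes in; check the project’s documentation for the exact schema in your version.

```yaml
find_replace:
  # Dev lakehouse GUID exactly as it appears in the repo's notebook definitions
  - find_value: "00000000-0000-0000-0000-000000000001"
    replace_value:
      TEST: "00000000-0000-0000-0000-000000000002"
      PROD: "00000000-0000-0000-0000-000000000003"
```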

The mechanism is find-and-replace. fabric-cicd scans your repository files for specific strings and swaps in the values for the target environment. This means the GUIDs in your repo must be consistent. If someone manually edited a lakehouse ID through the browser after a sync, the parameter file won’t catch the mismatch. Deployments will succeed. The notebook will fail. Those are the worst kind of bugs: silent ones.
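
The mechanism is easy to reason about precisely because it is so blunt. An illustrative stand-in (not fabric-cicd’s actual implementation):

```python
def substitute(text: str, replacements: dict[str, str]) -> str:
    """Apply environment-specific GUID swaps the way a find-and-replace
    deployment step does: every occurrence, purely textual, no parsing.
    """
    for find_value, replace_value in replacements.items():
        text = text.replace(find_value, replace_value)
    return text


# Why repo consistency matters: if a GUID in the repo has drifted from the
# one listed in parameter.yml, substitute() silently changes nothing and
# the deployment "succeeds" with the wrong binding.
```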

Build Your Pipeline in Four Stages

Here’s a pipeline structure built for Spark teams, in the order things should execute:

Stage 1: Validate. Run your tests before anything deploys. If you have PySpark unit tests (even five of them), execute them against a local SparkSession or a lightweight Fabric environment. This catches broken imports, renamed functions, and bad type signatures. The goal isn’t 100% test coverage. The goal is catching the obvious failures before they reach a workspace anyone else depends on.

Stage 2: Build. Initialize the FabricWorkspace object with your target workspace ID, environment name, repository path, and scoped item types. For Spark teams, start with ["Notebook", "SparkJobDefinition", "Environment", "Lakehouse"]. Do not scope every item type on day one. Start with the items you deploy weekly. Expand scope after the first month, when you’ve seen how it behaves.

Stage 3: Deploy. Call publish_all_items(). The tool resolves dependency ordering, so if a notebook depends on a lakehouse that depends on an environment configuration, the sequence is handled. After publishing, call unpublish_all_orphan_items() to clean up workspace items that no longer appear in the repo. Skipping orphan cleanup means your workspace accumulates dead items that confuse the team and waste capacity.

Stage 4: Verify. This is the stage teams skip, and the one that saves them. After deployment, run a smoke test against the target workspace. Can the notebook open? Does it bind to the correct lakehouse? Can a lightweight execution complete without errors? A deployment that returns exit code zero but leaves notebooks pointing at a deleted lakehouse is not a successful deployment. Your pipeline shouldn’t treat it as one.
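The verify-stage principle can be reduced to one rule. A minimal sketch, with hypothetical check names; in practice each value would come from an actual smoke test against the target workspace:

```python
# Exit code zero is necessary but not sufficient: a deployment passes
# only if publishing succeeded AND every post-deploy smoke check passed.
def deployment_verified(publish_exit_code, smoke_checks):
    return publish_exit_code == 0 and all(smoke_checks.values())

checks = {
    "notebook_opens": True,
    "lakehouse_binding_resolves": False,  # points at a deleted lakehouse
    "lightweight_run_completes": True,
}
print(deployment_verified(0, checks))  # False: exit code 0 is not enough
```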

Guardrails Worth the Setup Cost

Guardrails turn a pipeline from a deployment mechanism into a safety net. These four are worth the setup time:

Approval gates. Require explicit human approval before any deployment to Production. fabric-cicd won’t enforce this for you. Wire it into your pipeline platform: Azure DevOps release gates, GitHub Actions environments with required reviewers. The first time a broken merge auto-deploys to production, you’ll wish you had spent the twenty minutes setting this up.

Service principal authentication. Run your pipeline under a service principal, not a user account. Give the principal workspace contributor access on the target workspace. Nothing more. When someone leaves the team or changes roles, deployments keep working because they never depended on that person’s credentials.

Tested rollback. Since fabric-cicd does full deployments from the repo, rollback means redeploying the last known-good commit. Conceptually clean. But “conceptually clean” doesn’t help you during an incident when stakeholders need answers fast. Test the rollback. Revert a deployment on a Tuesday afternoon when nothing is on fire. Confirm the workspace returns to its previous state. An untested rollback plan is just a hope, and hopes have a way of failing at the worst possible moment.

Deployment artifacts. Every pipeline run should log which items deployed, which parameters were substituted, and which orphans were removed. When production breaks and someone asks “what changed since yesterday?”, the answer should take thirty seconds, not three hours of comparing workspace states by hand.
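A sketch of what that logging can look like: one JSON manifest per pipeline run, so answering "what changed since yesterday?" is a file diff rather than archaeology. The field names and values here are illustrative.

```python
# Write a per-run deployment manifest recording what was deployed,
# which parameters were substituted, and which orphans were removed.
import json
from datetime import datetime, timezone

def write_manifest(path, deployed_items, substituted_params, removed_orphans):
    manifest = {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "deployed_items": deployed_items,
        "substituted_params": substituted_params,
        "removed_orphans": removed_orphans,
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

m = write_manifest(
    "deploy-manifest.json",
    deployed_items=["nb_ingest", "nb_transform"],
    substituted_params={"lakehouse_id": "prod-guid-placeholder"},
    removed_orphans=["nb_old_experiment"],
)
```

Archive the manifest as a pipeline artifact alongside the run logs.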

Spark-Specific Problems Nobody Warns You About

General CI/CD guidance covers the broad strokes. Spark teams hit problems that live in the details:

Lakehouse bindings are buried in notebook content. The notebook-content.py file contains lakehouse and workspace GUIDs. If your parameter.yml misses even one of these, the production notebook opens to a “lakehouse not found” error. Audit every notebook, including the utility notebooks that other notebooks call with %run. Those hidden dependencies are where the bindings go wrong.
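That audit can be partially automated. A sketch that scans notebook content for GUIDs and flags any not covered by your parameter file; the file contents and GUIDs below are made up:

```python
# Flag GUIDs in notebook content that are absent from the parameter
# file's replacement keys -- candidates for a broken binding.
import re

GUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

def unparameterized_guids(notebook_text, parameterized_guids):
    found = set(GUID_RE.findall(notebook_text))
    return sorted(found - set(parameterized_guids))

content = 'lakehouse = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"'
print(unparameterized_guids(content, []))  # flagged: not in parameter file
```

Run it over every notebook-content.py in the repo, including the %run utility notebooks.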

Environment items gate notebook execution. When your Spark notebooks depend on a custom Environment with specific Python libraries or Spark configuration properties, that Environment must exist in the target workspace before the notebooks arrive. The fabric-cicd dependency resolver handles this automatically, but only if Environment is in your item_type_in_scope. Scope just Notebook without Environment, and you’ll get clean deployments followed by runtime failures when the expected libraries don’t exist.

SparkJobDefinitions are not notebooks. SJDs carry executor counts, driver memory settings, reference files, and command-line arguments. All environment-specific values in these properties need coverage in your parameter file. Teams that parameterize their notebooks thoroughly and forget about their SJDs discover the gap when a production batch job runs with dev-sized executors and takes four times longer than expected.

Full deployment at scale needs scoping. Fifty notebooks deploy in minutes. Three hundred notebooks take longer and increase your blast radius. If your workspace has grown large, segment your repository by domain or narrow item_type_in_scope per pipeline to keep deployment times predictable and failures contained to a known set of items.

A Four-Week Migration Path

Starting from zero, here’s a timeline that’s aggressive but achievable:

Week 1: Git integration. Connect your workspace to source control. Rename items that need renaming. Agree on a branching strategy with the team. Write it down. Nothing deploys this week. This is foundation work, and skipping it makes everything after it harder.

Week 2: First deployment. Install fabric-cicd, write your initial parameter.yml, and run a deployment to a test workspace from the command line. Intentionally break the lakehouse binding in the parameter file. See what the error looks like. Fix it. Run it again. You want the team to recognize deployment failures before they encounter one under pressure.

Week 3: Pipeline construction. Build the CI/CD pipeline for Dev-to-Test promotion. Add approval gates, service principal auth, logging, and the verify stage. Run the pipeline ten times. Deliberately introduce a bad merge and watch the pipeline catch it. If it doesn’t catch it, fix the pipeline.

Week 4: Production extension. Extend the pipeline to include Production as a target. Add smoke tests. Test your rollback procedure. Write the runbook. Walk the team through it. Make sure at least two people can operate the pipeline without you in the room.

Four weeks. Not a quarter. Not a planning exercise that stalls in sprint three. A month of focused, methodical work that moves your Spark team from manual deployment to a process that runs the same way every time, whether it’s Tuesday at noon or Saturday at midnight.

The Real Takeaway

Microsoft giving fabric-cicd the official stamp means enterprise teams can stop hesitating. The library will get more attention, faster bug fixes, and broader item type support going forward.

But the tool is only half the story. A perfectly automated pipeline that deploys unparameterized notebooks to the wrong lakehouse is worse than manual deployment, because at least manual deployment forces someone to look at what they’re pushing. Automation works best when it’s built on a disciplined foundation — the checklist, the parameter file, the tested rollback, the verify stage.

Build the checklist. Work the checklist. Invest in the hard parts now, and they’ll pay you back in every deployment after.

This post was written with help from anthropic/claude-opus-4-6

The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

There’s a connector that ships with every Fabric Spark runtime. It’s pre-installed. It requires no setup. And it lets your Spark notebooks read from—and write to—Fabric Data Warehouse tables as naturally as they read Delta tables from a Lakehouse.

Most Fabric Spark users don’t know it exists. The ones who do often run into the same three or four surprises. Let’s fix both problems.

What the connector actually is

The Spark connector for Fabric Data Warehouse (synapsesql) is a built-in extension to the Spark DataFrame API. It uses the TDS protocol to talk directly to the SQL engine behind your Warehouse (or the SQL analytics endpoint of a Lakehouse). You get read and write access to Warehouse tables from PySpark, Scala Spark, or Spark SQL.

One line of code to read:

import com.microsoft.spark.fabric
from com.microsoft.spark.fabric.Constants import Constants

df = spark.read.synapsesql("my_warehouse.dbo.sales_fact")


One line to write:

df.write.mode("append").synapsesql("my_warehouse.dbo.sales_fact")


No connection strings. No passwords. No JDBC driver management. Authentication flows through Microsoft Entra—same identity you’re logged into your Fabric workspace with. The connector resolves the SQL endpoint automatically based on workspace context.

That’s the happy path. Now let’s talk about what actually happens when you use it.

Reading: the part that mostly just works

Reading from a Warehouse table into a Spark DataFrame is the connector’s strength. The synapsesql() call supports the full three-part naming convention: warehouse_name.schema_name.table_or_view_name. It works for tables and views, including views with joins across schemas.

A few things that are genuinely useful:

Predicate pushdown works. When you chain .filter() or .limit() onto your DataFrame, the connector pushes those constraints to the SQL engine. You’re not pulling the full table into Spark memory and then filtering—the SQL engine handles the filter and sends back the subset. This matters when your Warehouse tables have hundreds of millions of rows and you only need a time-sliced sample.

df = spark.read.synapsesql("my_warehouse.dbo.sales_fact") \
    .filter("order_date >= '2026-01-01'") \
    .select("order_id", "customer_id", "amount")


Cross-workspace reads work. If your Warehouse lives in a different workspace than your notebook’s attached Lakehouse, you pass the workspace ID:

df = spark.read \
    .option(Constants.WorkspaceId, "<target-workspace-id>") \
    .option(Constants.DatawarehouseId, "<warehouse-item-id>") \
    .synapsesql("my_warehouse.dbo.sales_fact")


This is genuinely powerful for hub-and-spoke architectures where your curated Warehouse sits in a production workspace and your data science notebooks live in a sandbox workspace.

Parallel reads are available. For large tables, you can partition the read across multiple Spark tasks, similar to spark.read.jdbc:

df = spark.read \
    .option("partitionColumn", "order_id") \
    .option("lowerBound", 1) \
    .option("upperBound", 10000000) \
    .option("numPartitions", 8) \
    .synapsesql("my_warehouse.dbo.sales_fact")


This splits the query into eight parallel reads, each fetching a range of order_id. Without this, you get a single-threaded read that will bottleneck on large tables.
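The arithmetic behind that split is easy to see. A simplified sketch of how a bounded column divides into per-task ranges; the real connector also handles NULLs and boundary rows, so treat this as illustration only:

```python
# Illustrative range splitting for a partitioned read:
# numPartitions=8 over order_id 1..10,000,000 yields eight contiguous
# ranges, one per Spark task.
def partition_ranges(lower, upper, num_partitions):
    stride = (upper - lower + 1) // num_partitions
    ranges = []
    start = lower
    for i in range(num_partitions):
        end = upper if i == num_partitions - 1 else start + stride - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

print(partition_ranges(1, 10_000_000, 8)[0])   # (1, 1250000)
print(partition_ranges(1, 10_000_000, 8)[-1])  # (8750001, 10000000)
```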

Security models pass through. If your Warehouse has object-level security (OLS), row-level security (RLS), or column-level security (CLS), those policies are enforced when Spark reads the data. Your notebook sees exactly what your identity is authorized to see. This is a meaningful difference from reading Delta files directly via OneLake, where security operates at the workspace or folder level.

Custom T-SQL queries work too. You’re not limited to reading tables—you can pass arbitrary T-SQL:

df = spark.read \
    .option(Constants.DatabaseName, "my_warehouse") \
    .synapsesql("SELECT TOP 1000 * FROM dbo.sales_fact WHERE region = 'WEST'")


This is handy for complex aggregations or when you want the SQL engine to do the heavy lifting before data enters Spark.

Writing: the part with surprises

Write support for the Spark-to-Warehouse connector became generally available with Runtime 1.3. It works, and it solves a real architectural problem—but it has mechanics you need to understand.

How writes actually work under the hood

The connector uses a two-phase process:

  1. Stage: Spark writes your DataFrame to intermediate Parquet files in a staging location.
  2. Load: The connector issues a COPY INTO command, telling the Warehouse SQL engine to ingest from the staged files.

This is the same COPY INTO pattern that powers bulk ingestion into Fabric Data Warehouse generally. It’s optimized for throughput. It is not optimized for latency on small writes.

If you’re writing a DataFrame with 50 rows, the overhead of staging files and issuing COPY INTO means the write takes materially longer than you’d expect. For small, frequent writes, this connector is not the right tool. Use T-SQL INSERT statements through a SQL connection instead.

For batch writes of thousands to millions of rows, the connector performs well. The COPY INTO path is what the Warehouse was designed for.

Save modes

The connector supports four save modes:

  • errorifexists (default): Fails if the table already exists.
  • ignore: Silently skips the write if the table exists.
  • overwrite: Drops and recreates the table with new data.
  • append: Adds rows to the existing table.

df.write.mode("overwrite").synapsesql("my_warehouse.dbo.daily_aggregates")


A common pattern: Spark computes daily aggregations from Lakehouse Delta tables, then writes the results to a Warehouse table that Power BI reports connect to. The Warehouse’s result set caching (now generally available as of January 2026) means subsequent Power BI refreshes hit cache instead of recomputing.

The timestamp_ntz gotcha

This is the single most common error people hit when writing to a Warehouse from Spark.

If your DataFrame contains timestamp_ntz (timestamp without time zone) columns, the write will fail. Fabric Data Warehouse expects time-zone-aware timestamps. The fix is a cast before you write:

from pyspark.sql.functions import col

# Compute the dtype map once, then cast any timestamp_ntz column
dtypes = dict(df.dtypes)
for c in df.columns:
    if dtypes[c] == "timestamp_ntz":
        df = df.withColumn(c, col(c).cast("timestamp"))

df.write.mode("append").synapsesql("my_warehouse.dbo.target_table")


This is not documented prominently enough. If you see a Py4JJavaError during write that mentions type conversion, timestamps are the first thing to check.

What you can’t write to

The connector writes to Warehouse tables only. You cannot write to the SQL analytics endpoint of a Lakehouse—it’s read-only. If you try, you’ll get an error. This seems obvious but trips people up because the same synapsesql() method handles both reads from Warehouses and Lakehouse SQL endpoints.

Private Link limitations

If Private Link is enabled at the workspace level, both read and write operations through the connector are unsupported. If Private Link is enabled at the tenant level only, writes are unsupported but reads still work. This is a significant limitation for security-conscious deployments. Check your network configuration before building pipelines that depend on this connector.

Time Travel is not supported

Fabric Data Warehouse now supports Time Travel queries. However, the Spark connector does not pass through Time Travel syntax. If you need to query a table as of a specific point in time, you’ll need to use a T-SQL connection directly rather than the synapsesql() method.

When to use Warehouse vs. Lakehouse as your serving layer

This is the architectural question that the connector’s existence forces you to answer. You’ve got data in your Lakehouse. Spark has transformed it. Now where does it go?

Use Lakehouse Delta tables when:

  • Your consumers are other Spark notebooks or Spark-based ML pipelines.
  • You need schema evolution flexibility (Delta’s schema merge).
  • You want to use OPTIMIZE, VACUUM, and Z-ORDER for table maintenance.
  • Your data scientists need direct file access through OneLake APIs.

Use Warehouse tables when:

  • Your primary consumers are Power BI reports or T-SQL analysts.
  • You need the Warehouse’s result set caching for repeated query patterns.
  • You need fine-grained security (RLS, CLS, OLS) that passes through to all consumers.
  • You want to use T-SQL stored procedures, views, and MERGE statements for downstream transformations.
  • You need cross-database queries that join Warehouse tables with Lakehouse tables or other Warehouse tables.

Use both when:

  • Spark processes and stores data in the Lakehouse (bronze → silver → gold medallion layers), then the connector writes final aggregations or serving tables to the Warehouse.
  • The Warehouse serves as the “last mile” between your data engineering work and your business intelligence layer.

The January 2026 GA of MERGE in Fabric Data Warehouse makes the “write to Warehouse” pattern significantly more useful. You can now do incremental upserts: Spark writes a staging table, then a T-SQL MERGE reconciles it with the target. This is a common pattern in data warehousing that was previously awkward in Fabric.

A concrete pattern: Spark ETL → Warehouse serving layer

Here’s the pattern I see working well in production:

# 0. Imports for the transforms below
from pyspark.sql.functions import col, count, sum

# 1. Read from Lakehouse Delta tables (Spark native)
bronze = spark.read.format("delta").load("Tables/raw_orders")

# 2. Transform in Spark
silver = bronze.filter(col("status") != "cancelled") \
    .withColumn("order_date", col("order_ts").cast("date")) \
    .withColumn("amount_usd", col("amount") * col("fx_rate"))

gold = silver.groupBy("region", "order_date") \
    .agg(
        count("order_id").alias("order_count"),
        sum("amount_usd").alias("total_revenue")
    )

# 3. Write to Warehouse for Power BI consumption
gold.write.mode("overwrite").synapsesql("analytics_warehouse.dbo.daily_revenue")


The Lakehouse owns the raw and transformed data. Spark does the heavy compute. The Warehouse serves the final tables to downstream consumers with T-SQL access, caching, and fine-grained security.

The alternative—writing gold tables to the Lakehouse and having Power BI connect via the SQL analytics endpoint—also works. But the SQL analytics endpoint has a metadata sync delay after Spark writes new data. The Warehouse table is immediately consistent after the COPY INTO completes. If your reporting needs to reflect the latest pipeline run without a sync lag, the Warehouse path is more reliable.

Cross-database queries: the glue between them

Once you have data in both a Lakehouse and a Warehouse in the same workspace, you can query across them using T-SQL cross-database queries from the Warehouse:

SELECT w.customer_id, w.total_revenue, l.customer_segment
FROM analytics_warehouse.dbo.daily_revenue AS w
JOIN my_lakehouse.dbo.customer_dim AS l
    ON w.customer_id = l.customer_id


This means your Warehouse doesn’t need to contain all the data. It can hold the curated aggregations while joining against dimension tables that live in the Lakehouse. No data movement. No duplication. The SQL engine resolves both sources through OneLake.

Performance notes from the field

A few observations from real workloads:

Reads are faster than you expect. The TDS protocol connection to the Warehouse SQL engine is efficient. For typical analytical queries returning thousands to low millions of rows, the synapsesql() read is competitive with reading Delta files directly, especially when the Warehouse has statistics and result set caching enabled.

Writes are slower than Lakehouse writes. The two-phase staging + COPY INTO process adds overhead versus a direct df.write.format("delta").save() to Lakehouse tables. For a DataFrame with 10 million rows, expect the Warehouse write to take 2-5x longer than an equivalent Lakehouse Delta write. This is the tradeoff for getting immediate T-SQL access with full Warehouse capabilities.

Use parallel reads for large tables. The default single-partition read will bottleneck. Set numPartitions to match your Spark cluster’s available cores for large reads. The performance improvement is often 4-8x.

Proactive and incremental statistics refresh. As of January 2026, Fabric Data Warehouse supports proactive statistics refresh and incremental statistics. This means the query optimizer keeps statistics up to date automatically. Your synapsesql() reads benefit from better query plans without manual UPDATE STATISTICS calls.

The honest summary

The Spark connector for Fabric Data Warehouse is a well-designed bridge between two systems that many teams use side by side. It makes the read path simple and the write path possible without leaving your Spark notebook.

It is not a replacement for writing to Lakehouse Delta tables. It is an additional output path for when your downstream consumers need T-SQL, fine-grained security, result set caching, or immediate consistency. Use it when the Warehouse is the right serving layer. Don’t use it when Lakehouse is sufficient.

The biggest wins come from combining both: Spark for compute, Lakehouse for storage, Warehouse for serving. The connector is the plumbing that makes that architecture work without data pipelines in between.

If you’re heading to FabCon Atlanta (March 16-20, 2026), both the Data Warehouse and Data Engineering teams will be there. It’s a good place to pressure-test your architecture and see what’s coming next.


This post was written with help from anthropic/claude-opus-4-6

Fabric Spark billing just got clearer. Here’s how to make the most of it.

Somewhere in a shared Teams channel, a Fabric capacity admin is looking at the Capacity Metrics app and noticing Spark consumption is down 15% overnight. Same notebooks. Same schedules. Same engineers shipping code with the same amount of caffeine.

A quick thread later, the answer is clear: nothing is wrong. Microsoft introduced new billing operations, and AI usage is now visible in its own category.

That’s not a cost increase. That’s better instrumentation.

What actually changed

On February 13, 2026, Microsoft announced two new billing operations for Fabric: AI Functions and AI Services.

Previously, AI-related usage in notebooks was grouped under Spark operations. Calls made through fabric.functions, Azure OpenAI REST API, the Python SDK, and SynapseML were all reported in Spark. Text Analytics and Azure AI Translator calls from notebooks were also reflected there.

Now those costs are separated:

  • AI Functions covers Fabric AI function calls and Azure OpenAI Service usage in notebooks and Dataflows Gen2.
  • AI Services covers Text Analytics and Azure AI Translator usage from notebooks.

Both are billed under the Copilot and AI Capacity Usage CU meter.

Important: consumption rates did not change. You pay the same for the same work. What changed is visibility.

Why this reporting update is a win for operators

If you’ve ever tried to explain Spark trends that include hidden AI consumption, this update helps immediately.

Picture an F64 capacity. You historically allocated 70% of CU budget to Spark because that’s what Capacity Metrics showed. But Spark previously included AI consumption, so the category was doing two jobs at once.

Now Spark and AI can each tell their own story. That’s useful for:

  • more accurate workload attribution
  • cleaner alerting by operation type
  • better planning conversations with finance and platform teams

In other words: same total spend, sharper signal.

The migration checklist

There’s nothing to deploy and no code changes required. The opportunity is operational: update your monitoring and planning so you can benefit from the new detail right away.

1. Audit your AI function usage

Before the new operations appear in your Metrics app, find AI calls in your codebase. Search notebooks for:

  • fabric.functions calls
  • Azure OpenAI REST API calls (look for /openai/deployments/)
  • openai Python SDK usage within Fabric notebooks
  • SynapseML OpenAI transformers
  • Text Analytics API calls
  • Azure AI Translator calls

If there are no hits, this billing split likely won’t affect your current workloads. If there are many hits (common in mature notebook estates), estimate volume now so your post-change analysis is faster.
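The search itself is a few lines. A sketch using plain substring patterns over notebook sources; the notebook names and code snippets below are invented:

```python
# Grep notebook sources for the AI call patterns listed above.
AI_PATTERNS = [
    "fabric.functions",
    "/openai/deployments/",
    "import openai",
    "Text Analytics",
]

def find_ai_calls(notebook_sources):
    """Map notebook name -> list of matched AI usage patterns."""
    hits = {}
    for name, source in notebook_sources.items():
        matched = [p for p in AI_PATTERNS if p in source]
        if matched:
            hits[name] = matched
    return hits

sources = {
    "churn_scoring": "resp = client.post(url + '/openai/deployments/gpt')",
    "daily_etl": "df = spark.read.format('delta').load('Tables/raw')",
}
print(find_ai_calls(sources))  # only churn_scoring matches
```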

2. Baseline your current Spark consumption

Export the last 30 days of Capacity Metrics data for Spark operations and save it.

This is your before-state. After rollout, validate that total consumption (Spark + new AI operations) aligns with historical Spark totals. If it aligns, you’ve confirmed a reporting change. If not, you have a clear starting point for investigation.

3. Adjust your alerting thresholds

If you monitor Spark CU consumption via Capacity Metrics, Azure Monitor, or custom API polling, update thresholds after the split.

Recommended approach:

  • take your current Spark threshold
  • subtract estimated AI consumption from step 1
  • set that as the revised Spark threshold
  • add a separate alert for the Copilot and AI meter

If AI estimates are still rough, start with a conservative threshold and tune after a few weeks of separated data.
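The threshold arithmetic from the steps above, as a sketch. The CU figures are made up for illustration; plug in your own numbers from step 1:

```python
# Split one Spark threshold into a revised Spark threshold plus a new
# Copilot and AI threshold, per the recommended approach above.
def revised_thresholds(current_spark_threshold, estimated_ai_cu):
    return {
        "spark": current_spark_threshold - estimated_ai_cu,
        "copilot_and_ai": estimated_ai_cu,
    }

# e.g. a 10,000 CU Spark alert with ~1,500 CU of embedded AI usage
print(revised_thresholds(10_000, 1_500))
# {'spark': 8500, 'copilot_and_ai': 1500}
```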

4. Update your capacity planning models

Add a dedicated row for AI consumption in any spreadsheet, Power BI report, or planning document that allocates CU budget by operation type.

The Copilot and AI Capacity Usage CU meter already existed for Copilot scenarios, but this may be the first time many Spark-first teams see meaningful workload usage there. Adding it now makes future reviews easier.

5. Set up a validation window

Choose a date after March 17 (when the new operations start appearing) and compare pre/post totals:

  • pre-change: Spark total
  • post-change: Spark + AI Functions + AI Services

Expect close alignment (allowing for normal workload variation and rounding). If variance is more than a few percent, open a support ticket. Microsoft described this as a reporting-only change with no rate modifications.
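The step-5 comparison is simple enough to automate. A sketch with invented CU totals; a result within a few percent confirms a reporting-only change:

```python
# Post-change Spark + AI Functions + AI Services should roughly equal
# the pre-change Spark total for an equivalent workload window.
def billing_variance_pct(pre_spark_total, post_spark,
                         post_ai_functions, post_ai_services):
    post_total = post_spark + post_ai_functions + post_ai_services
    return abs(post_total - pre_spark_total) / pre_spark_total * 100

v = billing_variance_pct(10_000, 8_400, 1_200, 350)
print(round(v, 2))  # 0.5 -- within normal variation
```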

6. Share a quick team note before questions start

One short update prevents a lot of confusion:

“Microsoft is separating AI consumption from Spark billing into dedicated operations. Total cost is unchanged. Spark will appear lower, and Copilot and AI will appear higher. This improves visibility and tracking.”

That gives engineers context and helps finance teams interpret new categories correctly on day one.

Post-rollout checks that keep things clean

Consumption variance check. If post-change totals (Spark + AI Functions + AI Services) differ significantly from pre-change Spark trends, compare equivalent workload windows and rule out schedule, code, or capacity changes.

Expected operation visibility. If you confirmed AI usage in step 1 but AI Functions shows zero, check regional rollout timing from the Fabric blog before escalating.

Why separated AI spend is valuable

This platform-side categorization update gives teams a better lens on where capacity is being used.

Once AI usage is measurable independently, you can answer higher-quality questions:

  • Which AI workflows are creating the most value per CU?
  • Which calls are production-critical versus experimental leftovers?
  • Where should you optimize first for performance and cost?

That is exactly the kind of visibility mature platform teams want.

What this signals about Fabric billing

As Fabric workloads evolve, billing categories will continue to become more descriptive. That’s a good thing. Better category design means better operational decisions.

The admin in that Teams thread got clarity quickly: Spark wasn’t shrinking, observability was improving. Once the team updated dashboards and alerts, they had a more useful capacity model than they had the week before.

That’s the real upgrade here.


This post was written with help from anthropic/claude-opus-4-6

From Demo to Production: ML-Enriched Power BI in Microsoft Fabric

Microsoft published a new end-to-end pattern last week. Train a model inside Fabric. Score it against a governed semantic model. Push predictions straight into Power BI. No data exports. No credential juggling.

The blog post walks through a churn-prediction scenario. Semantic Link pulls data from a governed Power BI semantic model. MLflow tracks experiments and registers models. The PREDICT function runs batch inference in Spark. Real-time endpoints serve predictions through Dataflow Gen2. Everything lives in one workspace, one security context, one OneLake.

It reads well. It demos well.

But demo code is not production code. The gap between “it runs in my notebook” and “it runs every Tuesday at 4 AM without paging anyone” is exactly where Fabric Spark teams bleed time.

This is the checklist for crossing that gap.

Prerequisites that actually matter

The official blog assumes a Fabric-enabled workspace and a published semantic model. That is the starting line. Production is a different race.

Capacity planning comes first. Fabric Spark clusters consume capacity units. A batch scoring job running on an F64 during peak BI refresh hours competes for the same CUs your report viewers need. Run scoring in off-peak windows, or provision a separate capacity for data science workloads. Either way, know your CU ceiling before your first experiment. Discovering your scoring job throttles the CFO’s dashboard refresh is not a conversation you want to have.

Workspace isolation is not optional. Dev, test, prod. Semantic models promoted through deployment pipelines. ML experiments pinned to dev. Registered models promoted to prod only after validation passes. If your team trains models in the same workspace where finance runs their quarterly close dashboard, you are one accidental publish away from explaining why the revenue numbers just changed.

MLflow model signatures must be populated from day one. The PREDICT function requires them. No signature, no batch scoring. This constraint is easy to forget during prototyping and expensive to fix later. Make it a rule: every mlflow.sklearn.log_model call passes a signature built with mlflow.models.infer_signature. No exceptions. Write a pre-commit hook if you have to.

Semantic Link: the part most teams underestimate

Semantic Link connects your Power BI semantic model to your Spark notebooks. Call fabric.read_table() and you get governed data. Same measures and definitions your business users see in their reports. The data in your model’s training set matches what shows up in Power BI.

This matters more than it sounds.

Every analytics team that has been around long enough has a story about metric inconsistency. “Active customer” means one thing in the DAX model, another thing in the SQL pipeline, and a third thing in the data scientist’s Python notebook. The numbers diverge. Somebody notices. A week of forensic reconciliation follows.

Semantic Link kills that problem at the root. But only if you use it deliberately.

Start with fabric.list_measures(). Audit what DAX measures exist. Understand which ones your model depends on. Then pull data with fabric.read_table() rather than querying lakehouse tables directly. When you need to engineer features beyond what the semantic model provides, document every derivation in a version-controlled notebook. Written down and committed. Not living in someone’s memory or buried in a thread.

Training guardrails worth building

The Fabric blog shows a clean LightGBM training flow with MLflow autologging. That is the happy path. Production needs the unhappy path covered too.

Validate data before training. Check row counts against expected baselines. Check for null spikes in key columns. Check that the class distribution has not shifted beyond your predefined threshold. A model trained on corrupted or stale data produces confident garbage. Confident garbage is worse than no model at all, because people act on it.
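A sketch of those gates as a pure function. The baselines and thresholds are illustrative; yours should come from historical runs of the actual pipeline:

```python
# Pre-training validation: row count, null spike, and class-distribution
# shift checks. Returns a list of errors; an empty list means proceed.
def validate_training_data(row_count, null_fraction, positive_rate,
                           min_rows=100_000, max_null_fraction=0.02,
                           baseline_positive_rate=0.08, max_shift=0.03):
    errors = []
    if row_count < min_rows:
        errors.append(f"row count {row_count} below baseline {min_rows}")
    if null_fraction > max_null_fraction:
        errors.append(f"null spike in key columns: {null_fraction:.1%}")
    if abs(positive_rate - baseline_positive_rate) > max_shift:
        errors.append(f"class distribution shift: {positive_rate:.1%}")
    return errors

# A healthy snapshot passes; a corrupted one fails loudly before training.
print(validate_training_data(250_000, 0.01, 0.09))   # []
print(validate_training_data(50_000, 0.10, 0.30))    # three errors
```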

Tag every experiment run. MLflow in Fabric supports custom tags. Use them aggressively. Tag each run with the semantic model version it pulled from, the notebook commit hash, and the data snapshot date. Three months from now, when a stakeholder asks why the model flagged 200 customers as high churn risk and zero of them actually left, you need to reconstruct exactly what happened. Without tags, you are guessing.

Build a champion-challenger gate. Before any new model version reaches production, it must beat the current model on a holdout set from the most recent data. Not any holdout set. The most recent one. Automate this comparison in a validation notebook that runs as a pipeline step before model registration. If the challenger fails to clear the margin you defined upfront, the pipeline halts. No override button. No “let’s just push it and see.” The gate exists to prevent optimism from substituting for evidence.

Batch scoring: the PREDICT function in production

Fabric’s PREDICT function is straightforward. Pass a registered MLflow model and a Spark DataFrame. Get predictions back. It supports scikit-learn, LightGBM, XGBoost, CatBoost, ONNX, PyTorch, TensorFlow, Keras, Spark, Statsmodels, and Prophet.

The production requirements are few but absolute.

Write predictions to a delta table in OneLake. Not to a temporary DataFrame that dies with the session. Partition that table by scoring date. Add a column for the model version that generated each row. This is your audit trail. When someone asks “why did customer 4471 show as high risk last Tuesday?”, you pull the partition, check the model version, and have an answer in minutes. Without that structure, the same question costs you a day.

Chain your scoring job to run after your semantic model refresh. Sequence matters. If the model scores data from the prior refresh cycle, your predictions are one step behind reality. Use Fabric pipelines to enforce the dependency explicitly. Refresh completes, scoring starts.

Real-time endpoints: know exactly what you are signing up for

Fabric now offers ML model endpoints in preview. Activate one from the model registry. Fabric spins up managed containers and gives you a REST API. Dataflow Gen2 can call the endpoint during data ingestion, enriching rows with predictions in flight.

The capability is real. The constraints are also real.

Real-time endpoints support a limited set of model flavors: Keras, LightGBM, scikit-learn, XGBoost, and (since January 2026) AutoML-trained models. PyTorch, TensorFlow, and ONNX are not supported for real-time serving. If your production model uses one of those frameworks, batch scoring is your only path.

The auto-sleep feature deserves attention. Endpoints scale capacity to zero after five minutes without traffic. The first request after sleep incurs a cold-start delay while containers spin back up. For use cases that need consistent sub-second latency, you have two options: disable auto-sleep and accept the continuous capacity cost, or send periodic synthetic requests to keep the endpoint warm.
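If you take the keep-warm route, the decision is simple arithmetic against the five-minute window. The safety margin below is an illustrative choice; the actual POST to the endpoint's REST API (URL, auth, payload) is omitted here.

```python
IDLE_LIMIT_S = 300  # endpoint auto-sleeps after 5 minutes without traffic

def ping_due(last_request_ts: float, now_ts: float,
             margin_s: float = 60.0) -> bool:
    """True when a synthetic request should go out to keep the endpoint warm,
    i.e. when idle time is approaching the auto-sleep window."""
    return (now_ts - last_request_ts) >= (IDLE_LIMIT_S - margin_s)

# A scheduled job would check ping_due(...) and, when true, POST a tiny
# payload to the endpoint so the containers never scale to zero.
```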

The word “preview” is load-bearing here. Preview means the API can change between updates. Preview means SLAs are limited. Preview means you need a batch-scoring fallback in place before you route any production workflow through a real-time endpoint. Build the fallback first. Test it. Then add the real-time path as an optimization on top.

The rollback plan you need to write before you ship

Most teams build forward. They write the training pipeline, the scoring job, the endpoint, the Power BI report that consumes predictions. Then they ship.

Nobody writes the backward path. Until something goes wrong.

Your rollback plan has three parts.

First, keep at least two prior model versions in the registry. If the current version starts producing bad predictions, you roll back by updating the model alias. One API call. The scoring pipeline picks up the previous version on its next run.
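The alias-based rollback assumes the scoring job resolves the model through an alias rather than a pinned version number; the alias and model names below are assumptions. MLflow's registry supports this directly (aliases landed in MLflow 2.3):

```python
# One-call rollback via MLflow alias (requires mlflow and registry access):
#   from mlflow import MlflowClient
#   client = MlflowClient()
#   client.set_registered_model_alias("churn-model", "champion", version="2")

def pick_rollback_version(versions: list, bad_version: int) -> int:
    """Choose the newest registry version older than the misbehaving one."""
    candidates = [v for v in versions if v < bad_version]
    if not candidates:
        raise ValueError("no prior version available -- keep two in the registry")
    return max(candidates)
```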

Second, partition prediction tables by date and model version. Rolling back a model means nothing if downstream reports still display the bad predictions. With partitioned tables, you can delete or filter out the scoring run from the misbehaving version and revert to the prior run’s output.
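With that partitioning in place, the revert is a single targeted `DELETE`. This sketch builds the Delta SQL; the table and column names follow the scheme described above and are assumptions about your schema.

```python
def revert_scoring_run_sql(table: str, model_version: int,
                           scoring_date: str) -> str:
    """Delta SQL that removes the scoring run from a misbehaving version,
    leaving the prior version's partition intact."""
    return (
        f"DELETE FROM {table} "
        f"WHERE model_version = {model_version} "
        f"AND scoring_date = DATE'{scoring_date}'"
    )

sql = revert_scoring_run_sql("predictions.churn_scores", 3, "2025-06-03")
# Execute with spark.sql(sql) in a Fabric notebook; downstream reports then
# fall back to the previous version's output.
```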

Third, a kill switch for real-time endpoints. One API call to deactivate the endpoint. Traffic falls back to the latest batch-scored Delta table. Your Power BI report keeps working, just without real-time enrichment, while you figure out what went wrong.

Test this plan. Not on paper. Run the rollback end to end in your dev environment. Time it. If reverting to a stable state takes longer than fifteen minutes, your plan is too complicated. Simplify it until the timer clears.

Ship it

The architecture Microsoft described is sound. Semantic Link for governed data access. MLflow for experiment tracking and model registration. PREDICT for batch scoring to OneLake. Real-time endpoints for low-latency enrichment. Power BI consuming prediction tables through DirectLake or import.

But architecture alone does not keep a system running at 4 AM. The capacity plan does. The workspace isolation does. The data validation gate, the champion-challenger check, the scoring sequence, the endpoint fallback, the rollback drill. Those are what separate a demo from a service.

Do the checklist. Test the failure modes. Then ship.


This post was written with help from anthropic/claude-opus-4-6