What “Upgrade your Synapse pipelines to Microsoft Fabric with confidence (Preview)” actually means for Fabric Spark teams in production

Preview posts are written to soothe. Production teams read them like incident reviewers. They want to know what moves, what stays off, and what still needs proof before anyone re-enables a trigger.

This new migration experience is useful because it has brakes.

It lets you assess Synapse pipelines, see compatibility gaps, migrate supported pipelines into a Fabric workspace, map Synapse linked services to Fabric connections, and keep execution under control while you validate the result. That is not a one-click estate conversion. Good. One-click migration promises are how people end up explaining themselves on a call at 6 a.m.

This is triage before it is migration

The flow is split into three stages: assessment, review, and migration.

Assessment classifies each pipeline as Ready, Needs review, Coming soon, or Unsupported / Not compatible. You can export the assessment to CSV, which is more useful than it sounds. Most Synapse estates are not clean enough to reason about from memory. The CSV gives you a working list you can sort, assign, and use in a real plan.
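If you want a head start on that working list, a few lines of Python will bucket the export by status. This is a sketch only: the column names used here (`PipelineName`, `AssessmentResult`) are assumptions, not the documented export schema, so rename them to match what the CSV actually contains.

```python
import csv
from collections import defaultdict

def triage(csv_path):
    """Group an exported assessment CSV into per-status buckets.

    Assumes columns named 'PipelineName' and 'AssessmentResult';
    adjust both to the real export's headers before using.
    """
    buckets = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            buckets[row["AssessmentResult"]].append(row["PipelineName"])
    return dict(buckets)

# Print batch sizes so the phased plan starts from numbers, not memory:
# for status, names in sorted(triage("assessment.csv").items()):
#     print(f"{status}: {len(names)} pipelines")
```

From there, the `Ready` bucket is your pilot list and everything else gets an owner.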

The categories also give you an obvious first pass:

  • Ready: pilot batch.
  • Needs review: engineering work.
  • Coming soon: stop thrashing and wait for support to land.
  • Unsupported / Not compatible: redesign it.

The docs also recommend a phased approach. Start with Ready. Fix Needs review. Rerun the assessment. Sensible advice, which means some teams will try very hard to ignore it.

The Spark-specific catch is the part people will miss

If a Synapse pipeline calls Notebook activities or Spark job definition activities, Microsoft says to migrate those Spark artifacts to Fabric first.

That is the whole game for Spark teams.

If the matching Fabric notebooks or Spark job definitions already exist, the migration flow can map those activities to the Fabric items. If they do not exist yet, those activities may stay unmapped or deactivated until you create the Fabric items and update the references.

So a migrated pipeline is not automatically a runnable Spark workload. It may be a correctly copied orchestration layer that still points to nowhere useful. If your team blurs that line, you are not “almost done.” You are halfway to a very dumb cutover.

Connection mapping is where “migrated” stops meaning “ready”

The migration flow then asks you to pick a Fabric workspace and map Synapse linked services to Fabric connections.

Here the product does something smart. It does not force fake completeness. Pipelines can migrate even if not every connection is mapped. The catch is explicit: activities that use unmapped connections remain deactivated.

That is the right tradeoff. A deactivated activity is annoying. A silently broken run is worse.

This is where the human work starts:

  • make sure the right Fabric connections exist
  • validate credentials and access
  • check which activities are still deactivated
  • confirm notebook and Spark job references point to the intended Fabric items
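One way to keep that checklist honest is to track it as data rather than memory. The sketch below is purely illustrative: the fields (`active`, `connection_mapped`, `ref_exists`) are invented for this example and are not anything the migration tool emits, so you would populate them from your own review.

```python
def unfinished_work(activities):
    """Return activities that are not yet production-ready.

    Each activity is a dict with illustrative, invented fields:
    'name', 'active', 'connection_mapped', 'ref_exists'.
    """
    issues = []
    for a in activities:
        reasons = []
        if not a.get("active"):
            reasons.append("deactivated")
        if not a.get("connection_mapped"):
            reasons.append("connection unmapped")
        if not a.get("ref_exists"):
            reasons.append("Fabric item reference missing")
        if reasons:
            issues.append((a["name"], reasons))
    return issues
```

An empty result is your definition of done; anything else is the backlog.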

The tool can move metadata. It cannot tell you whether your team has actually finished the migration.

“Triggers disabled by default” is the best sentence in the whole thing

After migration, triggers are disabled by default.

Perfect.

That removes one of the most common migration failure modes: an artifact gets copied, a dependency gets missed, the schedule fires anyway, and now production is teaching everyone a lesson. Keeping triggers off buys you a clean validation window.

The post-migration guidance is refreshingly sane:

  1. Validate connections and credentials.
  2. Re-enable and configure triggers as needed.
  3. Run end-to-end tests.
  4. Validate in a nonproduction environment before switching production workloads.

That is the order. Not the other way around.

There is one smaller operational detail worth noting. Migrated pipelines appear in the Fabric workspace with the source factory name prefixed. That helps when you are reviewing a mixed estate and trying to keep lineage straight.

What this preview changes

It does not finish the migration for you. It does make the early part less chaotic.

You get a readiness assessment instead of guesswork. You get a phased path instead of a big-bang leap. You get visible connection mapping. You get deactivated activities when dependencies are missing. You get triggers held back until you choose to turn them on.

That is real value. It turns migration from “hope plus calendar pressure” into something you can audit.

A rollout pattern worth trusting

If I were running this for a production Fabric Spark estate, I would keep it brutally simple.

  1. Migrate notebooks and Spark job definitions to Fabric first.
  2. Run the pipeline assessment and export the CSV.
  3. Start with Ready pipelines that already have their Fabric Spark counterparts in place.
  4. Map linked services to Fabric connections and treat every deactivated activity as unfinished work.
  5. Run end-to-end tests in nonproduction. Compare outputs, parameters, logging, and failure handling.
  6. Re-enable triggers only after the pipeline and its Spark dependencies survive contact with reality.
  7. Then work through the Needs review backlog and rerun assessment as you clear items.

It is not glamorous. It is how you keep a migration from turning into a weekly apology.

The practical takeaway

This preview matters because it is honest about the order of operations.

For Spark-heavy Synapse estates, the job is not “move everything to Fabric.” The job is “move Spark artifacts first, move orchestration second, validate connections and behavior, then turn execution back on.” The new experience supports that sequence instead of pretending the sequence does not matter.

So no, this is not a teleportation device for legacy pipelines. It is a staging area with guardrails. For teams running Spark in production, that is much more useful.

This post was written with help from anthropic/claude-opus-4-6

What the February 2026 gateway release really means for Fabric Spark teams

Monthly gateway release posts are usually the corporate equivalent of dry toast. A version number appears. Power BI Desktop compatibility gets a polite bow. Then everyone goes back to moving data and arguing with refresh logs.

The February 2026 on-premises data gateway release is mostly that kind of update. Microsoft says the build is 3000.306, and the point is simple: keep the gateway aligned with the February 2026 Power BI Desktop release so reports refreshed through the gateway use the same query execution logic and runtime as Desktop.

Useful? Yes. Dramatic? Not even a little.

What makes this release worth a Spark team’s time is everything happening around it. In the last few months, Microsoft added manual gateway updates, shipped pipeline performance work in January, and expanded managed private endpoint guidance for Fabric Data Engineering workloads. Put together, those changes tell a clearer story than the February post does on its own: the gateway still matters, but it is no longer background plumbing you patch whenever someone remembers.

The February release itself is small

The official February announcement is short and very Power BI flavored. Version 3000.306 brings the gateway up to date with the February 2026 Power BI Desktop release. That matters if your Spark world touches gateway-mediated refresh or movement of data through Fabric services that depend on the gateway.

If your team uses Spark notebooks or Spark job definitions alongside pipelines, semantic models, or refresh paths that still run through the on-premises data gateway, version alignment is not glamorous, but it is part of keeping production boring. And boring is what you want from production. “Interesting” is how incident reviews begin.

There is also an awkward timing detail here. The Microsoft Learn page for supported gateway versions already lists March 2026, build 3000.310, as the latest supported update. So if you are making an upgrade decision today, the practical move is not to cling to 3000.306 out of loyalty to February. The real lesson from February is that the monthly update train keeps moving, and Spark teams need an operating habit for that cadence.

December changed the maintenance story

The bigger operational shift arrived in the December 2025 release, build 3000.298. That release introduced Manual Update for On-premises Data Gateway in preview. Microsoft says admins can trigger updates from the gateway UI or programmatically through API or script, and the related documentation shows the PowerShell path with Update-DataGatewayClusterMember.

That may sound like a small administrative nicety. It is not. It is the difference between “we update the gateway when someone notices” and “we update the gateway during a planned window, on purpose, with a record of what happened.”

Microsoft’s update documentation is blunt about why this matters in clusters. When gateway members run different versions, you can get sporadic failures because one member can handle a query that another cannot. The guidance is to disable one member, let the work drain, update it, re-enable it, and repeat for the rest of the cluster. That is not fancy advice. It is good advice. Production systems usually break in ordinary, irritating ways.

Two details matter:

  • The November 2025 release is the baseline for the manual update feature.
  • Microsoft says the updater service activates only when an update is triggered from the UI or via PowerShell.

In other words, December did not add one more button. It added a more controlled update path for teams that have to care about maintenance windows, change management, and not getting yelled at on a Friday night.

January made the gateway more relevant to pipeline-heavy Spark teams

The January 2026 release, build 3000.302, was modest on paper but more interesting in practice. Microsoft called out two improvements:

  • Performance optimization for reading CSV format in Copy job and Pipeline activities
  • Performance optimization for read and write through adaptive performance tuning capability in Pipeline

That is not a fireworks show, but it is more concrete than the average release note. If your Fabric Spark workflow begins with Copy jobs or Pipeline activities that pull CSV-shaped data before Spark takes over, January was the sort of release you should benchmark instead of shrugging at.

Notice what Microsoft did not say: there is no grand promise that everything is suddenly twice as fast and angels now sing over your lakehouse. Fine. Release notes rarely sing. Still, when a gateway sits in front of repetitive ingestion work, even a dull-sounding optimization can shave time off every run. Boring improvements are often the ones that pay rent.
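"Benchmark it" can be as simple as timing the same ingestion run before and after the gateway upgrade and comparing medians. A minimal harness, with the workload stubbed out as a callable since your actual Copy job or Pipeline trigger lives outside Python:

```python
import statistics
import time

def time_runs(workload, runs=5):
    """Time a callable several times; return (median_seconds, samples).

    'workload' stands in for whatever kicks off your ingestion run.
    The median resists the occasional outlier run better than the mean.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples
```

Record the numbers per build. "It felt faster" is not a benchmark; two medians and a date are.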

Spark teams now have a second route for on-premises access

The most interesting shift is not in the gateway release notes at all. It is in Fabric’s managed private endpoint work for Data Engineering workloads.

Microsoft’s October 2025 Fabric blog post says Managed Private Endpoints support for connecting to Private Link Services became available through the Fabric Public REST APIs, specifically to help Fabric Spark compute reach on-premises and network-isolated data sources. The newer Learn guidance goes further: Fabric workloads such as Spark or Data Pipelines can connect to on-premises or custom-hosted sources through an approved Private Link setup, with traffic flowing through the Microsoft backbone network rather than the public internet.

That is a real architectural fork in the road.

If your team has treated the on-premises data gateway as the default answer to any sentence containing the words “on-premises” and “Fabric,” that default deserves another look. The managed private endpoint docs say that, once approved, Fabric Data Engineering workloads such as notebooks, Spark job definitions, materialized lakeviews, and Livy endpoints can securely connect to the approved resource.

That does not kill the gateway. It does mean the gateway is no longer the only respectable adult in the room.

There is also one gotcha that will ambush people who like clicking around until things work. Microsoft says creating a managed private endpoint with a fully qualified domain name through Private Link Service is supported only through the REST API, not the UX. So if your plan is “we’ll set it up later in the portal,” later may arrive carrying disappointment.

What a Fabric Spark team should do next

If I were cleaning this up for a real production team, the to-do list would look like this:

  1. Check the supported monthly updates page before touching anything. As of late March 2026, it already lists March 2026, build 3000.310, as the newest supported gateway release.
  2. If you run a gateway cluster, stop tolerating version drift. Follow Microsoft’s member-by-member update guidance so one node does not become the office goblin that fails queries the others can run.
  3. If you want controlled upgrades, confirm your gateways are on the November 2025 baseline or later, then script manual updates with Update-DataGatewayClusterMember.
  4. Inventory which Spark-adjacent workloads really need the gateway and which ones are gateway-shaped only because nobody revisited the design.
  5. For Spark or Data Pipeline scenarios that need private access to on-premises or custom-hosted sources, evaluate managed private endpoints and Private Link Service instead of assuming the gateway must stay in the middle.
  6. If your ingestion path leans on CSV through Copy jobs or Pipeline activities, test the January build improvements against your actual workloads rather than trusting vague optimism.

One more limitation matters here. The managed private endpoint overview says the feature depends on Fabric Data Engineering workload support in both the tenant home region and the capacity region. So before anyone gives a triumphant architecture presentation, check whether your region setup actually supports what you plan to do.

The short version

The February 2026 gateway release is a small compatibility release. On its own, it would barely justify a coffee break. For Fabric Spark teams, though, it lands in the middle of a more meaningful change.

Gateway maintenance is becoming easier to control. Pipeline-oriented gateway work picked up performance tuning in January. And Spark workloads now have a documented private-connectivity path that can bypass the old habit of stuffing every on-premises access pattern through the gateway.

So no, February 2026 was not a blockbuster. It was a signpost. The smart move is to stop treating the gateway as an untouchable default, update it like you mean it, and decide workload by workload whether Spark still needs that middleman.

This post was written with help from anthropic/claude-opus-4-6

Operationalizing the semantic model permissions update for Fabric data agents

Permissions in data platforms have a remarkable talent for turning a two-minute job into a small municipal drama. You want one ordinary thing. The system hands you a form, a role, a workspace, another role, and, sooner or later, a person named Steve who is out until Thursday.

Starting April 6, 2026, Microsoft Fabric removes one of those little absurdities. Creators and consumers of Fabric data agents need only Read access on the semantic model to use it through a data agent. Workspace access is no longer required.

Small sentence. Large relief.

Why this matters

Fabric data agents use Azure OpenAI to interpret a user’s question, choose the most relevant source, and generate, validate, and execute the query needed to answer it. That source might be a lakehouse, warehouse, Power BI semantic model, KQL database, or ontology.

So the agent is already doing the interesting work. It is translating a human question into something a data system can answer. Requiring extra workspace access just to reach a semantic model added bureaucracy to the wrong layer.

The change, plainly

The official change is simple: beginning April 6, creators and consumers only need Read access on the semantic model to interact with it through a Fabric data agent. The older workspace access and Build permission hurdle disappears for this path.

If you have ever untangled access requests, you can probably hear the sigh from here.

What to do with that information

The first operational question is not “What new permission do I need?” It is “Which workspace grants exist only because the old rule forced them?”

Start there.

  • List the semantic models your data agents use.
  • Identify users or groups with workspace access granted only for those agent scenarios.
  • Test the new flow with a read-only user as April 6 approaches.
  • After the change lands, remove workspace access that no longer serves a separate purpose.
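That last step is set arithmetic, which is worth writing down so nobody does it by eyeball. A sketch, with every name invented for illustration:

```python
def grants_to_revoke(workspace_grants, agent_only_users, other_purposes):
    """Workspace grants that existed only to satisfy the old rule.

    workspace_grants: users with workspace access today
    agent_only_users: users granted access only for data agent scenarios
    other_purposes:   users who still need workspace access for other reasons
    """
    return (set(workspace_grants) & set(agent_only_users)) - set(other_purposes)
```

Anyone left in the result keeps Read on the semantic model and loses the workspace grant.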

This is not glamorous work. Neither is plumbing, and everyone suddenly develops strong feelings about plumbing when it breaks.

The part people will miss

One detail matters more than the permission change itself. When a Fabric data agent generates DAX for a semantic model, it relies only on the model’s metadata and Prep for AI configuration. It ignores instructions added at the data agent level for DAX query generation.

That puts responsibility where it belongs: on the model.

If a business user asks a sensible question and gets a crooked answer, the fix is usually not a cleverer agent prompt. The fix is to improve what the model gives the agent to work with: the metadata and the Prep for AI setup.

That is the real operational shift. Access gets easier. Model preparation matters more.

A sensible rollout

If you own Fabric governance, keep the rollout dull and methodical.

  • Review which data agents rely on semantic models.
  • Retest those scenarios with users who have Read access on the model and no workspace access.
  • Inspect the models that produce weak DAX and improve the metadata and Prep for AI configuration they expose.
  • Clean up workspace permissions that were granted only to satisfy the old requirement.

Nobody frames that checklist and hangs it in the lobby. It still gets the job done.

The useful conclusion

The best part of this update is that it removes a fake dependency. A data agent that can answer questions from a semantic model should not require a side trip through workspace permissions.

The catch is that the agent still cannot invent a well-prepared model out of thin air. Fabric has made access lighter. It has also made the remaining truth easier to see: if you want better answers, the semantic model has to be ready for the job.

Which is, frankly, how this should have worked all along.

This post was written with help from anthropic/claude-opus-4-6

What “Recent data” in Fabric means for Spark teams when time is the real bottleneck

At 8:07 a.m., nobody on a data engineering team is debating architecture purity. You’re trying to get back to the exact source you were fixing yesterday before another downstream notebook fails and somebody asks for an ETA.

That’s the problem Microsoft Fabric’s Recent data feature targets.

The feature landed in the February 2026 Fabric update and is currently in preview. It sounds small: Dataflow Gen2 remembers the specific items you used recently — tables, files, folders, databases, and sheets — and lets you load them directly into the editing canvas. For Spark-heavy teams, though, this is less of a UX tweak and more of a way to stop bleeding time in the first mile of work.

And yes, it’s still a preview feature. Treat it like a mountain route in unstable weather: useful, fast, and not something you trust blindly.

Why Spark teams should care about a Dataflow feature

A lot of Spark teams still frame Dataflow Gen2 as somebody else’s tool. That framing is outdated.

Dataflow Gen2 automatically creates staging Lakehouse and Warehouse items in your workspace. If your team’s workflow includes Dataflow-based ingestion and Spark-based transformation, the handoff between those steps is real. It’s your daily route.

Here’s the hard lesson: if your ingestion layer touches Dataflow Gen2, then UI friction inside Dataflow is your Spark team’s problem too.

What to do about it:

  • Write down your ingestion handoffs in plain language: source to Dataflow Gen2 to staging Lakehouse/Warehouse to Spark notebooks.
  • Mark where engineers repeatedly reconnect to the same sources. That’s where Recent data pays off first.

What Recent data changes under pressure

Recent data does one thing that matters: it remembers specific assets, not just abstract connections.

When you return to a fix, you’re not restarting the expedition from base camp. You get dropped closer to the problem. You can pull the item directly into the editing canvas and keep moving.

For teams, this changes the rhythm of incident response and iteration:

  • You get back to source-level corrections faster.
  • You reduce the chance that someone reconnects to the wrong similarly named object while moving too fast.
  • You spend less team energy on navigation and more on data correctness.

None of this is glamorous. It’s also exactly where engineering throughput gets won.

Try this: during your next defect cycle, track one metric for a week — time from “issue found” to “source query/table reopened in Dataflow Gen2.” If that number drops after using Recent data, keep leaning in. If it doesn’t, your bottleneck is elsewhere.
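Tracking that metric does not need tooling; a log of timestamp pairs and a few lines of Python will do. A sketch, with the log format invented for the example:

```python
from datetime import datetime
from statistics import median

def median_minutes(pairs):
    """Median minutes from 'issue found' to 'source reopened'.

    pairs: list of (found_iso, reopened_iso) timestamp strings,
    e.g. ("2026-02-03T08:07:00", "2026-02-03T08:19:00").
    """
    deltas = [
        (datetime.fromisoformat(b) - datetime.fromisoformat(a)).total_seconds() / 60
        for a, b in pairs
    ]
    return median(deltas)
```

One week before, one week after, two medians. That is the whole experiment.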

What this feature doesn’t rescue you from

Teams love to over-credit new features. Recent data is a navigation accelerator. It’s not governance. It’s not validation. It’s not a replacement for naming discipline. And because it’s in preview, it’s not a foundation for critical operational assumptions.

If your source naming is chaotic, Recent data will surface chaos faster.

If your validation is weak, Recent data will help you ship mistakes sooner.

If your runbooks are vague, Recent data won’t magically teach new engineers what “correct” looks like.

Pair it with a minimum Spark validation pass after ingestion updates: schema check, null expectation, row-count sanity check. Keep this lightweight and repeatable. The point is fast feedback, not ceremony.
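What that pass looks like depends on your stack, but the logic is simple enough to express without Spark. A pure-Python sketch of the three checks, operating on a schema dict and a materialized sample of rows; the thresholds are placeholders you would set per table:

```python
def validate(schema, expected_schema, rows, min_rows, non_null_cols):
    """Minimal post-ingestion gate: schema, nulls, row count.

    schema / expected_schema: {column_name: type_name} dicts
    rows: materialized sample (list of dicts)
    Returns a list of failure messages; an empty list means pass.
    """
    failures = []
    if schema != expected_schema:
        failures.append(f"schema drift: {schema} != {expected_schema}")
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below floor {min_rows}")
    for col in non_null_cols:
        nulls = sum(1 for r in rows if r.get(col) is None)
        if nulls:
            failures.append(f"{nulls} nulls in required column {col}")
    return failures
```

In a notebook you would feed this from `df.schema`, `df.count()`, and a small sample; the point is that the gate runs every time, not that it is clever.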

Preview discipline: run this like a survival checklist

Because Recent data is in preview, your team should operate with explicit guardrails.

Test in development first. Don’t roll workflow assumptions into production muscle memory before your team has used the feature in real edits.

Keep a source-of-truth map. Recent data is convenience. Your documented source map is control. Keep both.

Standardize names now. If a human can confuse two source objects at a glance, they will. Fix names before speed amplifies mistakes.

Define a fallback path. If the recent list doesn’t have what you need, nobody should improvise. Document the manual reconnect path and keep it current.

Review preview behavior monthly. If the feature behavior shifts while in preview, your team should notice fast and adjust intentionally. Assign one owner for “preview watch” each month. Their job: test the core flow, confirm assumptions still hold, alert the team if anything drifts.

The operating model for Spark leads

If you lead a Spark data engineering team, the decision is straightforward.

Use Recent data. Absolutely use it. But use it like a rope, not like wings.

A rope gets you through rough terrain faster when the team is clipped in, communicating, and following route discipline. Wings are what people imagine they have right before they step into empty air.

In practice:

  • Adopt the feature for speed.
  • Keep your documentation for continuity.
  • Keep naming conventions strict for safety.
  • Keep Spark-side validation for quality.
  • Treat preview status as a real risk signal, not legal fine print.

That combination is where this feature becomes meaningful. Not because it’s flashy. Because it removes repeated friction at exactly the point where your team loses focus, burns time, and compounds small mistakes.

In data engineering, the catastrophic failures usually start as tiny oversights repeated at scale. Recent data removes one class of those oversights — the constant re-navigation tax — but only if you wrap it in disciplined operating habits.

One less avoidable stumble on steep ground, so your team can spend its strength on the parts of the climb that actually require judgment.


This post was written with help from anthropic/claude-opus-4-6

Open Mirroring + OneLake: Spark patterns that keep latency from eating your weekends

Dev is clean. Prod is chaos. In dev, your mirrored table has a cute little dataset and Spark tears through it. In prod, that same notebook starts wheezing like it ran a marathon in wet jeans.

If that sounds familiar, good. You’re not cursed. You’re running into architecture debt that Open Mirroring does not solve for you.

Open Mirroring in Microsoft Fabric does exactly what it says on the tin: it replicates data from external systems into OneLake as Delta tables, and schema changes in the source can flow through. That’s huge. It cuts out a pile of ingestion plumbing.

But mirroring only lands data. It does not guarantee your Spark reads will be fast, stable, or predictable. That’s your job.

This post is the practical playbook: what breaks, why it breaks, and the patterns that keep your Spark jobs from turning into slow-motion disasters.

first principle: mirrored is a landing zone, not a serving layer

Treat mirrored tables like an airport runway. Planes touch down there. People do not set up a picnic on the tarmac.

When teams read mirrored tables directly in hot-path jobs, they inherit whatever file layout the connector produced. Sometimes that layout is fine. Sometimes it is a junk drawer.

Spark is sensitive to this. Reading many tiny files creates scheduling and metadata overhead. Reading a few huge files kills parallelism. Either way, the cluster burns time doing the wrong work.

The fix is boring and absolutely worth it: add a curated read layer.

  1. Let Open Mirroring write into a dedicated mirror lakehouse.
  2. Run a post-mirror notebook that reshapes data for Spark (partitioning, compaction, cleanup).
  3. Have production notebooks read curated tables only.

One extra hop. Much better nights of sleep.

what actually causes the latency cliff

Two things usually punch you in the face at scale:

  • File layout drift
  • Schema drift

Let’s tackle them in order.

1) file layout drift (the silent killer)

Spark scheduling is roughly file-driven for Parquet/Delta scans. That means file shape becomes execution shape. If your table has wildly uneven files, your job speed is set by the stragglers.

Think of ten checkout lanes where nine customers have one item and one customer has a full garage sale cart. Everyone waits on that last lane.

Start by measuring file distribution, not just row counts.

from pyspark.sql import functions as F

# NOTE: inputFiles() returns a Python list of file paths
df = spark.read.format("delta").load("Tables/raw_mirrored_orders")
paths = df.inputFiles()

# Use Hadoop FS to get file sizes in bytes
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)

sizes = []
for p in paths:
    size = fs.getFileStatus(jvm.org.apache.hadoop.fs.Path(p)).getLen()
    sizes.append((p, size))

size_df = spark.createDataFrame(sizes, ["path", "size_bytes"])

size_df.select(
    F.count("*").alias("file_count"),
    F.round(F.avg("size_bytes")/1024/1024, 2).alias("avg_mb"),
    F.round(F.expr("percentile_approx(size_bytes, 0.5)")/1024/1024, 2).alias("p50_mb"),
    F.round(F.expr("percentile_approx(size_bytes, 0.9)")/1024/1024, 2).alias("p90_mb"),
    F.round(F.max("size_bytes")/1024/1024, 2).alias("max_mb")
).show(truncate=False)


You want a tight-ish band, not chaos. A common rule of thumb is targeting roughly 128 MB to 512 MB Parquet files for balanced throughput and parallelism. Rule of thumb, not religion. Your workload decides final tuning.
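Once you have those sizes, deciding "tight-ish band or chaos" can be mechanical. A small helper over the measured sizes in bytes; the 128 MB to 512 MB band and the 4x skew ratio are the rule-of-thumb numbers from above, not hard limits:

```python
def layout_report(sizes_bytes, lo_mb=128, hi_mb=512):
    """Flag file-layout problems from a list of file sizes in bytes."""
    mb = sorted(s / (1024 * 1024) for s in sizes_bytes)
    n = len(mb)
    p50 = mb[n // 2]
    p90 = mb[min(n - 1, int(n * 0.9))]
    flags = []
    if p50 < lo_mb:
        flags.append("files skew small: compaction needed")
    if p90 > hi_mb:
        flags.append("files skew large: split or repartition")
    if p50 > 0 and p90 / p50 > 4:
        flags.append("uneven distribution: stragglers likely")
    return flags
```

An empty report means move on; anything else means fix layout before touching compute.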

Then enforce a sane shape in curated tables:

raw = spark.read.format("delta").load("Tables/raw_mirrored_orders")

(raw.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")         # choose columns your queries actually filter on
    .option("maxRecordsPerFile", 500000)
    .save("Tables/curated_orders"))

spark.sql("OPTIMIZE delta.`Tables/curated_orders`")


If your queries filter by date and region, but you partition by customer_id because it “felt right,” you built a latency trap with your own hands.

2) schema drift (the 3 a.m. pager)

Open Mirroring can propagate source schema changes. That’s useful and dangerous.

Useful because your lake stays aligned. Dangerous because downstream logic often assumes a fixed shape.

A nullable column addition is usually fine. A type shift on a key metric column can quietly corrupt aggregations or explode at runtime.

Do not “notice this later.” Gate on it.

from pyspark.sql.types import StructType
import json

# Store baseline schema as JSON in Files/schemas/orders_baseline.json
with open("/lakehouse/default/Files/schemas/orders_baseline.json", "r") as f:
    baseline = StructType.fromJson(json.load(f))

current = spark.read.format("delta").load("Tables/raw_mirrored_orders").schema

base = {f.name: str(f.dataType) for f in baseline.fields}
curr = {f.name: str(f.dataType) for f in current.fields}

type_changes = [
    f"{name}: {base[name]} -> {curr[name]}"
    for name in curr
    if name in base and base[name] != curr[name]
]

new_cols = [name for name in curr if name not in base]

if type_changes:
    raise ValueError(f"Schema type changes detected: {type_changes}")

# Optional policy: allow new nullable columns but log them
if new_cols:
    print(f"New columns detected: {new_cols}")


Policy matters more than code here. Decide in advance what is auto-accepted versus what blocks the pipeline. Write it down. Enforce it every run.
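"Write it down" can literally mean encoding the policy. One example policy, sketched below: type changes and dropped columns block the run, new columns are accepted but logged. The categories are illustrative, not a recommendation for your data.

```python
def drift_decision(type_changes, new_cols, dropped_cols):
    """Return ('block', reasons) or ('accept', notes) per a written policy.

    Example policy only: type changes and dropped columns block;
    new columns pass but are surfaced for review.
    """
    reasons = []
    if type_changes:
        reasons.append(f"type changes: {type_changes}")
    if dropped_cols:
        reasons.append(f"dropped columns: {dropped_cols}")
    if reasons:
        return ("block", reasons)
    return ("accept", [f"new columns: {new_cols}"] if new_cols else [])
```

The value is not the ten lines of code. It is that the pipeline enforces the same answer at 3 a.m. that the team agreed on at 3 p.m.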

lag is real, even when everything is healthy

Mirroring pipelines are replication systems, not teleportation devices. There is always some delay between source commit and mirrored availability. Sometimes tiny. Sometimes not.

If your job blindly processes “last hour” windows without checking mirror freshness, you’ll create holes and call them “data quality issues” three weeks later.

Add a freshness gate before transformations. The metadata source is connector-specific, but the pattern is universal:

from datetime import datetime, timedelta, timezone

# Example only: use the metadata table/view exposed by your mirroring setup
last_mirror_ts = spark.sql("""
  SELECT max(replication_commit_ts) as ts
  FROM mirror_metadata.orders_status
""").collect()[0]["ts"]

required_freshness = datetime.now(timezone.utc) - timedelta(minutes=15)

if last_mirror_ts is None or last_mirror_ts < required_freshness:
    raise RuntimeError(
        f"Mirror not fresh enough. Last commit: {last_mirror_ts}, required after: {required_freshness}"
    )


No freshness, no run. That one line saves you from publishing confident nonsense.

the production checklist (use this before go-live)

Before promoting any mirrored-data Spark pipeline, run this checklist in the same capacity and schedule window as production:

  • File shape check
    – Measure file count and distribution (p50, p90, max).
    – If distribution is ugly, compact and rewrite in curated.
  • Partition sanity check
    – Confirm partitions match real filter predicates.
    – Use df.explain(True) and verify PartitionFilters is not empty for common queries.
  • Schema gate check
    – Compare current schema to baseline.
    – Fail on type changes unless explicitly approved.
  • Freshness gate check
    – Validate mirrored data is fresh enough for your downstream SLA.
    – Fail fast if not.
  • Throughput reality check
    – Time representative full and filtered scans from curated tables.
    – If runtime misses SLA, fix layout first, then tune compute.
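The partition sanity check is easy to script: df.explain(True) prints the physical plan to stdout, so you can capture it and fail when PartitionFilters comes back empty. A sketch, with the table name and predicate in the usage comment purely illustrative:

```python
import io
import contextlib

def partition_pruning_missing(plan_text: str) -> bool:
    """True when the physical plan shows an empty PartitionFilters list."""
    return "PartitionFilters: []" in plan_text

# Usage inside a notebook (table and predicate are illustrative):
#   buf = io.StringIO()
#   with contextlib.redirect_stdout(buf):
#       spark.table("curated.orders").filter("order_date = '2024-01-01'").explain(True)
#   if partition_pruning_missing(buf.getvalue()):
#       raise RuntimeError("Expected partition pruning on order_date, got a full scan")
```

Run it for each of your common predicates as part of the go-live checklist, not just once.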

If you only do one thing from this list, do the curated layer. Direct reads from mirrored tables are the root of most performance horror stories.

architecture that holds up when volume gets ugly

Keep it simple:

  1. Mirror layer
    Open Mirroring lands source data in OneLake Delta tables. This is your raw replica.

  2. Curation job
    A scheduled Spark notebook validates schema, reshapes partitions, and compacts files.

  3. Curated layer
    Downstream Spark notebooks and SQL consumers read here, not from mirror tables.

  4. Freshness gate
    Every downstream run checks replication freshness before processing.

That’s it. No heroics. No mystery knobs. Just a clean boundary between “data landed” and “data is ready to serve.”

Open Mirroring is genuinely powerful, but it is not magic. If you treat mirrored tables as production-ready serving tables, latency will eventually kneecap you. If you treat them as a landing zone and curate aggressively, Spark behaves, stakeholders stay calm, and your weekends stay yours.

This post was written with help from anthropic/claude-opus-4-6

What “Execute Power Query Programmatically” Means for Fabric Spark Teams

Somewhere in a Fabric workspace right now, two teams are maintaining the same transformation twice.

The BI team owns it in Power Query. The Spark team rewrote it in PySpark so a notebook could run it on demand. Both versions work. Both versions drift. Both versions break at different times.

That was normal.

Microsoft’s new Execute Query API (preview) is the first real shot at ending that duplication. It lets you execute Power Query (M) through a public REST API from notebooks, pipelines, or any HTTP client, then stream results back in Apache Arrow format.

For Spark teams, this isn’t a minor feature. It changes where transformation logic can live.

What actually shipped

At a technical level, the API is simple:

  • Endpoint: POST /v1/workspaces/{workspaceId}/dataflows/{dataflowId}/executeQuery
  • Input: a queryName, with optional customMashupDocument (full M script)
  • Output: Arrow stream (application/vnd.apache.arrow.stream)
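For concreteness, a request body following that shape might look like this (the query name is invented; customMashupDocument is shown commented out because it is optional):

```python
# queryName must match a query defined in the dataflow; customMashupDocument,
# when present, supplies a full M script instead (illustrative values)
request_body = {
    "queryName": "CustomerEnrichment",
    # "customMashupDocument": "section Section1; shared CustomerEnrichment = ...;",
}
```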

The execution context comes from a Dataflow Gen2 artifact in your workspace. Its configured connections determine what data sources the query can access and which credentials are used.

That single detail matters more than it looks. You’re not just “calling M from Spark.” You’re running M under dataflow-governed connectivity and permissions.

Why Spark engineers should care

Before this API, Spark teams usually had two options:

  • Rewrite M logic in PySpark
  • Or wait for a dataflow refresh and consume the output later

Neither is great. Rewrites create long-term maintenance debt. Refresh handoffs add latency and orchestration fragility.

Now you can execute the transformation inline and keep moving.

A minimal call path looks like this:

import requests
import pyarrow as pa

# url, headers, and request_body are assumed to be built earlier:
# url targets the executeQuery endpoint, headers carry a bearer token,
# and request_body names the query (see "What actually shipped" above)
response = requests.post(url, headers=headers, json=request_body, stream=True)

# The response is an Arrow IPC stream; read it straight into pandas
with pa.ipc.open_stream(response.raw) as reader:
    pandas_df = reader.read_pandas()

spark_df = spark.createDataFrame(pandas_df)


No CSV hop. No JSON schema drift. No custom parsing layer.

The non-negotiable constraints

This feature is useful, but it is not magic. There are hard boundaries.

  1. 90-second timeout
    – Query evaluations must complete within 90 seconds.
    – This is ideal for fast lookups, enrichment, and reference joins—not heavy batch reshaping.

  2. Read-only execution
    – The API executes queries only. It doesn’t support write actions.
    – If your notebook flow assumes “query + write” in one API step, redesign it.

  3. Native query rule for custom mashups
    – customMashupDocument does not allow native database queries.
    – But if a query defined inside the dataflow itself uses native queries, that query can be executed.
    – This distinction will trip people if they treat inline M and stored dataflow queries as equivalent.

  4. Performance depends on folding and query complexity
    – Bad folding or expensive transformations can burn your 90-second window quickly.
    – You need folding-aware query reviews before production rollout.

Practical rollout plan for Spark teams

If you lead a Fabric Spark team, do this in order.

1) Inventory duplication first

Build a short list of transformations currently duplicated between M and PySpark. Start with transformations that are stable, reused often, and mostly read-oriented.

2) Stand up a dedicated execution dataflow

Create one Dataflow Gen2 artifact specifically for API-backed execution contexts.

  • Keep connections explicit and reviewed
  • Restrict who can modify those connections
  • Treat the artifact like infrastructure, not ad hoc workspace clutter

3) Wrap Execute Query behind one notebook utility

Don’t let every notebook hand-roll HTTP logic. Create one shared helper that handles:

  • token acquisition
  • request construction
  • Arrow stream parsing
  • error handling
  • timeout/response logging

If the API returns 202 (long-running operation), honor Location and Retry-After instead of guessing polling behavior.
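That 202 handling can be sketched with the HTTP calls abstracted behind callables, which keeps the retry logic unit-testable. Everything here is an assumption about shape, not the API's contract, except the Location and Retry-After headers named above:

```python
import time

def retry_after_seconds(headers, default=5):
    """Parse Retry-After as integer seconds; fall back to a default."""
    try:
        return max(0, int(headers.get("Retry-After", default)))
    except (TypeError, ValueError):
        return default

def follow_until_ready(post_fn, get_fn, max_wait_s=120, sleep_fn=time.sleep):
    """Hypothetical LRO helper: POST once; while the service answers 202,
    honor Retry-After and poll the Location URL until a terminal response
    arrives or the wait budget runs out. post_fn/get_fn return objects with
    .status_code and .headers (e.g. requests responses)."""
    resp = post_fn()
    waited = 0
    while resp.status_code == 202 and waited < max_wait_s:
        delay = retry_after_seconds(resp.headers)
        sleep_fn(delay)
        waited += delay
        resp = get_fn(resp.headers["Location"])
    return resp
```

Wiring real requests.post/requests.get calls into post_fn and get_fn is a two-line wrapper, and your shared notebook utility owns it in exactly one place.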

4) Add governance checks before scale

Because execution runs under dataflow connection scope, validate:

  • who can execute
  • what connections they indirectly inherit
  • which data sources become reachable through that path

If your governance model assumes notebook identity is the only control plane, this API changes that assumption.

5) Monitor capacity from day one

Microsoft surfaces this usage in Capacity Metrics as “Dataflows Gen2 Run Query API”, billed on the same meter family as Dataflow Gen2 refresh operations. Watch this early so you don’t discover new spend after adoption is already wide.

Where this fits (and where it doesn’t)

Use it when you need:

  • shared transformation logic between BI and engineering
  • fast, read-oriented query execution from Spark/pipelines/apps
  • connector and gateway reach already configured in dataflows

Avoid it when you need:

  • long-running transformations
  • write-heavy jobs
  • mission-critical production paths with zero preview risk tolerance

The REST API docs still mark this as preview and “not recommended for production use.” Treat that warning as real, not ceremonial.

The organizational shift hiding behind the API

The technical win is straightforward: fewer rewrites, faster integration, cleaner data handoffs.

The harder change is social.

When Spark notebooks can directly execute M, ownership lines between BI and data engineering need to be explicit. Who owns business logic? Who owns runtime reliability? Who approves connection scope?

Teams that answer those questions early will move fast.

Teams that don’t will just reinvent the same duplication problem with a new endpoint.


Source notes

This post was written with help from anthropic/claude-opus-4-6.

Keeping Spark, OneLake, and Mirroring Reliable in Microsoft Fabric

The alert fired at 2:14 AM on a Tuesday. A downstream Power BI report had gone stale — the Direct Lake dataset hadn’t refreshed in six hours. The on-call engineer opened the Fabric monitoring hub and found a cascade: three Spark notebooks had completed without triggering downstream freshness checks, a mirrored database was five hours behind, and the OneLake shortcut connecting them was returning intermittent 403 errors. It went undetected until a VP’s morning dashboard showed yesterday’s numbers.

That scenario is stressful, but it’s also solvable. These issues are usually about observability gaps between services, not broken fundamentals. If you’re running Spark workloads against OneLake with mirroring in Microsoft Fabric, you’ll likely encounter some version of this under real load. The key is having an operational playbook before it happens.

What follows is that playbook — assembled from documented production incidents, community post-mortems, and repeatable operating patterns from teams running this architecture at scale.

How Spark, OneLake, and mirroring connect (and where they don’t)

The dependency chain matters because issues can cascade through it in non-obvious ways.

Your Spark notebooks write Delta tables to OneLake lakehouses. Those tables might feed Direct Lake datasets in Power BI. Separately, Mirroring can replicate data from external sources — Azure SQL Database, Cosmos DB, Snowflake, and others — into OneLake as Delta tables. Shortcuts bridge lakehouses or reference external storage.

What makes this operationally nuanced: each layer has its own retry logic, auth tokens, and completion semantics. A Spark job can succeed from its own perspective (exit code 0, no exceptions) while the data it wrote is temporarily unavailable to downstream consumers because of a metadata sync delay. Mirroring can pause during source throttling and may not raise an immediate alert unless you monitor freshness directly. Shortcuts can go stale when target workspace permissions change.

You can end up with green pipelines and incomplete data. The gap between “the job ran” and “the data arrived correctly” is where most reliability work lives.

Detection signals you actually need

The first mistake teams make is relying on Spark job status alone. A job that completes successfully but writes zero rows, hits an unmonitored schema drift, or writes to the wrong partition is still a data quality issue.

Here’s what to watch instead:

Row count deltas. After every notebook run, compare the target table’s row count against expected intake. It doesn’t need to be exact — a threshold works. If the delta table grew by less than 10% of its average daily volume, fire a warning. Three lines of Spark SQL at the end of your notebook. Five minutes to implement. It prevents empty-table surprises at 9 AM.
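The threshold logic itself is tiny; the value is that it runs on every cycle. A sketch, using the 10% ratio from above, with the Spark SQL source shown as a comment:

```python
def below_expected_growth(prev_count: int, curr_count: int,
                          avg_daily_rows: float, min_ratio: float = 0.10) -> bool:
    """True when the table grew by less than min_ratio of its average daily volume."""
    return (curr_count - prev_count) < min_ratio * avg_daily_rows

# In the notebook, the counts would come from Spark SQL, e.g. (names illustrative):
#   curr = spark.sql("SELECT count(*) AS c FROM curated.orders").collect()[0]["c"]
# then fire a warning when below_expected_growth(prev, curr, avg_daily) is True.
```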

OneLake file freshness. The _delta_log folder in your lakehouse tables contains JSON commit files with timestamps. If the most recent commit is older than your pipeline cadence plus a reasonable buffer, investigate. A lightweight monitoring notebook that scans these timestamps across key tables takes about twenty minutes to build.
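Each numbered .json file in _delta_log is newline-delimited JSON, and the commitInfo action carries a millisecond timestamp. A parsing sketch (reading the file contents out of OneLake is left to whatever storage client you already use):

```python
import json

def commit_timestamp_ms(commit_file_text: str):
    """Return the commitInfo timestamp (ms since epoch) from one Delta commit
    file's text, or None if the file carries no commitInfo action."""
    for line in commit_file_text.splitlines():
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "commitInfo" in action:
            return action["commitInfo"].get("timestamp")
    return None
```

Point it at the highest-numbered commit file per table, compare against your cadence plus buffer, and alert on the stale ones.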

Mirroring lag via canary rows. The monitoring hub shows mirroring status, but the granularity is coarse. For external databases, set up a canary: a table in your source that gets a timestamp updated every five minutes. Check that timestamp on the OneLake side. If the gap exceeds your SLA, you know mirroring is stalled before your users do.

Shortcut health checks. Shortcuts can degrade quietly when no direct check exists. A daily job that reads a single row from each shortcut target and validates the response catches broken permissions, expired SAS tokens, and misconfigured workspace references before they cause real damage.

Failure mode 1: the Spark write that succeeds but isn’t queryable yet

You’ll see this in Fabric notebook logs as a clean run. The Spark job processed data, performed transformations, called df.write.format("delta").mode("overwrite").save(). Exit code 0. But the data isn’t queryable from the SQL analytics endpoint, and Direct Lake still shows stale numbers.

What happened: the SQL analytics endpoint runs a separate metadata sync process that detects changes committed to lakehouse Delta tables. According to Microsoft’s documentation, under normal conditions this lag is less than one minute. But it can occasionally fall behind — sometimes significantly. The Fabric community has documented sync delays stretching to hours, particularly during periods of high platform load or when tables have large numbers of partition files.

This is the gap that catches teams off guard. The Delta commit landed in storage, but the SQL endpoint hasn’t picked it up yet.

Triage sequence:

  1. Open the lakehouse in Fabric and check the table directly via the lakehouse explorer. If the data appears there but not in the SQL endpoint, you’ve confirmed a metadata sync lag.
  2. Check Fabric capacity metrics. If your capacity is throttled (visible in the admin portal under capacity management), metadata sync can be deprioritized. Burst workloads earlier in the day can surface as sync delays later.
  3. Force a manual sync. In the SQL analytics endpoint, select “Sync” from the table context menu. You can also trigger this programmatically — Microsoft released a Refresh SQL Analytics Endpoint Metadata REST API (preview as of mid-2025), and it’s also available through the semantic-link-labs Python package.

Remediation: Add a post-write validation step to your notebooks. After writing the Delta table, wait 30 seconds, then query the SQL analytics endpoint for the max timestamp or row count. If it doesn’t match what you wrote, log a warning and retry the sync. If after three retries it still diverges, fail the pipeline explicitly so your alerting catches it. Don’t let a successful Spark job mask a downstream data gap.
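That retry loop can be sketched with the endpoint read and the resync call abstracted behind callables. The names are illustrative, and the resync hook would invoke whatever refresh mechanism your setup exposes (the manual sync or the metadata refresh API mentioned above):

```python
import time

def verify_endpoint_sync(read_count, expected, resync=None,
                         retries=3, delay_s=30, sleep_fn=time.sleep):
    """Poll the SQL endpoint's row count until it matches what Spark wrote.
    read_count is a callable hitting the endpoint; resync, if given, triggers
    a metadata refresh between attempts. Raises after the retry budget so
    alerting catches the gap instead of a green pipeline hiding it."""
    for attempt in range(1, retries + 1):
        if read_count() == expected:
            return attempt
        if resync is not None:
            resync()
        sleep_fn(delay_s)
    raise RuntimeError(f"SQL endpoint count still != {expected} after {retries} checks")
```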

Failure mode 2: mirroring goes quiet

Mirroring is genuinely useful for getting external data into OneLake without building custom pipelines. But one common reliability pattern is that replication can stall when the source system throttles or times out, and the monitoring hub may still show “Running” while data freshness drifts.

This pattern is often observed with Azure SQL Database sources during heavy read periods. The mirroring process opens change tracking connections that compete with production queries. When the source database gets busy, it can throttle the mirroring connection, and Fabric retry logic may back off for extended periods without immediately surfacing a hard error.

Triage sequence:

  1. Check mirroring status in the monitoring hub, but prioritize the “Last synced” timestamp over the status icon. “Running” with a last-sync time of four hours ago still indicates a problem.
  2. Check the source database’s connection metrics. If you’re mirroring from Azure SQL, look at DTU consumption and connection counts around the time mirroring lag increased. There’s often a correlation with a batch job or reporting burst.
  3. Inspect table-level mirroring status. Individual tables can fall behind while others sync normally. The monitoring hub aggregates this, which can hide partial lag.

Remediation: The canary-row pattern is your early warning system. For prevention, stagger heavy source-database workloads away from mirroring windows. If your Azure SQL is Standard tier, increasing DTU capacity or moving to Hyperscale gives mirroring more room. On the Fabric side, stopping and restarting mirroring resets the connection and forces a re-sync when retry backoff has become too aggressive.

Failure mode 3: shortcut permissions drift

Shortcuts are the connective tissue of OneLake — references across lakehouses, workspaces, and external storage without copying data. They deliver huge flexibility, but they benefit from explicit permission and token hygiene.

A common failure pattern: a shortcut that worked for months suddenly returns 403 errors or empty results. Spark notebooks that read from the shortcut either fail with ADLS errors or complete with zero rows if downstream checks aren’t strict.

Root causes, ranked by observed frequency in the field:

  1. A workspace admin changed role assignments, and the identity the shortcut was created under lost access. Usually accidental.
  2. For ADLS Gen2 shortcuts: the SAS token expired, or storage account firewall rules changed.
  3. Cross-tenant shortcuts relying on Entra ID B2B guest access. If guest policy changes on either tenant, shortcuts can break without a prominent Fabric notification.

Triage sequence:

  1. Open the shortcut definition in the lakehouse — Fabric shows a warning icon on broken shortcuts, but only in the lakehouse explorer.
  2. Test the shortcut target independently. Can you access the target lakehouse or storage account directly with the same identity? If not, it’s a permissions issue.
  3. For ADLS shortcuts, check storage account access logs in Azure Monitor. Look for 403 responses from Fabric service IP ranges.

Remediation: Use service principals with dedicated Fabric permissions rather than user identities for shortcuts. Set up a token rotation calendar with 30-day overlap between old and new tokens so you’re never caught by a hard expiration. Then keep a daily shortcut health-check job that reads one row from each shortcut target and validates expected row counts.

Failure mode 4: capacity throttling disguised as five different problems

This one is tricky because it can look like unrelated issues at once. Spark jobs slow down. Metadata syncs lag. Mirroring falls behind. SQL endpoint queries time out. Power BI reports go stale. Troubleshoot each symptom in isolation and you’ll end up looping.

The common thread: your Fabric capacity hit its compute limits and started throttling. Fabric uses a bursting and smoothing model — you can temporarily exceed your purchased capacity units, but that overuse gets smoothed across future time windows. The system recovers by throttling subsequent operations. A heavy Spark job at 10 AM can degrade Power BI performance at 3 PM unless capacity planning accounts for that delayed impact.

Triage sequence:

  1. Open the capacity admin portal and look at the CU consumption graph. Sustained usage above 100% followed by throttling bands is your signal.
  2. Identify top CU consumers. Spark notebooks and materialization operations (Direct Lake refreshes, semantic model processing) tend to be the heaviest. Capacity metrics break this down by workload type.
  3. Check the throttling policy and current throttling state. Fabric throttles interactive workloads first when background usage exceeds limits — meaning end users feel pain from batch jobs they never see.

Remediation: Separate workloads by time window. Push heavy Spark processing to off-peak hours. If you can’t shift the schedule, split workloads across multiple capacities — batch on one, interactive analytics on another. Set CU consumption alerts at 80% of capacity so you get warning before throttling starts.

For bursty Spark demand, also evaluate Spark Autoscale Billing. In the current Fabric model, Autoscale Billing is opt-in per capacity and runs Spark on pay-as-you-go serverless compute, so Spark jobs don’t consume your fixed Fabric CU pool. That makes it a strong option for ad-hoc spikes or unpredictable processing windows where manual SKU up/down management is too slow.

If your workload is predictable, pre-scaling SKU windows (for example, F32 to F64 before a known processing block) can still be effective — just manage cost guardrails and rollback timing tightly.

Assembling the runbook

A playbook works only if it’s accessible and actionable when the alert fires at 2 AM. Here’s how to structure it:

Tier 1 — automated checks (every pipeline cycle):
– Post-write row count validation in every Spark notebook
– Canary row freshness for every mirrored source
– _delta_log timestamp scan across key tables

Tier 2 — daily health checks (scheduled monitoring job):
– Shortcut validation: read one row from every shortcut target
– Capacity CU trending: alert if 7-day rolling average exceeds 70%
– Mirroring table-level lag report (not just aggregate status)
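The capacity trending check in Tier 2 reduces to a rolling average over daily CU utilization. A sketch using the 70% threshold and 7-day window suggested above (how you pull the daily percentages out of Capacity Metrics is up to you):

```python
def cu_trend_alert(daily_cu_pct, threshold_pct=70.0, window_days=7):
    """True when the rolling average over the last window exceeds the threshold."""
    if len(daily_cu_pct) < window_days:
        return False  # not enough history to judge a trend
    recent = daily_cu_pct[-window_days:]
    return sum(recent) / window_days > threshold_pct
```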

Tier 3 — incident response (when alerts fire):
– Start with capacity metrics. If throttling is active, it’s often the shared root cause behind multi-symptom incidents.
– Check mirroring “Last synced” timestamps. Don’t rely on status icons alone.
– For Spark write issues, verify SQL endpoint sync state independently from the Delta table itself.
– For shortcut errors, test target identity access directly outside of Fabric.

Fabric gives you powerful primitives: Spark at scale, OneLake as a unified data layer, and mirroring that removes a lot of custom ingestion plumbing. With cross-service monitoring and a practical runbook, these patterns become manageable operational events instead of recurring surprises.

This post was written with help from anthropic/claude-opus-4-6

fabric-cicd Is Now Officially Supported — Here’s Your Production Deployment Checklist

Three days ago, Microsoft promoted fabric-cicd from community project to officially supported tool. That Python library your team has been running in a “we’re still figuring out our deployment process” sort of way now carries Microsoft’s name and their support commitment.

That shift matters in three concrete places. First, your compliance team can stop asking “is this thing even supported?” Second, you can open Microsoft support tickets when it breaks. Third, the roadmap is no longer a volunteer effort. Features will land faster. Bugs will get fixed on a schedule.

But here’s where most teams stall. They read the announcement, nod approvingly, and then do absolutely nothing different. The notebook still gets deployed by clicking sync in the browser. The lakehouse GUID is still hardcoded. The “production” workspace is still one bad merge away from serving yesterday’s dev code to the entire analytics team.

An announcement without an execution plan is just news. Let’s build the plan.

What Fabric-CICD Does (and Where It Stops)

Understand the boundaries before you reorganize your deployment story. fabric-cicd is a Python library. You give it a Git repository, a target workspace ID, and a list of item types. It reads the item definitions from the repo, resolves dependencies between them, applies parameter substitutions, and pushes everything to the workspace. It can also remove orphan items that exist in the workspace but no longer appear in your repo.

It supports 25 item types: Notebooks, SparkJobDefinitions, Environments, Lakehouses, DataPipelines, SemanticModels, Warehouses, and 18 others. Every deployment is a full deployment. No commit diffs, no incremental updates. The entire in-scope state gets pushed every time.

Where it stops: it won’t manage your Spark compute sizing, it won’t migrate lakehouse data between environments, and it won’t coordinate multi-workspace transactions atomically. Those gaps are yours to fill. That’s not a weakness. A tool that owns its scope and does it well beats one that covers everything and nails nothing.

Prerequisite Zero: Get Your Git House in Order

This is the part that takes longer than anyone budgets for.

fabric-cicd reads from a Git repository. If your Fabric workspace isn’t connected to one, the tool has nothing to deploy. And plenty of Spark teams are still running workspaces where notebooks were born in the browser, edited in the browser, and will die in the browser without ever touching version control.

Connect your workspace to Azure DevOps or GitHub through Fabric’s Git Integration. Every notebook, every Spark job definition, every environment configuration goes into source control. All of it.

If your repo currently contains items named notebook_v2_final_FINAL_USE_THIS_ONE — and honestly, most of us have been there — now’s the time to clean that up before automating. Automating a disorganized repo just moves the disorganization faster. Getting the foundation right first saves real time down the road.

Your target state when this prerequisite is done: a main branch that mirrors production, feature branches for development work, and a merge strategy the whole team agrees on. fabric-cicd reads from a directory on disk. What it reads needs to be coherent.

The Parameter File: The Single Most Important Artifact

The parameter.yml file is where fabric-cicd learns the difference between your dev environment and production. Without it, you’re deploying identical configurations everywhere, which means your production notebooks will happily point at your dev lakehouse.

For Spark teams, four categories of parameter entries matter:

Default Lakehouse IDs. Every notebook binds to a lakehouse by GUID. In dev, that GUID points to your sandbox with test data. In production, it points to the lakehouse with three months of curated, retention-managed data. The parameter file swaps those GUIDs at deploy time. Miss one, and your production job reads from a lakehouse that got wiped last Tuesday.

Default Lakehouse Workspace IDs. If your production lakehouse lives in a separate workspace from dev (and it should), this mapping covers that scope. Lakehouse GUIDs alone aren’t enough when workspaces differ between environments.

Connection strings. Any notebook that pulls from an external data source needs environment-specific connection details. Hardcoded connection strings are how you end up running your production Spark cluster against a dev SQL database. That kind of mismatch can get expensive quickly — and it’s entirely preventable with proper parameterization.

Notebook parameter cells. Fabric lets you define parameter cells in notebooks. Every value that changes between environments belongs there, referenced by parameter.yml. Not in a comment. Not in a variable halfway down the notebook. In the parameter cell, where the tooling can find it.

The mechanism is find-and-replace. fabric-cicd scans your repository files for specific strings and swaps in the values for the target environment. This means the GUIDs in your repo must be consistent. If someone manually edited a lakehouse ID through the browser after a sync, the parameter file won’t catch the mismatch. Deployments will succeed. The notebook will fail. Those are the worst kind of bugs: silent ones.
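As a sketch, a find-and-replace entry for one lakehouse GUID looks roughly like the following. Treat the exact schema as version-dependent and confirm it against the fabric-cicd documentation; the GUIDs and environment names here are invented:

```yaml
find_replace:
  # Dev lakehouse GUID exactly as it appears in notebook-content.py
  - find_value: "11111111-aaaa-4bbb-8ccc-222222222222"
    replace_value:
      TEST: "33333333-dddd-4eee-8fff-444444444444"
      PROD: "55555555-0000-4111-8222-666666666666"
```

The find_value must match the repo content byte-for-byte, which is exactly why manually edited GUIDs slip past it.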

Build Your Pipeline in Four Stages

Here’s a pipeline structure built for Spark teams, in the order things should execute:

Stage 1: Validate. Run your tests before anything deploys. If you have PySpark unit tests (even five of them), execute them against a local SparkSession or a lightweight Fabric environment. This catches broken imports, renamed functions, and bad type signatures. The goal isn’t 100% test coverage. The goal is catching the obvious failures before they reach a workspace anyone else depends on.

Stage 2: Build. Initialize the FabricWorkspace object with your target workspace ID, environment name, repository path, and scoped item types. For Spark teams, start with ["Notebook", "SparkJobDefinition", "Environment", "Lakehouse"]. Do not scope every item type on day one. Start with the items you deploy weekly. Expand scope after the first month, when you’ve seen how it behaves.

Stage 3: Deploy. Call publish_all_items(). The tool resolves dependency ordering, so if a notebook depends on a lakehouse that depends on an environment configuration, the sequence is handled. After publishing, call unpublish_all_orphan_items() to clean up workspace items that no longer appear in the repo. Skipping orphan cleanup means your workspace accumulates dead items that confuse the team and waste capacity.

Stage 4: Verify. This is the stage teams skip, and the one that saves them. After deployment, run a smoke test against the target workspace. Can the notebook open? Does it bind to the correct lakehouse? Can a lightweight execution complete without errors? A deployment that returns exit code zero but leaves notebooks pointing at a deleted lakehouse is not a successful deployment. Your pipeline shouldn’t treat it as one.

Guardrails Worth the Setup Cost

Guardrails turn a pipeline from a deployment mechanism into a safety net. These four are worth the setup time:

Approval gates. Require explicit human approval before any deployment to Production. fabric-cicd won’t enforce this for you. Wire it into your pipeline platform: Azure DevOps release gates, GitHub Actions environments with required reviewers. The first time a broken merge auto-deploys to production, you’ll wish you had spent the twenty minutes setting this up.

Service principal authentication. Run your pipeline under a service principal, not a user account. Give the principal workspace contributor access on the target workspace. Nothing more. When someone leaves the team or changes roles, deployments keep working because they never depended on that person’s credentials.

Tested rollback. Since fabric-cicd does full deployments from the repo, rollback means redeploying the last known-good commit. Conceptually clean. But “conceptually clean” doesn’t help you during an incident when stakeholders need answers fast. Test the rollback. Revert a deployment on a Tuesday afternoon when nothing is on fire. Confirm the workspace returns to its previous state. If you haven’t tested it, your rollback plan is still untested — and untested plans have a way of surprising you at the worst possible moment.

Deployment artifacts. Every pipeline run should log which items deployed, which parameters were substituted, and which orphans were removed. When production breaks and someone asks “what changed since yesterday?”, the answer should take thirty seconds, not three hours of comparing workspace states by hand.

Spark-Specific Problems Nobody Warns You About

General CI/CD guidance covers the broad strokes. Spark teams hit problems that live in the details:

Lakehouse bindings are buried in notebook content. The notebook-content.py file contains lakehouse and workspace GUIDs. If your parameter.yml misses even one of these, the production notebook opens to a “lakehouse not found” error. Audit every notebook, including the utility notebooks that other notebooks call with %run. Those hidden dependencies are where the bindings go wrong.

Environment items gate notebook execution. When your Spark notebooks depend on a custom Environment with specific Python libraries or Spark configuration properties, that Environment must exist in the target workspace before the notebooks arrive. The fabric-cicd dependency resolver handles this automatically, but only if Environment is in your item_type_in_scope. Scope just Notebook without Environment, and you’ll get clean deployments followed by runtime failures when the expected libraries don’t exist.

SparkJobDefinitions are not notebooks. SJDs carry executor counts, driver memory settings, reference files, and command-line arguments. All environment-specific values in these properties need coverage in your parameter file. Teams that parameterize their notebooks thoroughly and forget about their SJDs discover the gap when a production batch job runs with dev-sized executors and takes four times longer than expected.

Full deployment at scale needs scoping. Fifty notebooks deploy in minutes. Three hundred notebooks take longer and increase your blast radius. If your workspace has grown large, segment your repository by domain or narrow item_type_in_scope per pipeline to keep deployment times predictable and failures contained to a known set of items.

A Four-Week Migration Path

Starting from zero, here’s a timeline that’s aggressive but achievable:

Week 1: Git integration. Connect your workspace to source control. Rename items that need renaming. Agree on a branching strategy with the team. Write it down. Nothing deploys this week. This is foundation work, and skipping it makes everything after it harder.

Week 2: First deployment. Install fabric-cicd, write your initial parameter.yml, and run a deployment to a test workspace from the command line. Intentionally break the lakehouse binding in the parameter file. See what the error looks like. Fix it. Run it again. You want the team to recognize deployment failures before they encounter one under pressure.

Week 3: Pipeline construction. Build the CI/CD pipeline for Dev-to-Test promotion. Add approval gates, service principal auth, logging, and the verify stage. Run the pipeline ten times. Deliberately introduce a bad merge and watch the pipeline catch it. If it doesn’t catch it, fix the pipeline.

Week 4: Production extension. Extend the pipeline to include Production as a target. Add smoke tests. Test your rollback procedure. Write the runbook. Walk the team through it. Make sure at least two people can operate the pipeline without you in the room.

Four weeks. Not a quarter. Not a planning exercise that stalls in sprint three. A month of focused, methodical work that moves your Spark team from manual deployment to a process that runs the same way every time, whether it’s Tuesday at noon or Saturday at midnight.

The Real Takeaway

Microsoft giving fabric-cicd the official stamp means enterprise teams can stop hesitating. The library will get more attention, faster bug fixes, and broader item type support going forward.

But the tool is only half the story. A perfectly automated pipeline that deploys unparameterized notebooks to the wrong lakehouse is worse than manual deployment, because at least manual deployment forces someone to look at what they’re pushing. Automation works best when it’s built on a disciplined foundation — the checklist, the parameter file, the tested rollback, the verify stage.

Build the checklist. Work the checklist. Invest in the hard parts now, and they’ll pay you back in every deployment after.

This post was written with help from anthropic/claude-opus-4-6

The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

There’s a connector that ships with every Fabric Spark runtime. It’s pre-installed. It requires no setup. And it lets your Spark notebooks read from—and write to—Fabric Data Warehouse tables as naturally as they read Delta tables from a Lakehouse.

Most Fabric Spark users don’t know it exists. The ones who do often run into the same three or four surprises. Let’s fix both problems.

What the connector actually is

The Spark connector for Fabric Data Warehouse (synapsesql) is a built-in extension to the Spark DataFrame API. It uses the TDS protocol to talk directly to the SQL engine behind your Warehouse (or the SQL analytics endpoint of a Lakehouse). You get read and write access to Warehouse tables from PySpark, Scala Spark, or Spark SQL.

One line of code to read:

from com.microsoft.spark.fabric.Constants import Constants

df = spark.read.synapsesql("my_warehouse.dbo.sales_fact")


One line to write:

df.write.mode("append").synapsesql("my_warehouse.dbo.sales_fact")


No connection strings. No passwords. No JDBC driver management. Authentication flows through Microsoft Entra—same identity you’re logged into your Fabric workspace with. The connector resolves the SQL endpoint automatically based on workspace context.

That’s the happy path. Now let’s talk about what actually happens when you use it.

Reading: the part that mostly just works

Reading from a Warehouse table into a Spark DataFrame is the connector’s strength. The synapsesql() call supports the full three-part naming convention: warehouse_name.schema_name.table_or_view_name. It works for tables and views, including views with joins across schemas.

A few things that are genuinely useful:

Predicate pushdown works. When you chain .filter() or .limit() onto your DataFrame, the connector pushes those constraints to the SQL engine. You’re not pulling the full table into Spark memory and then filtering—the SQL engine handles the filter and sends back the subset. This matters when your Warehouse tables have hundreds of millions of rows and you only need a time-sliced sample.

df = spark.read.synapsesql("my_warehouse.dbo.sales_fact") \
    .filter("order_date >= '2026-01-01'") \
    .select("order_id", "customer_id", "amount")


Cross-workspace reads work. If your Warehouse lives in a different workspace than your notebook’s attached Lakehouse, you pass the workspace ID:

from com.microsoft.spark.fabric.Constants import Constants

df = spark.read \
    .option(Constants.WorkspaceId, "<target-workspace-id>") \
    .option(Constants.DatawarehouseId, "<warehouse-item-id>") \
    .synapsesql("my_warehouse.dbo.sales_fact")


This is genuinely powerful for hub-and-spoke architectures where your curated Warehouse sits in a production workspace and your data science notebooks live in a sandbox workspace.

Parallel reads are available. For large tables, you can partition the read across multiple Spark tasks, similar to spark.read.jdbc:

df = spark.read \
    .option("partitionColumn", "order_id") \
    .option("lowerBound", 1) \
    .option("upperBound", 10000000) \
    .option("numPartitions", 8) \
    .synapsesql("my_warehouse.dbo.sales_fact")


This splits the query into eight parallel reads, each fetching a range of order_id. Without this, you get a single-threaded read that will bottleneck on large tables.
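The lowerBound/upperBound/numPartitions options behave like Spark's JDBC-style partitioned reads: the key range is split into roughly equal strides, one range predicate per task. A simplified pure-Python sketch of the split (Spark's real JDBC split leaves the first and last partitions open-ended so out-of-range keys are still read; this sketch just covers the stated bounds):

```python
def partition_ranges(lower, upper, num_partitions):
    """Split [lower, upper] into num_partitions contiguous ranges of equal stride."""
    stride = (upper - lower) // num_partitions
    bounds = []
    for i in range(num_partitions):
        start = lower + i * stride
        # The last partition absorbs any remainder so the full range is covered
        end = upper if i == num_partitions - 1 else start + stride - 1
        bounds.append((start, end))
    return bounds

# Eight parallel reads over order_id 1..10,000,000, as in the example above
for start, end in partition_ranges(1, 10_000_000, 8):
    print(f"WHERE order_id BETWEEN {start} AND {end}")
```

Picking a partitionColumn that is roughly uniformly distributed matters as much as the partition count; a skewed column gives you eight tasks where one does most of the work.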

Security models pass through. If your Warehouse has object-level security (OLS), row-level security (RLS), or column-level security (CLS), those policies are enforced when Spark reads the data. Your notebook sees exactly what your identity is authorized to see. This is a meaningful difference from reading Delta files directly via OneLake, where security operates at the workspace or folder level.

Custom T-SQL queries work too. You’re not limited to reading tables—you can pass arbitrary T-SQL:

from com.microsoft.spark.fabric.Constants import Constants

df = spark.read \
    .option(Constants.DatabaseName, "my_warehouse") \
    .synapsesql("SELECT TOP 1000 * FROM dbo.sales_fact WHERE region = 'WEST'")


This is handy for complex aggregations or when you want the SQL engine to do the heavy lifting before data enters Spark.

Writing: the part with surprises

Write support for the Spark-to-Warehouse connector became generally available with Runtime 1.3. It works, and it solves a real architectural problem—but it has mechanics you need to understand.

How writes actually work under the hood

The connector uses a two-phase process:

  1. Stage: Spark writes your DataFrame to intermediate Parquet files in a staging location.
  2. Load: The connector issues a COPY INTO command, telling the Warehouse SQL engine to ingest from the staged files.

This is the same COPY INTO pattern that powers bulk ingestion into Fabric Data Warehouse generally. It’s optimized for throughput. It is not optimized for latency on small writes.

If you’re writing a DataFrame with 50 rows, the overhead of staging files and issuing COPY INTO means the write takes materially longer than you’d expect. For small, frequent writes, this connector is not the right tool. Use T-SQL INSERT statements through a SQL connection instead.

For batch writes of thousands to millions of rows, the connector performs well. The COPY INTO path is what the Warehouse was designed for.

Save modes

The connector supports four save modes:

  • errorifexists (default): Fails if the table already exists.
  • ignore: Silently skips the write if the table exists.
  • overwrite: Drops and recreates the table with new data.
  • append: Adds rows to the existing table.

df.write.mode("overwrite").synapsesql("my_warehouse.dbo.daily_aggregates")


A common pattern: Spark computes daily aggregations from Lakehouse Delta tables, then writes the results to a Warehouse table that Power BI reports connect to. The Warehouse’s result set caching (now generally available as of January 2026) means subsequent Power BI refreshes hit cache instead of recomputing.

The timestamp_ntz gotcha

This is the single most common error people hit when writing to a Warehouse from Spark.

If your DataFrame contains timestamp_ntz (timestamp without time zone) columns, the write will fail. Fabric Data Warehouse expects time-zone-aware timestamps. The fix is a cast before you write:

from pyspark.sql.functions import col

# Build the dtype lookup once, then cast any timestamp_ntz columns
# to time-zone-aware timestamps before the write
dtypes = dict(df.dtypes)
for c in df.columns:
    if dtypes[c] == "timestamp_ntz":
        df = df.withColumn(c, col(c).cast("timestamp"))

df.write.mode("append").synapsesql("my_warehouse.dbo.target_table")


This is not documented prominently enough. If you see a Py4JJavaError during write that mentions type conversion, timestamps are the first thing to check.

What you can’t write to

The connector writes to Warehouse tables only. You cannot write to the SQL analytics endpoint of a Lakehouse—it’s read-only. If you try, you’ll get an error. This seems obvious but trips people up because the same synapsesql() method handles both reads from Warehouses and Lakehouse SQL endpoints.

Private Link limitations

If Private Link is enabled at the workspace level, both read and write operations through the connector are unsupported. If Private Link is enabled at the tenant level only, writes are unsupported but reads still work. This is a significant limitation for security-conscious deployments. Check your network configuration before building pipelines that depend on this connector.

Time Travel is not supported

Fabric Data Warehouse now supports Time Travel queries. However, the Spark connector does not pass through Time Travel syntax. If you need to query a table as of a specific point in time, you’ll need to use a T-SQL connection directly rather than the synapsesql() method.

When to use Warehouse vs. Lakehouse as your serving layer

This is the architectural question that the connector’s existence forces you to answer. You’ve got data in your Lakehouse. Spark has transformed it. Now where does it go?

Use Lakehouse Delta tables when:

  • Your consumers are other Spark notebooks or Spark-based ML pipelines.
  • You need schema evolution flexibility (Delta’s schema merge).
  • You want to use OPTIMIZE, VACUUM, and Z-ORDER for table maintenance.
  • Your data scientists need direct file access through OneLake APIs.

Use Warehouse tables when:

  • Your primary consumers are Power BI reports or T-SQL analysts.
  • You need the Warehouse’s result set caching for repeated query patterns.
  • You need fine-grained security (RLS, CLS, OLS) that passes through to all consumers.
  • You want to use T-SQL stored procedures, views, and MERGE statements for downstream transformations.
  • You need cross-database queries that join Warehouse tables with Lakehouse tables or other Warehouse tables.

Use both when:

  • Spark processes and stores data in the Lakehouse (bronze → silver → gold medallion layers), then the connector writes final aggregations or serving tables to the Warehouse.
  • The Warehouse serves as the “last mile” between your data engineering work and your business intelligence layer.

The January 2026 GA of MERGE in Fabric Data Warehouse makes the “write to Warehouse” pattern significantly more useful. You can now do incremental upserts: Spark writes a staging table, then a T-SQL MERGE reconciles it with the target. This is a common pattern in data warehousing that was previously awkward in Fabric.
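A sketch of that staging-then-MERGE step. The Spark side would first overwrite a staging table (for example dbo.daily_revenue_staging, an invented name), and every table and column name here is illustrative:

```sql
-- Reconcile the Spark-written staging table into the serving table.
-- All table and column names are illustrative.
MERGE dbo.daily_revenue AS t
USING dbo.daily_revenue_staging AS s
    ON t.region = s.region AND t.order_date = s.order_date
WHEN MATCHED THEN
    UPDATE SET t.order_count   = s.order_count,
               t.total_revenue = s.total_revenue
WHEN NOT MATCHED THEN
    INSERT (region, order_date, order_count, total_revenue)
    VALUES (s.region, s.order_date, s.order_count, s.total_revenue);
```

The connector handles the bulk write; the MERGE runs as ordinary T-SQL in the Warehouse, from a stored procedure or a pipeline script activity.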

A concrete pattern: Spark ETL → Warehouse serving layer

Here’s the pattern I see working well in production:

from pyspark.sql.functions import col, count, sum as sum_  # sum_ avoids shadowing the builtin

# 1. Read from Lakehouse Delta tables (Spark native)
bronze = spark.read.format("delta").load("Tables/raw_orders")

# 2. Transform in Spark
silver = bronze.filter(col("status") != "cancelled") \
    .withColumn("order_date", col("order_ts").cast("date")) \
    .withColumn("amount_usd", col("amount") * col("fx_rate"))

gold = silver.groupBy("region", "order_date") \
    .agg(
        count("order_id").alias("order_count"),
        sum_("amount_usd").alias("total_revenue")
    )

# 3. Write to Warehouse for Power BI consumption
gold.write.mode("overwrite").synapsesql("analytics_warehouse.dbo.daily_revenue")


The Lakehouse owns the raw and transformed data. Spark does the heavy compute. The Warehouse serves the final tables to downstream consumers with T-SQL access, caching, and fine-grained security.

The alternative—writing gold tables to the Lakehouse and having Power BI connect via the SQL analytics endpoint—also works. But the SQL analytics endpoint has a metadata sync delay after Spark writes new data. The Warehouse table is immediately consistent after the COPY INTO completes. If your reporting needs to reflect the latest pipeline run without a sync lag, the Warehouse path is more reliable.

Cross-database queries: the glue between them

Once you have data in both a Lakehouse and a Warehouse in the same workspace, you can query across them using T-SQL cross-database queries from the Warehouse:

SELECT w.customer_id, w.total_revenue, l.customer_segment
FROM analytics_warehouse.dbo.daily_revenue AS w
JOIN my_lakehouse.dbo.customer_dim AS l
    ON w.customer_id = l.customer_id


This means your Warehouse doesn’t need to contain all the data. It can hold the curated aggregations while joining against dimension tables that live in the Lakehouse. No data movement. No duplication. The SQL engine resolves both sources through OneLake.

Performance notes from the field

A few observations from real workloads:

Reads are faster than you expect. The TDS protocol connection to the Warehouse SQL engine is efficient. For typical analytical queries returning thousands to low millions of rows, the synapsesql() read is competitive with reading Delta files directly, especially when the Warehouse has statistics and result set caching enabled.

Writes are slower than Lakehouse writes. The two-phase staging + COPY INTO process adds overhead versus a direct df.write.format("delta").save() to Lakehouse tables. For a DataFrame with 10 million rows, expect the Warehouse write to take 2-5x longer than an equivalent Lakehouse Delta write. This is the tradeoff for getting immediate T-SQL access with full Warehouse capabilities.

Use parallel reads for large tables. The default single-partition read will bottleneck. Set numPartitions to match your Spark cluster’s available cores for large reads. The performance improvement is often 4-8x.

Proactive and incremental statistics refresh. As of January 2026, Fabric Data Warehouse supports proactive statistics refresh and incremental statistics. This means the query optimizer keeps statistics up to date automatically. Your synapsesql() reads benefit from better query plans without manual UPDATE STATISTICS calls.

The honest summary

The Spark connector for Fabric Data Warehouse is a well-designed bridge between two systems that many teams use side by side. It makes the read path simple and the write path possible without leaving your Spark notebook.

It is not a replacement for writing to Lakehouse Delta tables. It is an additional output path for when your downstream consumers need T-SQL, fine-grained security, result set caching, or immediate consistency. Use it when the Warehouse is the right serving layer. Don’t use it when Lakehouse is sufficient.

The biggest wins come from combining both: Spark for compute, Lakehouse for storage, Warehouse for serving. The connector is the plumbing that makes that architecture work without data pipelines in between.

If you’re heading to FabCon Atlanta (March 16-20, 2026), both the Data Warehouse and Data Engineering teams will be there. It’s a good place to pressure-test your architecture and see what’s coming next.


This post was written with help from anthropic/claude-opus-4-6

Fabric Spark billing just got clearer. Here’s how to make the most of it.

Somewhere in a shared Teams channel, a Fabric capacity admin is looking at the Capacity Metrics app and noticing Spark consumption is down 15% overnight. Same notebooks. Same schedules. Same engineers shipping code with the same amount of caffeine.

A quick thread later, the answer is clear: nothing is wrong. Microsoft introduced new billing operations, and AI usage is now visible in its own category.

That’s not a cost increase. That’s better instrumentation.

What actually changed

On February 13, 2026, Microsoft announced two new billing operations for Fabric: AI Functions and AI Services.

Previously, AI-related usage in notebooks was grouped under Spark operations. Calls made through fabric.functions, Azure OpenAI REST API, the Python SDK, and SynapseML were all reported in Spark. Text Analytics and Azure AI Translator calls from notebooks were also reflected there.

Now those costs are separated:

  • AI Functions covers Fabric AI function calls and Azure OpenAI Service usage in notebooks and Dataflows Gen2.
  • AI Services covers Text Analytics and Azure AI Translator usage from notebooks.

Both are billed under the Copilot and AI Capacity Usage CU meter.

Important: consumption rates did not change. You pay the same for the same work. What changed is visibility.

Why this reporting update is a win for operators

If you’ve ever tried to explain Spark trends that include hidden AI consumption, this update helps immediately.

Picture an F64 capacity. You historically allocated 70% of CU budget to Spark because that’s what Capacity Metrics showed. But Spark previously included AI consumption, so the category was doing two jobs at once.

Now Spark and AI can each tell their own story. That’s useful for:

  • more accurate workload attribution
  • cleaner alerting by operation type
  • better planning conversations with finance and platform teams

In other words: same total spend, sharper signal.

The migration checklist

There’s nothing to deploy and no code changes required. The opportunity is operational: update your monitoring and planning so you can benefit from the new detail right away.

1. Audit your AI function usage

Before the new operations appear in your Metrics app, find AI calls in your codebase. Search notebooks for:

  • fabric.functions calls
  • Azure OpenAI REST API calls (look for /openai/deployments/)
  • openai Python SDK usage within Fabric notebooks
  • SynapseML OpenAI transformers
  • Text Analytics API calls
  • Azure AI Translator calls

If there are no hits, this billing split likely won’t affect your current workloads. If there are many hits (common in mature notebook estates), estimate volume now so your post-change analysis is faster.
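A quick way to do that first pass over exported notebook files. The patterns mirror the checklist above (the SynapseML namespace string is my assumption), and the directory path is a placeholder you would point at your repo export:

```python
from pathlib import Path

# Substrings that indicate AI-related calls, mirroring the checklist above.
# "synapse.ml.services" is an assumed namespace for SynapseML transformers.
AI_PATTERNS = [
    "fabric.functions",
    "/openai/deployments/",
    "openai",
    "synapse.ml.services",
    "textanalytics",
    "translator",
]

def scan_notebooks(root: str) -> dict[str, list[str]]:
    """Return {file: [matched patterns]} for exported .py/.ipynb notebook files."""
    hits: dict[str, list[str]] = {}
    for path in Path(root).rglob("*"):
        if path.suffix not in (".py", ".ipynb"):
            continue
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        matched = [p for p in AI_PATTERNS if p.lower() in text]
        if matched:
            hits[str(path)] = matched
    return hits
```

A simple substring scan will produce some false positives (a comment mentioning "translator" counts), but for sizing the estate that's fine; you want a working list, not a proof.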

2. Baseline your current Spark consumption

Export the last 30 days of Capacity Metrics data for Spark operations and save it.

This is your before-state. After rollout, validate that total consumption (Spark + new AI operations) aligns with historical Spark totals. If it aligns, you’ve confirmed a reporting change. If not, you have a clear starting point for investigation.

3. Adjust your alerting thresholds

If you monitor Spark CU consumption via Capacity Metrics, Azure Monitor, or custom API polling, update thresholds after the split.

Recommended approach:

  • take your current Spark threshold
  • subtract estimated AI consumption from step 1
  • set that as the revised Spark threshold
  • add a separate alert for the Copilot and AI meter

If AI estimates are still rough, start with a conservative threshold and tune after a few weeks of separated data.
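The threshold arithmetic above fits in a few lines. This is a sketch with made-up CU numbers; the headroom factor is my assumption for padding a rough AI estimate, not anything Microsoft prescribes:

```python
def revised_thresholds(spark_threshold_cu: float, estimated_ai_cu: float,
                       headroom: float = 1.2) -> tuple[float, float]:
    """Split one pre-change Spark alert threshold into a revised Spark threshold
    and a new Copilot-and-AI threshold.

    headroom pads the AI alert while the estimate from the usage audit is rough.
    """
    revised_spark = spark_threshold_cu - estimated_ai_cu
    ai_alert = estimated_ai_cu * headroom
    return revised_spark, ai_alert

# Example: a 10,000 CU Spark threshold, with ~1,500 CU estimated to be AI calls
spark_alert, ai_alert = revised_thresholds(10_000, 1_500)
```

Once a few weeks of separated data lands, replace the estimate with observed AI consumption and drop the headroom.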

4. Update your capacity planning models

Add a dedicated row for AI consumption in any spreadsheet, Power BI report, or planning document that allocates CU budget by operation type.

The Copilot and AI Capacity Usage CU meter already existed for Copilot scenarios, but this may be the first time many Spark-first teams see meaningful workload usage there. Adding it now makes future reviews easier.

5. Set up a validation window

Choose a date after March 17 (when the new operations start appearing) and compare pre/post totals:

  • pre-change: Spark total
  • post-change: Spark + AI Functions + AI Services

Expect close alignment (allowing for normal workload variation and rounding). If variance is more than a few percent, open a support ticket. Microsoft described this as a reporting-only change with no rate modifications.
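That alignment check is simple enough to automate against two Capacity Metrics exports. A minimal sketch; the CU figures are invented and the 3% tolerance is a placeholder for "more than a few percent" that you would tune to your workload's normal variation:

```python
def billing_split_variance(pre_spark_cu: float, post_spark_cu: float,
                           post_ai_functions_cu: float,
                           post_ai_services_cu: float) -> float:
    """Percent variance between the old Spark-only total and the new split total."""
    post_total = post_spark_cu + post_ai_functions_cu + post_ai_services_cu
    return (post_total - pre_spark_cu) / pre_spark_cu * 100

# Example with invented numbers: pre-change 10,000 CU of "Spark",
# post-change 8,400 Spark + 1,450 AI Functions + 200 AI Services
variance = billing_split_variance(10_000, 8_400, 1_450, 200)
needs_investigation = abs(variance) > 3  # tolerance is a placeholder
```

Run it over equivalent workload windows (same schedules, same code) so normal variation doesn't masquerade as a billing discrepancy.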

6. Share a quick team note before questions start

One short update prevents a lot of confusion:

“Microsoft is separating AI consumption from Spark billing into dedicated operations. Total cost is unchanged. Spark will appear lower, and Copilot and AI will appear higher. This improves visibility and tracking.”

That gives engineers context and helps finance teams interpret new categories correctly on day one.

Post-rollout checks that keep things clean

Consumption variance check. If post-change totals (Spark + AI Functions + AI Services) differ significantly from pre-change Spark trends, compare equivalent workload windows and rule out schedule, code, or capacity changes.

Expected operation visibility. If you confirmed AI usage in step 1 but AI Functions shows zero, check regional rollout timing from the Fabric blog before escalating.

Why separated AI spend is valuable

This platform-side categorization update gives teams a better lens on where capacity is being used.

Once AI usage is measurable independently, you can answer higher-quality questions:

  • Which AI workflows are creating the most value per CU?
  • Which calls are production-critical versus experimental leftovers?
  • Where should you optimize first for performance and cost?

That is exactly the kind of visibility mature platform teams want.

What this signals about Fabric billing

As Fabric workloads evolve, billing categories will continue to become more descriptive. That’s a good thing. Better category design means better operational decisions.

The admin in that Teams thread got clarity quickly: Spark wasn’t shrinking, observability was improving. Once the team updated dashboards and alerts, they had a more useful capacity model than they had the week before.

That’s the real upgrade here.


This post was written with help from anthropic/claude-opus-4-6