Lakehouse or Warehouse in Microsoft Fabric: Which One Should You Use?

In the world of data analytics, the choice between a data warehouse and a lakehouse can be a critical decision. Both have their strengths and are suited to different types of workloads. Microsoft Fabric, a comprehensive analytics solution, offers both options. This blog post will help you understand the differences between a lakehouse and a warehouse in Microsoft Fabric and guide you in making the right choice for your needs.

What is a Lakehouse in Microsoft Fabric?

A lakehouse in Microsoft Fabric is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location. It is a flexible and scalable solution that allows organizations to handle large volumes of data using a variety of tools and frameworks to process and analyze that data. It integrates with other data management and analytics tools to provide a comprehensive solution for data engineering and analytics.

The lakehouse creates a serving layer by auto-generating a SQL endpoint and a default dataset at creation time. This see-through functionality allows users to work directly on top of the Delta tables in the lake, providing a frictionless and performant experience all the way from data ingestion to reporting.

An important distinction from the warehouse is that this SQL endpoint is a read-only experience and doesn't support the full T-SQL surface area of a transactional data warehouse. Note also that only tables in Delta format are available through the SQL endpoint.
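
To make this concrete, here is a minimal PySpark sketch, run from a notebook attached to the lakehouse, of why the Delta format matters. The table and path names (orders, Files/raw/orders_csv) are hypothetical, not from the original post: the managed Delta table becomes visible through the SQL endpoint, while the same data written as CSV under Files does not.

```python
from pyspark.sql import SparkSession

# In a Fabric notebook the Spark session is pre-created; getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data -- table and column names are illustrative only.
orders = spark.createDataFrame(
    [(1, "2024-01-15", 250.0), (2, "2024-01-16", 99.5)],
    ["order_id", "order_date", "amount"],
)

# Saved as a managed table, this lands in the lakehouse Tables area in Delta format,
# so it is automatically visible (read-only) through the SQL endpoint.
orders.write.format("delta").mode("overwrite").saveAsTable("orders")

# Written as plain CSV under the Files area (relative path assumes a default
# lakehouse is attached), the same data is NOT exposed by the SQL endpoint.
orders.write.mode("overwrite").csv("Files/raw/orders_csv")
```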

Lakehouse vs Warehouse: A Decision Guide

When deciding between a lakehouse and a warehouse in Microsoft Fabric, there are several factors to consider:

  • Data Volume: Both lakehouses and warehouses can handle unlimited data volumes.
  • Type of Data: Lakehouses can handle unstructured, semi-structured, and structured data, while warehouses are best suited to structured data.
  • Developer Persona: Lakehouses are best suited to data engineers and data scientists, while warehouses are more suited to data warehouse developers and SQL engineers.
  • Developer Skill Set: Lakehouses require knowledge of Spark (Scala, PySpark, Spark SQL, R), while warehouses primarily require SQL skills.
  • Data Organization: Lakehouses organize data by folders and files, databases and tables, while warehouses use databases, schemas, and tables.
  • Read Operations: Both lakehouses and warehouses support Spark and T-SQL read operations.
  • Write Operations: Lakehouses use Spark (Scala, PySpark, Spark SQL, R) for write operations, while warehouses use T-SQL (see the sketch after this list).
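
To illustrate the write-path difference (the table and column names below are made up for the example), a lakehouse write goes through any Spark flavour, here Spark SQL from a PySpark notebook, while the warehouse equivalent would be plain T-SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-created in a Fabric notebook

# Lakehouse write path: any Spark flavour works; here Spark SQL via spark.sql().
spark.sql("""
    CREATE TABLE IF NOT EXISTS daily_sales (sale_date DATE, total DOUBLE) USING DELTA
""")
spark.sql("INSERT INTO daily_sales VALUES (DATE'2024-01-15', 1250.0)")

# Warehouse write path, shown only as a comment because it runs in the T-SQL
# engine rather than Spark:
#   INSERT INTO dbo.daily_sales (sale_date, total) VALUES ('2024-01-15', 1250.0);
```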

Conclusion

The choice between a lakehouse and a warehouse in Microsoft Fabric depends on your specific needs and circumstances. If you’re dealing with large volumes of unstructured or semi-structured data and have developers skilled in Spark, a lakehouse may be the best choice. On the other hand, if you’re primarily dealing with structured data and your developers are more comfortable with SQL, a warehouse might be more suitable.

Remember, with the flexibility Fabric offers, you can implement either a lakehouse or a data warehouse architecture, or combine the two to get the best of both with a straightforward implementation.

This blog post was created with help from ChatGPT Pro.

Data Engineering in Microsoft Fabric: An Overview

Data engineering plays a crucial role in the modern data-driven world. It involves designing, building, and maintaining infrastructures and systems that enable organizations to collect, store, process, and analyze large volumes of data. Microsoft Fabric, a comprehensive analytics solution, offers a robust platform for data engineering. This blog post will provide a detailed overview of data engineering in Microsoft Fabric.

What is Data Engineering in Microsoft Fabric?

Data engineering in Microsoft Fabric enables users to design, build, and maintain infrastructures and systems that allow their organizations to collect, store, process, and analyze large volumes of data. Microsoft Fabric provides various data engineering capabilities to ensure that your data is easily accessible, well organized, and of high quality.

From the data engineering homepage, users can perform a variety of tasks:

  • Create and manage your data using a lakehouse
  • Design pipelines to copy data into your lakehouse
  • Use Spark job definitions to submit batch or streaming jobs to Spark clusters
  • Use notebooks to write code for data ingestion, preparation, and transformation (see the example after this list)
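
As a rough sketch of that notebook workflow, assuming a pipeline has already copied raw CSV files into the lakehouse Files area (the path and column names are illustrative, not from the original post):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # already available in a Fabric notebook

# Ingest: read raw CSV files that a pipeline copied into the lakehouse Files area.
raw = spark.read.option("header", True).csv("Files/landing/customers/")

# Prepare: basic cleanup -- trim names, drop rows without an id, cast types.
clean = (
    raw.withColumn("customer_name", F.trim("customer_name"))
       .where(F.col("customer_id").isNotNull())
       .withColumn("customer_id", F.col("customer_id").cast("int"))
)

# Transform + load: write the curated result as a Delta table in the Tables area.
clean.write.format("delta").mode("overwrite").saveAsTable("customers")
```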

Lakehouse Architecture

Lakehouses are data architectures that allow organizations to store and manage structured and unstructured data in a single location. They use various tools and frameworks to process and analyze that data. This can include SQL-based queries and analytics, as well as machine learning and other advanced analytics techniques.
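
For instance, once tables exist in the lakehouse, a SQL-based analysis takes only a few lines in a notebook; the orders table below is purely illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A simple SQL-style aggregation over a hypothetical lakehouse Delta table.
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM   orders
    GROUP  BY customer_id
    ORDER  BY total_spend DESC
    LIMIT  10
""")
top_customers.show()
```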

Microsoft Fabric: An All-in-One Analytics Solution

Microsoft Fabric is an all-in-one analytics solution for enterprises that covers everything from data movement to data science, real-time analytics, and business intelligence. It offers a comprehensive suite of services, including data lake, data engineering, and data integration, all in one place.

Traditionally, organizations have been building modern data warehouses for their transactional and structured data analytics needs, and data lakehouses for their big data (semi-structured and unstructured) analytics needs. These two systems ran in parallel, creating silos, data duplication, and increased total cost of ownership.

Fabric, with its unified data store and standardization on the Delta Lake format, allows you to eliminate silos, remove data duplication, and drastically reduce total cost of ownership. With the flexibility Fabric offers, you can implement either a lakehouse or a data warehouse architecture, or combine the two to get the best of both with a straightforward implementation.

Data Engineering Capabilities in Microsoft Fabric

Fabric makes it quick and easy to connect to Azure Data Services, as well as other cloud-based platforms and on-premises data sources, for streamlined data ingestion. You can quickly build insights for your organization using more than 200 native connectors. These connectors are integrated into Fabric pipelines and pair with user-friendly, drag-and-drop data transformation in dataflows.

Fabric standardizes on the Delta Lake format, which means all the Fabric engines can access and manipulate the same dataset stored in OneLake without duplicating data. This storage system provides the flexibility to build lakehouses using a medallion architecture or a data mesh, depending on your organizational requirements. You can choose a low-code or no-code experience for data transformation using pipelines and dataflows, or a code-first experience using notebooks and Spark.
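
As a minimal sketch of the medallion pattern mentioned above, each layer is simply another set of Delta tables in OneLake that Spark refines step by step. The bronze/silver/gold table names are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw events landed as-is (for example by a pipeline) into a Delta table.
bronze = spark.read.table("bronze_events")

# Silver: deduplicated, typed, and filtered records.
silver = (
    bronze.dropDuplicates(["event_id"])
          .withColumn("event_ts", F.to_timestamp("event_ts"))
          .where(F.col("event_type").isNotNull())
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_events")

# Gold: business-level aggregates ready for reporting.
gold = silver.groupBy("event_type").agg(F.count("*").alias("event_count"))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_event_counts")
```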

Power BI can consume data from the lakehouse for reporting and visualization. Each lakehouse has a built-in TDS/SQL endpoint, making it easy for other reporting tools to connect to and query the lakehouse tables.
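
Beyond Power BI, any client that speaks TDS can query that endpoint. Below is a minimal sketch using pyodbc; the server address, lakehouse name, and table are placeholders, and the authentication mode will depend on your environment:

```python
import pyodbc

# The SQL endpoint connection string is shown on the lakehouse item in Fabric.
# The server and database values below are placeholders, not real endpoints.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-workspace.datawarehouse.fabric.microsoft.com;"
    "Database=YourLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM dbo.orders")  # read-only T-SQL over Delta tables
for row in cursor.fetchall():
    print(row)
conn.close()
```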

Conclusion

Microsoft Fabric is a powerful tool for data engineering, providing a comprehensive suite of services and capabilities for data collection, storage, processing, and analysis. Whether you’re looking to implement a lakehouse or data warehouse architecture, or a combination of both, Fabric offers the flexibility and functionality to meet your data engineering needs.

This blog post was created with help from ChatGPT Pro.