Operationalizing Fabric’s February 2026 feature drop: what actually matters for Spark teams

Microsoft’s monthly feature summaries have a familiar problem. They flatten every change into the same cheerful pitch. A new cell editor mode gets about the same oxygen as a moving security boundary. If you run Spark seriously on Fabric, that is useless. You need to know which items change architecture, which clean up the daily notebook grind, and which quietly add a new failure mode.

February’s release has all three. The headline is not “more features.” The headline is that Fabric keeps removing excuses for portal-driven, manually operated Spark environments. More of the platform can now be secured, composed, and managed through code. That is good news. It also means the easier Microsoft makes this, the more discipline you need on your side.

The change that actually alters architecture

CMK support for notebook code

This is the big one.

Fabric notebooks can now run inside CMK-enabled workspaces, with notebook content and associated notebook metadata encrypted at rest using customer-owned keys in Azure Key Vault. Microsoft is not vague about the coverage. The post calls out cell source, cell output, and cell attachments.

If your team has been splitting its development pattern because notebooks were the odd object out in a tighter security model, that split is no longer structurally required. Plenty of enterprises ended up with an awkward arrangement: secure workspaces for governed assets, then a side channel for notebook authoring and iteration. February closes that gap.

The payoff is boring in the best way. Fewer workarounds. Fewer places where permissions drift. Fewer security reviews where someone has to explain why the code path lives outside the workspace standard applied to everything else.

It also changes the migration conversation. Teams that avoided notebooks in regulated environments can revisit that decision. Teams already on notebooks can ask whether a separate architecture still buys them anything except paperwork.

The catch is operational, not conceptual. Keys rotate. Policies get tightened. When notebook content and metadata sit under the same CMK envelope, key management stops being an abstract security exercise and starts touching the authoring surface your engineers use every day. If you do not test rotation and recovery in a non-production workspace first, you are volunteering to learn in public.

The workflow fix Spark teams needed months ago

Python notebooks finally get %run

This was overdue.

PySpark notebooks had a workable modularity story. Python notebooks did not. If you wanted shared setup logic, common helper functions, or a standardized preamble, you either copied code between notebooks or invented a packaging scheme to compensate for a missing primitive.

Now Python notebooks support %run. You can reference and execute other notebooks in the same execution context, then directly use the functions and variables defined there. That is the difference between notebook code as a pile of local accidents and notebook code as something you can organize on purpose.

There is one limitation, and it matters: today %run in Python notebooks supports notebook items only. It does not yet run .py modules from the notebook resources folder. Microsoft says that support is coming soon. Fine. “Coming soon” is not an architecture. Build around notebook references now, and treat resource-folder module execution as a future upgrade if it arrives on time.

The immediate move for most teams is simple. Pull duplicated utility code into shared notebooks. Keep them small. Keep ownership clear. Do not turn %run into a dependency swamp where every notebook imports half the workspace and nobody can explain execution order without drawing a crime-scene diagram.
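A minimal sketch of that layout, with hypothetical notebook names: the helper notebook defines small, owned functions once, and any consumer pulls them into its execution context with a single %run cell.

```python
# --- notebook: nb_shared_setup (hypothetical name) -------------------
# Small, single-purpose helpers with one clear owner.

def standardize_columns(columns):
    """Lower-case and underscore column names for consistency."""
    return [c.strip().lower().replace(" ", "_") for c in columns]

# --- notebook: nb_daily_ingest ---------------------------------------
# Its first cell in the Fabric notebook UI would be:
#   %run nb_shared_setup
# After that cell runs, the helper is directly usable in later cells:

cleaned = standardize_columns([" Order ID", "Customer Name"])
```

Keep the shared notebook this boring. The moment it starts importing other shared notebooks, you are building the dependency swamp described above.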

Version history now tells you where a change came from

This sounds like a minor quality-of-life improvement until you have to debug a bad deployment before the second cup of coffee.

Fabric notebook version history now labels the source of each saved version. Direct edits in the notebook, Git synchronizations, deployment pipeline updates, and publishing via VS Code all show up as distinct origins. That one label removes a stupid amount of ambiguity.

Before this, the question “what changed?” was followed by the more annoying question “through which path?” In a serious CI/CD setup, that distinction is the whole investigation. A manual portal edit points you to one human. A Git sync points you to a repo change. A deployment pipeline update points you to release plumbing. VS Code publishing points you somewhere else again. Same broken notebook, different root cause.

If your team uses more than one of these paths, update the runbook. The first step in notebook incident triage should now be checking the version source before anyone starts diffing content like a raccoon digging through a dumpster.
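As a runbook artifact, that first step can be as small as a lookup from version source to first action. The source labels and routing targets below are illustrative conventions for your own runbook, not official Fabric values.

```python
# Illustrative triage table: version-source label -> first response.
# The four origins mirror the ones the release calls out; the label
# strings themselves are hypothetical, not an official API contract.
FIRST_RESPONSE = {
    "direct_edit": "ask the editing user what changed and why",
    "git_sync": "diff the linked commit in the repo",
    "deployment_pipeline": "inspect the release run that deployed it",
    "vscode_publish": "check which branch was published from VS Code",
}

def first_triage_step(version_source):
    """Route notebook incident triage by where the change came from."""
    return FIRST_RESPONSE.get(version_source, "unknown source: escalate")
```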

Full-size mode is small, but not trivial

Full-size mode lets a single notebook cell fill the workspace for editing. That is not glamorous. It is just useful.

Large SQL blocks, ugly transformation cells, and screenshared code reviews all get easier when the interface stops fighting you. Features like this never headline a press release, but they shave friction off work that happens every day. I would not redesign an architecture around it. I would absolutely use it.

The broader pattern hiding inside the release

Fabric is making Spark more reachable from both directions

Two February items matter together.

The new Microsoft ODBC Driver for Fabric Data Engineering gives external applications and ODBC-compatible tools a supported path into Spark SQL on Fabric. Microsoft describes it as ODBC 3.x compliant, backed by Livy APIs, and built for OneLake and Lakehouse data with Entra ID authentication, proxy support, session reuse, and Spark SQL coverage that looks designed for real workloads instead of demos.

Then there is Semantic Link 0.13.0. That release expands management coverage across lakehouses, reports, semantic models, SQL endpoints, and Spark. Microsoft is explicit about the direction: creating and managing lakehouses and tables, cloning and rebinding reports, refreshing and monitoring semantic models, and administering SQL and Spark settings from code.

Put those together and the platform’s direction is obvious. Fabric wants Spark environments that can be queried from outside and administered from inside code, without the portal as the center of the universe. That is the right direction. The portal is useful. The portal is not a control plane.

This is also where teams get themselves into trouble. The moment workspace operations become scriptable, governance stops being a policy deck and becomes a permissions design problem. If every engineer can programmatically create lakehouses, modify Spark settings, and rebind reports, then congratulations: you have built an accidental infrastructure platform. Maybe that is fine. Maybe it is a terrible idea. Decide before the scripts proliferate.

My bias is blunt. Treat Semantic Link as production infrastructure tooling, not as a convenience library. Set conventions early. Define who can do what. Log changes. Review the scripts that touch shared assets. Otherwise you will end up with beautiful automation and feral workspaces.
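One way to make "log changes" concrete is a thin audit wrapper around any script that mutates shared assets. The decorator below is a generic sketch, not part of Semantic Link; `create_lakehouse` is a hypothetical stand-in for whatever Semantic Link call your script wraps.

```python
import functools
from datetime import datetime, timezone

def audited(log):
    """Record every call to a workspace-mutating operation before it runs."""
    def wrap(operation):
        @functools.wraps(operation)
        def inner(*args, **kwargs):
            log.append({
                "op": operation.__name__,
                "args": repr(args),
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return operation(*args, **kwargs)
        return inner
    return wrap

audit_log = []

@audited(audit_log)
def create_lakehouse(name):   # hypothetical stand-in for a Semantic Link call
    return f"created {name}"

create_lakehouse("bronze_sales")
```

Shipping the log to real storage and reviewing it is the part that makes this governance rather than decoration.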

The quiet footgun in the admin section

Fabric identity limits now scale higher, but Fabric will not save you from bad math

Fabric has raised the default per-tenant limit for Fabric identities from 1,000 to 10,000. That is a real scale change, and for some organizations it removes an artificial ceiling that was starting to pinch.

It also lets admins set custom limits and manage them through the Update Tenant Setting REST API. Good. That is how this should work.

The problem is the warning Microsoft slips into the text: Fabric does not validate whether your custom limit fits within your Entra ID resource quota.

That means the setting feels authoritative while depending on an external quota boundary it does not enforce. In other words, the UI and API will happily let you declare ambition. Entra ID is the system that decides whether ambition has a permit.

So before anyone bumps the limit because “10,000 sounds better,” check the Entra side first. If you automate the setting, add that quota check to the automation. This is not exotic engineering. It is basic adult supervision.
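The automation guard is a few lines. Both callables below are placeholders for your real Entra ID and Fabric admin API calls, which this sketch deliberately does not name; the point is the ordering, quota check before write.

```python
def safe_set_identity_limit(requested, get_entra_quota, set_fabric_limit):
    """Refuse to raise the Fabric identity limit past the Entra quota
    that Fabric itself will not validate for you."""
    quota = get_entra_quota()
    if requested > quota:
        raise ValueError(
            f"requested limit {requested} exceeds Entra quota {quota}"
        )
    set_fabric_limit(requested)
    return requested
```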

What I would do this week

If you own Spark on Fabric, February’s release suggests a short, unromantic punch list.

  • Review whether CMK support lets you collapse any split workspace pattern built around notebook restrictions.
  • Start using %run in Python notebooks for shared helpers, but keep the dependency graph understandable.
  • Update notebook incident runbooks so version-source labels are part of first response.
  • Decide whether the ODBC driver and Semantic Link belong in your standard platform toolkit, then put guardrails around both before usage spreads.
  • Check Entra ID quotas before changing Fabric identity limits, especially if a script is going to do it for you.

That is the real shape of the month. A nicer notebook editor is fine. A new driver is nice. The deeper story is that Fabric keeps shifting Spark toward a model where security, reuse, and administration happen in code instead of in tribal knowledge and portal muscle memory. That is progress. It also means the teams that win will be the ones that pair new capability with restraint, because the platform is getting powerful enough to automate your mistakes at scale.

This post was written with help from anthropic/claude-opus-4-6

Operationalizing the semantic model permissions update for Fabric data agents

Permissions in data platforms have a remarkable talent for turning a two-minute job into a small municipal drama. You want one ordinary thing. The system hands you a form, a role, a workspace, another role, and, sooner or later, a person named Steve who is out until Thursday.

Starting April 6, 2026, Microsoft Fabric removes one of those little absurdities. Creators and consumers of Fabric data agents need only Read access on the semantic model to use it through a data agent. Workspace access is no longer required.

Small sentence. Large relief.

Why this matters

Fabric data agents use Azure OpenAI to interpret a user’s question, choose the most relevant source, and generate, validate, and execute the query needed to answer it. That source might be a lakehouse, warehouse, Power BI semantic model, KQL database, or ontology.

So the agent is already doing the interesting work. It is translating a human question into something a data system can answer. Requiring extra workspace access just to reach a semantic model added bureaucracy to the wrong layer.

The change, plainly

The official change is simple: beginning April 6, creators and consumers only need Read access on the semantic model to interact with it through a Fabric data agent. The old hurdle of workspace access plus Build permission disappears for this path.

If you have ever untangled access requests, you can probably hear the sigh from here.

What to do with that information

The first operational question is not “What new permission do I need?” It is “Which workspace grants exist only because the old rule forced them?”

Start there.

  • List the semantic models your data agents use.
  • Identify users or groups with workspace access granted only for those agent scenarios.
  • Test the new flow with a read-only user as April 6 approaches.
  • After the change lands, remove workspace access that no longer serves a separate purpose.

This is not glamorous work. Neither is plumbing, and everyone suddenly develops strong feelings about plumbing when it breaks.

The part people will miss

One detail matters more than the permission change itself. When a Fabric data agent generates DAX for a semantic model, it relies only on the model’s metadata and Prep for AI configuration. It ignores instructions added at the data agent level for DAX query generation.

That puts responsibility where it belongs: on the model.

If a business user asks a sensible question and gets a crooked answer, the fix is usually not a cleverer agent prompt. The fix is to improve what the model gives the agent to work with: the metadata and the Prep for AI setup.

That is the real operational shift. Access gets easier. Model preparation matters more.

A sensible rollout

If you own Fabric governance, keep the rollout dull and methodical.

  • Review which data agents rely on semantic models.
  • Retest those scenarios with users who have Read access on the model and no workspace access.
  • Inspect the models that produce weak DAX and improve the metadata and Prep for AI configuration they expose.
  • Clean up workspace permissions that were granted only to satisfy the old requirement.

Nobody frames that checklist and hangs it in the lobby. It still gets the job done.

The useful conclusion

The best part of this update is that it removes a fake dependency. A data agent that can answer questions from a semantic model should not require a side trip through workspace permissions.

The catch is that the agent still cannot invent a well-prepared model out of thin air. Fabric has made access lighter. It has also made the remaining truth easier to see: if you want better answers, the semantic model has to be ready for the job.

Which is, frankly, how this should have worked all along.

This post was written with help from anthropic/claude-opus-4-6

ExtractLabel just changed how your Spark pipelines should handle unstructured data

Every data engineer eventually inherits the same cursed pipeline.

Upstream sends you a blob of human text. Somewhere in that blob are the exact facts your downstream systems need: product name, issue category, requested resolution, timeline, who did what, and when. The facts are there. They are just buried in prose written by sleep-deprived humans, copied from emails, and occasionally typed from a phone in an airport parking lot.

For years, we handled this with a pile of hacks:

  • Regex that works until one user adds a comma
  • Hand-rolled NER that drifts quietly into uselessness
  • LLM prompts that return valid JSON on Monday, improv theater on Tuesday

Then we pretend this is fine by writing 300 lines of “normalization” code downstream, plus defensive checks, plus retry logic, plus enough if statements to make your future self hate your past self.

That is the old world.

ExtractLabel is the first Fabric AI Functions primitive that treats extraction like a contract instead of a vibe. You define the shape once in JSON Schema. The extraction step returns that shape. Your pipeline gets predictable structure instead of model improv.

If you run Spark workloads in Fabric, this matters immediately.

What AI Functions already gave you (and where it fell short)

Before ExtractLabel, the quick path looked like this:

df["text"].ai.extract("name", "profession", "city")


For exploration, that is great. For production, it is a trap.

Prototype extraction asks, “Can the model find useful fields?”
Production extraction asks, “Can every downstream consumer trust type, shape, and vocabulary every single run?”

Those are different questions.

The basic label call is lightweight and convenient, but it leaves the hardest part unsolved: schema discipline. If your routing logic expects one of four categories, free-form output creates entropy. If your analytics expect arrays and extraction returns comma-separated strings, you are writing cleanup code forever. If optional fields are not explicitly nullable, models tend to fill blanks with plausible nonsense.

The model understanding was never the bottleneck. Contract reliability was.

ExtractLabel: the schema contract your pipeline needs

ExtractLabel gives you an explicit schema boundary between unstructured input and structured output. In pandas you import from synapse.ml.aifunc; in PySpark you import from synapse.ml.spark.aifunc. The core pattern is the same: define one label with object properties, requirements, and constraints.

Concrete example, using warranty claims:

from synapse.ml.aifunc import ExtractLabel

claim_schema = ExtractLabel(
    label="claim",
    max_items=1,
    type="object",
    description="Extract structured warranty claim information",
    properties={
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "problem_category": {
                "type": "string",
                "enum": ["defect", "damage_in_transit", "missing_part", "other"],
                "description": "defect=stopped working or malfunctioning, damage_in_transit=arrived damaged, missing_part=something not included"
            },
            "problem_summary": {
                "type": "string",
                "description": "Max 20 words. Summarize the core issue."
            },
            "time_owned": {"type": ["string", "null"]},
            "troubleshooting_tried": {
                "type": "array",
                "items": {"type": "string"}
            },
            "requested_resolution": {
                "type": "string",
                "enum": ["replacement", "refund", "repair", "other"]
            }
        },
        "required": ["product_name", "problem_category", "problem_summary",
                     "time_owned", "troubleshooting_tried", "requested_resolution"],
        "additionalProperties": False
    }
)

df[["claim"]] = df["text"].ai.extract(claim_schema)


Input text:

“The smart thermostat stopped turning on after 12 days. I tried a reset and new batteries. Please replace it.”

Structured output:

{
    "product_name": "smart thermostat",
    "problem_category": "defect",
    "problem_summary": "Thermostat stopped turning on after 12 days",
    "time_owned": "12 days",
    "troubleshooting_tried": ["reset", "new batteries"],
    "requested_resolution": "replacement"
}


That is the difference: you are no longer extracting “some fields.” You are producing an object your systems can rely on.

The five schema features that actually matter

Most teams will over-focus on “LLM extraction” and under-focus on schema design. That is backwards. The model is only half the system. The schema is what makes it production-safe.

1) Nullable types

Use explicit nullable definitions for fields that may not exist in the source text:

"time_owned": {"type": ["string", "null"]}


If you do not allow null, the model is pressured to invent. Nullable fields reduce that pressure.

2) Enums for category control

When downstream logic expects bounded values, enforce them with enum.

That turns category assignment from fuzzy language output into controlled vocabulary. If your pipeline routes by problem_category, this is non-negotiable.

3) Arrays for true multi-value extraction

If a claim can include multiple troubleshooting actions, represent it as an array. Do not accept packed strings and split later.

Array semantics belong in extraction, not in cleanup jobs.
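The cleanup tax is easy to see in two lines, using values from the warranty example above: the packed-string anti-pattern forces every consumer to re-implement the split, while the array schema hands back the final shape.

```python
# Anti-pattern: a loose prompt returns a packed string, and every
# downstream consumer re-implements the split (plus trimming, plus
# the edge cases nobody remembers).
packed = "reset, new batteries"
unpacked = [part.strip() for part in packed.split(",")]

# Contract: the array schema returns this shape directly.
proper = ["reset", "new batteries"]
assert unpacked == proper  # the split you no longer have to own
```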

4) Descriptions as extraction instructions

Descriptions are not decorative comments. They are guidance for the extraction step.

Use them to define edge behavior, clarify enum intent, and enforce concise summaries. Most quality gains come from this field, not from prompt wording elsewhere.

5) Nested objects for real-world structure

Complex payloads are rarely flat. If your domain includes sub-entities, model them as nested objects now. Flattening everything into top-level strings feels easier in week one and becomes technical debt by week six.

What this means for your Spark pipelines right now

If your team already runs text extraction in Fabric pipelines, ExtractLabel gives you a clean migration path with immediate payback in reliability.

Practical rollout plan:

  1. Find the pain first. Audit extraction steps where downstream code spends time repairing output shape, casing, and categories. Those are your highest-ROI migrations.
  2. Version schemas like code. Store schema definitions in source control with explicit version tags. Treat schema changes as contract changes, not casual edits.
  3. Use one extraction contract per domain task. Do not build one giant universal schema. Warranty claims, support tickets, and contract clauses deserve separate schemas with domain-specific enums and guidance.
  4. Prefer model-based schema authoring as complexity grows. Once schemas get large, hand-editing JSON gets brittle. Define structures in typed Python models and generate JSON Schema from there. You get stronger review discipline and fewer silent mistakes.
  5. Build an evaluation harness before broad rollout. ExtractLabel enforces structure; it does not guarantee semantic correctness. Keep a labeled sample set, score extraction quality regularly, and review drift.
  6. Tune operational settings with real workload telemetry. Concurrency, retry behavior, and throughput limits should be validated in your environment, not assumed from defaults. Measure error columns and latency under realistic load before declaring victory.
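Item 4 can be started with the standard library alone. The sketch below derives a simplified JSON Schema from a typed dataclass; it covers only the annotation shapes the warranty example needs and is a starting point, not a full generator (libraries like pydantic do this properly).

```python
from dataclasses import dataclass
from typing import List, Optional, Union, get_args, get_origin, get_type_hints

@dataclass
class Claim:  # trimmed-down version of the warranty contract
    product_name: str
    time_owned: Optional[str]
    troubleshooting_tried: List[str]

_SIMPLE = {str: "string", int: "integer", float: "number", bool: "boolean"}

def field_schema(t):
    """Map a supported type annotation to a JSON Schema fragment."""
    if t in _SIMPLE:
        return {"type": _SIMPLE[t]}
    origin = get_origin(t)
    if origin is Union and type(None) in get_args(t):
        inner = next(a for a in get_args(t) if a is not type(None))
        return {"type": [field_schema(inner)["type"], "null"]}
    if origin is list:
        return {"type": "array", "items": field_schema(get_args(t)[0])}
    raise TypeError(f"unsupported annotation: {t}")

def to_json_schema(cls):
    """Generate the schema from the typed model, so code review diffs
    a Python class instead of hand-edited JSON."""
    hints = get_type_hints(cls)
    return {
        "type": "object",
        "properties": {name: field_schema(t) for name, t in hints.items()},
        "required": list(hints),
        "additionalProperties": False,
    }
```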

Verify runtime, capacity, and governance prerequisites against current Fabric documentation in your tenant before rollout. Platform details move. Your production runbooks should not rely on stale assumptions.

Migration risks worth thinking about

ExtractLabel is strong, but this is still LLM-powered extraction. You need grown-up operating discipline.

Model behavior drift

Even with stable schema shape, semantic interpretation can shift over time. A phrase that mapped to defect last month might map to other after a model update.

Mitigation: maintain a regression set and run periodic quality checks. Contract shape is necessary. Accuracy monitoring is mandatory.

Cost surprises at volume

Row-wise AI extraction scales linearly with data volume. Teams underestimate this, then panic when ingestion spikes.

Mitigation: test on representative daily volume, not a toy sample. Budget for peak days, not median days.

Schema evolution pain

You will add fields. You will split categories. You will regret one enum name. That is normal.

Mitigation: include schema version metadata in outputs and plan how downstream consumers handle mixed historical versions.
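Tagging is the easy half of that mitigation. The field name below is a convention of this sketch, not anything Fabric or ExtractLabel defines; what matters is that every output row carries it.

```python
SCHEMA_VERSION = "claim-v2"   # bump on every contract change

def tag_with_version(record, version=SCHEMA_VERSION):
    """Attach the extraction-contract version so downstream consumers
    can branch on mixed historical versions instead of guessing."""
    return {**record, "_schema_version": version}

tagged = tag_with_version({"product_name": "smart thermostat"})
```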

False confidence from “valid JSON”

Teams see valid typed output and stop questioning semantics. That is how bad extractions get into trusted dashboards.

Mitigation: sample manually, review periodically, and keep humans in the QA loop for high-impact fields.

When to use ExtractLabel vs. other approaches

Use ExtractLabel when all of these are true:

  • Input is unstructured text
  • Output must be typed and schema-conforming
  • You need extraction embedded in Fabric data workflows

Keep regex when the task is deterministic and mechanical (IDs, fixed-format dates, known token patterns).

Keep specialized NER pipelines when domain vocabulary is unusual, latency requirements are strict, or inference cost constraints are severe.

Use document-native extraction tools when layout matters (forms, scans, tables in images/PDFs). Text-column extraction will not recover geometry it never saw.

If your instinct is “we can just prompt harder,” stop. That is how you build a fragile system that passes demos and fails operations.

The bottom line

ExtractLabel moves Fabric extraction from improvisation to contracts.

The shiny part is one line of code:

df[["claim"]] = df["text"].ai.extract(claim_schema)

The valuable part is everything you encode in the schema: allowed values, nullability, nested structure, and descriptive guidance for edge cases.

Do that work once, and your downstream pipeline stops behaving like a cleanup crew.

Less duct tape, more reliable data.


This post was written with help from anthropic/claude-opus-4-6

What “Recent data” in Fabric means for Spark teams when time is the real bottleneck

At 8:07 a.m., nobody on a data engineering team is debating architecture purity. You’re trying to get back to the exact source you were fixing yesterday before another downstream notebook fails and somebody asks for an ETA.

That’s the problem Microsoft Fabric’s Recent data feature targets.

The feature landed in the February 2026 Fabric update and is currently in preview. It sounds small: Dataflow Gen2 remembers the specific items you used recently — tables, files, folders, databases, and sheets — and lets you load them directly into the editing canvas. For Spark-heavy teams, though, this is less of a UX tweak and more of a way to stop bleeding time in the first mile of work.

And yes, it’s still a preview feature. Treat it like a mountain route in unstable weather: useful, fast, and not something you trust blindly.

Why Spark teams should care about a Dataflow feature

A lot of Spark teams still frame Dataflow Gen2 as somebody else’s tool. That framing is outdated.

Dataflow Gen2 automatically creates staging Lakehouse and Warehouse items in your workspace. If your team’s workflow includes Dataflow-based ingestion and Spark-based transformation, the handoff between those steps is real. It’s your daily route.

Here’s the hard lesson: if your ingestion layer touches Dataflow Gen2, then UI friction inside Dataflow is your Spark team’s problem too.

What to do about it:

  • Write down your ingestion handoffs in plain language: source to Dataflow Gen2 to staging Lakehouse/Warehouse to Spark notebooks.
  • Mark where engineers repeatedly reconnect to the same sources. That’s where Recent data pays off first.

What Recent data changes under pressure

Recent data does one thing that matters: it remembers specific assets, not just abstract connections.

When you return to a fix, you’re not restarting the expedition from base camp. You get dropped closer to the problem. You can pull the item directly into the editing canvas and keep moving.

For teams, this changes the rhythm of incident response and iteration:

  • You get back to source-level corrections faster.
  • You reduce the chance that someone reconnects to the wrong similarly-named object while moving too fast.
  • You spend less team energy on navigation and more on data correctness.

None of this is glamorous. It’s also exactly where engineering throughput gets won.

Try this: during your next defect cycle, track one metric for a week — time from “issue found” to “source query/table reopened in Dataflow Gen2.” If that number drops after using Recent data, keep leaning in. If it doesn’t, your bottleneck is elsewhere.

What this feature doesn’t rescue you from

Teams love to over-credit new features. Recent data is a navigation accelerator. It’s not governance. It’s not validation. It’s not a replacement for naming discipline. And because it’s in preview, it’s not a foundation for critical operational assumptions.

If your source naming is chaotic, Recent data will surface chaos faster.

If your validation is weak, Recent data will help you ship mistakes sooner.

If your runbooks are vague, Recent data won’t magically teach new engineers what “correct” looks like.

Pair it with a minimum Spark validation pass after ingestion updates: schema check, null expectation, row-count sanity check. Keep this lightweight and repeatable. The point is fast feedback, not ceremony.
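A minimal, engine-agnostic sketch of that pass, written over plain row dicts so it stays easy to test; in a real pipeline the same three checks would run against the Spark DataFrame after ingestion.

```python
def validate_batch(rows, expected_columns, non_null_columns, min_rows):
    """Lightweight post-ingestion checks: schema, required nulls, row count.
    Returns a list of error strings; empty list means the batch passed."""
    errors = []
    if len(rows) < min_rows:
        errors.append(f"row count {len(rows)} below floor {min_rows}")
    if rows:
        missing = set(expected_columns) - set(rows[0].keys())
        if missing:
            errors.append(f"missing columns: {sorted(missing)}")
    for col in non_null_columns:
        nulls = sum(1 for row in rows if row.get(col) is None)
        if nulls:
            errors.append(f"{nulls} null(s) in required column {col!r}")
    return errors
```

Wire it so a non-empty error list blocks the downstream notebook instead of logging and continuing.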

Preview discipline: run this like a survival checklist

Because Recent data is in preview, your team should operate with explicit guardrails.

Test in development first. Don’t roll workflow assumptions into production muscle memory before your team has used the feature in real edits.

Keep a source-of-truth map. Recent data is convenience. Your documented source map is control. Keep both.

Standardize names now. If a human can confuse two source objects at a glance, they will. Fix names before speed amplifies mistakes.

Define a fallback path. If the recent list doesn’t have what you need, nobody should improvise. Document the manual reconnect path and keep it current.

Review preview behavior monthly. If the feature behavior shifts while in preview, your team should notice fast and adjust intentionally. Assign one owner for “preview watch” each month. Their job: test the core flow, confirm assumptions still hold, alert the team if anything drifts.

The operating model for Spark leads

If you lead a Spark data engineering team, the decision is straightforward.

Use Recent data. Absolutely use it. But use it like a rope, not like wings.

A rope gets you through rough terrain faster when the team is clipped in, communicating, and following route discipline. Wings are what people imagine they have right before they step into empty air.

In practice:

  • Adopt the feature for speed.
  • Keep your documentation for continuity.
  • Keep naming conventions strict for safety.
  • Keep Spark-side validation for quality.
  • Treat preview status as a real risk signal, not legal fine print.

That combination is where this feature becomes meaningful. Not because it’s flashy. Because it removes repeated friction at exactly the point where your team loses focus, burns time, and compounds small mistakes.

In data engineering, the catastrophic failures usually start as tiny oversights repeated at scale. Recent data removes one class of those oversights — the constant re-navigation tax — but only if you wrap it in disciplined operating habits.

One less avoidable stumble on steep ground, so your team can spend its strength on the parts of the climb that actually require judgment.


This post was written with help from anthropic/claude-opus-4-6

From CDC to Lakehouse: Making Fabric Eventstreams SQL Survive Contact with Production Spark

Every data team eventually has the same bright idea: “Let’s do CDC so everything is real time.”

What follows is usually less bright.

Somebody wires up connectors, somebody else stands up Kafka, somebody definitely provisions a VM that nobody can later identify, and before long your “modern architecture” has one person who understands it, one person who is afraid of it, and one person who is on call for it. Usually the same person.

So yes, Fabric Eventstreams supporting native CDC connectors for Azure SQL, PostgreSQL, MySQL, and SQL Server sources matters. It removes a lot of plumbing work that used to be mandatory. More importantly, Eventstreams SQL can give you a place to interpret CDC events before they hit your lakehouse and Spark jobs.

That changes the shape of the problem. Not the existence of the problem. Just the shape.

And if you want this to run cleanly at 2:00 AM, the operational details matter more than the architecture diagram.

What Eventstreams SQL actually fixes

Raw CDC events are not analyst-friendly data. They are little envelopes full of intent and drama: insert, update, delete, before image, after image, metadata about the source transaction, and enough ambiguity to start arguments in code review.

If you ship those raw events downstream, every Spark notebook has to interpret them. That means duplicate merge logic and subtle differences between implementations. Two teams can read the same feed and produce slightly different answers. That is how trust in a data platform dies quietly.

Eventstreams SQL can resolve some of those semantics earlier. You can translate event-level changes into cleaner, ready-to-consume records before data lands in destinations.

That can be useful, but it is also where teams start sneaking business logic into the stream layer and then regretting it later.

The bigger question is not just where true merge logic belongs. It is where CDC interpretation belongs at all.

The merge logic decision you cannot avoid

You have two broad options:

  1. Push CDC interpretation upstream into Eventstreams SQL before landing.
  2. Treat Eventstream primarily as a transport layer, land raw or minimally altered CDC into staging, and resolve table semantics in the target engine.

I think option 2 is the better default.

Why? Because once you start doing meaningful CDC interpretation in the stream layer, you now have business logic living in the place that is hardest to reason about, hardest to test, and easiest to forget. You also make it much easier for different downstream systems to drift away from each other.

A cleaner pattern is:

  • use Eventstream for ingestion, routing, and maybe very light filtering
  • land into a staging layer
  • let the target system own merge semantics

That means Azure SQL should own MERGE logic for Azure SQL targets. Lakehouse targets should use Spark or Delta MERGE INTO. The compute engine that owns the table should own the table semantics too.
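
A minimal sketch of target-owned merge semantics for a lakehouse table. All table, key, and column names here (staging_orders_cdc, curated.orders, order_id, amount, status, op, commit_ts) are hypothetical placeholders for your own CDC staging schema:

```python
# Sketch: the target engine (Spark/Delta) owns the MERGE, not the stream
# layer. Table and column names are illustrative only.

def build_cdc_merge_sql(target: str, staging: str) -> str:
    """Build a Delta MERGE that applies only the newest CDC event per key."""
    return f"""
    MERGE INTO {target} AS t
    USING (
      -- keep only the latest event per key so replayed batches stay idempotent
      SELECT order_id, amount, status, op FROM (
        SELECT *, ROW_NUMBER() OVER (
          PARTITION BY order_id ORDER BY commit_ts DESC) AS rn
        FROM {staging}
      ) WHERE rn = 1
    ) AS s
    ON t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.status = s.status
    WHEN NOT MATCHED AND s.op != 'delete' THEN
      INSERT (order_id, amount, status) VALUES (s.order_id, s.amount, s.status)
    """

def apply_cdc_merge(spark):
    # The engine that owns the table owns the table semantics.
    spark.sql(build_cdc_merge_sql("curated.orders", "staging_orders_cdc"))
```

The ROW_NUMBER deduplication is the part teams most often skip, and it is what keeps a re-delivered micro-batch from reordering updates.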

Trying to make the stream layer do more than that is how teams end up with hidden logic, debugging hell, and architecture diagrams that look cleaner than the actual system.

One important caveat: Eventstreams SQL is not a substitute for Delta MERGE INTO on a Lakehouse table.

Checkpoints: boring, critical, and often broken by accident

Spark Structured Streaming checkpointing is one of those things everybody “knows” until a restart fails and nobody remembers how it works.

Checkpoint locations track stream progress. They are state, not decoration. They are tied to your query plan, and when you change schema or query structure, old checkpoint state may no longer be valid.

This is not an edge case. It is normal lifecycle behavior in evolving pipelines.

Three rules keep you out of trouble:

  • Use distinct checkpoint paths per stream and per target table.
  • Version checkpoint paths when query shape or schema changes.
  • Watch lag between source offsets and committed checkpoint progress.

If you use one checkpoint path for multiple sinks, you are building future pain on purpose. If you change query shape without checkpoint versioning, restart failures are only a matter of timing.

Treat checkpoint migration as a cutover process. Track where old progress stopped, cut to a new checkpoint path intentionally, then retire the previous one once the new job is stable.
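
The three rules above can be encoded as a tiny convention. The base path and naming scheme here are an assumption, not a Fabric requirement; the point is one checkpoint per stream-and-target, with an explicit version to bump on query or schema changes:

```python
# Sketch: versioned, per-target checkpoint paths. CHECKPOINT_ROOT and the
# path layout are hypothetical conventions, not platform defaults.

CHECKPOINT_ROOT = "Files/checkpoints"

def checkpoint_path(table: str, version: int) -> str:
    """One checkpoint per stream+target; bump version when query shape changes."""
    return f"{CHECKPOINT_ROOT}/{table}/v{version}"

def start_cdc_stream(spark, source_path: str, table: str, version: int):
    # Never share one checkpoint path across sinks: each sink owns its state.
    return (
        spark.readStream.format("delta").load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path(table, version))
        .toTable(table)
    )
```

Cutover then becomes mechanical: stop the job on v1, start it on v2, retire the v1 path once the new job is stable.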

The small files problem is not glamorous, but it will hurt you

Most CDC pipelines do not fail dramatically. They fail by becoming slower each week until everyone pretends that 90 seconds is “pretty fast.”

Small files are often the culprit.

CDC streams produce frequent, small increments. Structured Streaming writes micro-batches. Direct lakehouse writes can also produce many tiny files depending on event cadence. Over time, table reads pay the cost in file listing and metadata overhead.

People love to ignore this because compaction feels like janitorial work. It is not. It is core performance engineering.

What works in practice:

  • Repartition before write based on available Spark cores.
  • Partition on-disk by ingestion date, and only add other partition keys when query patterns justify it.
  • Do not partition by operation type. That creates tiny partitions and extra noise.
  • Run regular OPTIMIZE jobs on high-volume CDC tables.

If you are writing through Spark, control file behavior with repartitioning and trigger cadence. Setting trigger(processingTime='30 seconds') or trigger(processingTime='2 minutes') can reduce file explosion compared with ultra-frequent micro-batches.
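
One way to put both levers together. The 256 MB target, paths, and table names are illustrative starting points, not Fabric guidance:

```python
# Sketch: size output files from estimated batch volume, then write with a
# slower micro-batch trigger. All names and thresholds are placeholders.
import math

def target_partitions(batch_bytes: int, target_file_mb: int = 256) -> int:
    """Aim for roughly one output file per target_file_mb of data."""
    return max(1, math.ceil(batch_bytes / (target_file_mb * 1024 * 1024)))

def start_compacting_stream(spark, source_path: str, table: str, est_batch_bytes: int):
    n = target_partitions(est_batch_bytes)

    def write_batch(df, _batch_id):
        # Repartition before write so each micro-batch lands as ~n files.
        df.repartition(n).write.format("delta").mode("append").saveAsTable(table)

    return (
        spark.readStream.format("delta").load(source_path)
        .writeStream
        .trigger(processingTime="2 minutes")  # fewer, larger micro-batches
        .foreachBatch(write_batch)
        .option("checkpointLocation", f"Files/checkpoints/{table}/v1")
        .start()
    )
```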

If you are using direct Eventstreams-to-Lakehouse writes, accept that you are trading simplicity for less control and schedule compaction accordingly.

The exact maintenance workflow matters less than having one. One-off cleanup is fine when you are exploring, but scheduled maintenance is what keeps tables healthy over time.

Deletes: decide your philosophy before compliance decides for you

In CDC, inserts and updates are easy to reason about. Deletes are where architecture gets emotional.

For analytics, soft deletes are often the sane default: keep the row, mark is_deleted, add deleted_at, preserve history. This keeps downstream trend analysis and audit trails intact.

Hard deletes are different. If compliance requires physical removal, handle that intentionally, usually with batch logic that applies delete events against target Delta tables after landing.

A reliable pattern is:

  1. Stream all CDC events, including deletes, into staging.
  2. Run scheduled jobs that apply physical deletion rules to curated tables.

That keeps streaming simple and pushes irreversible operations into auditable, controllable execution windows.
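
A sketch of that split, with soft-delete flags computed in the streaming path and physical deletion left to a scheduled batch. Column and table names (op, is_deleted, deleted_at, order_id, the staging and curated tables) are hypothetical:

```python
# Sketch: soft deletes inline, hard deletes as an auditable batch job.
# All names are illustrative placeholders.

def soft_delete_columns(op: str, event_time: str) -> dict:
    """Translate a CDC op code into soft-delete flag values for the row."""
    is_deleted = (op == "delete")
    return {"is_deleted": is_deleted, "deleted_at": event_time if is_deleted else None}

def apply_hard_deletes(spark, target="curated.orders", staging="staging_orders_cdc"):
    # Scheduled job: physically remove only what compliance says must go.
    spark.sql(f"""
        DELETE FROM {target}
        WHERE order_id IN (SELECT order_id FROM {staging} WHERE op = 'delete')
    """)
```

Running apply_hard_deletes on a schedule, rather than inline, is what makes the irreversible part reviewable and re-runnable.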

Could you do something fancier? Probably. Should you, before you need to? Probably not.

Monitoring: minimum viable or maximum regret

A CDC pipeline with no alerting is just a suspense novel written in production.

Your baseline should cover four things:

  • Stream health: is each Structured Streaming query active or terminated?
  • Processing lag: how far are committed offsets behind source offsets?
  • File accumulation: are table file counts growing faster than compaction can handle?
  • Source silence: are you receiving events at all from CDC sources?

That last one matters because “no errors” does not mean “healthy.” If CDC gets disabled during maintenance, your pipeline can fail by producing nothing, which looks calm unless you explicitly monitor for inactivity windows.
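
A minimal health check along these lines, with silence detection included. The 30-minute silence window is a placeholder to tune against your SLA:

```python
# Sketch: evaluate stream health, including the "no errors but no events"
# case. Threshold values are illustrative, not recommendations.
from datetime import datetime, timedelta, timezone

def stream_alerts(is_active: bool, last_event_ts, max_silence_min: int = 30):
    """Return alert strings; an empty list means 'healthy enough'."""
    alerts = []
    if not is_active:
        alerts.append("stream terminated")
    now = datetime.now(timezone.utc)
    if last_event_ts is None or now - last_event_ts > timedelta(minutes=max_silence_min):
        # "No errors" is not "healthy": catch sources that went quiet.
        alerts.append("no events within silence window")
    return alerts

def check_streams(spark):
    # Survey every active Structured Streaming query on the session.
    results = {}
    for q in spark.streams.active:
        progress = q.lastProgress or {}
        results[q.name] = {
            "active": q.isActive,
            "last_batch_rows": progress.get("numInputRows", 0),
        }
    return results
```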

Fabric Activator-based alerts can be useful for surfacing threshold breaches. Tie thresholds to actual SLAs, not vibes.

A practical starting playbook

If you are standing this up now, keep it simple:

  1. Enable CDC at the source (sys.sp_cdc_enable_db and sys.sp_cdc_enable_table where applicable).
  2. Validate flow end to end with one real table before scaling breadth.
  3. Segment tables early: simple merge logic in Eventstreams SQL, complex logic in Spark.
  4. Define checkpoint path standards before the first production deploy.
  5. Pick trigger intervals that balance latency with file quality.
  6. Schedule OPTIMIZE from day one, not after performance complaints.
  7. Document merge ownership per table so changes do not become archaeology.

None of this is exotic. That is exactly the point.

Good CDC architecture is usually not a story about cleverness. It is a story about disciplined boring decisions made early, then repeated consistently.

Final take

Fabric Eventstreams plus Spark can give teams a cleaner CDC path than the old connector-plus-consumer patchwork. Native CDC connectors can reduce integration grind. But I would still keep meaningful CDC interpretation and merge behavior in the target compute engine whenever possible. Spark Structured Streaming remains a practical choice for controlled writes and advanced merge behavior.

But the real success criteria are operational.

If you manage checkpoints like real state, control file growth before it controls you, choose a deliberate delete strategy, and wire up monitoring that catches silence as well as failure, this architecture can work well in production.

If you skip those details, it still works right up until the exact moment it doesn’t, which usually happens late, loud, and at the least convenient hour in human history.

That is less a Fabric problem than a production engineering problem. Fabric can simplify parts of the workflow, but it does not remove the need for operational discipline.


This post was written with help from anthropic/claude-opus-4-6

Fabric Spark’s Native Execution Engine: What Speeds Up, What Falls Back, and What to Watch

The Production Migration Checklist for Fabric's Native Execution Engine

You have been running Spark on the JVM for years. It works. Your pipelines finish before the SLA alarm fires, your data scientists get their DataFrames, and you have learned to live with the garbage collector the way one learns to coexist with a roommate who occasionally rearranges all the furniture at 3 AM.

Then Microsoft shipped the Native Execution Engine for Fabric Spark, and the pitch is seductive: swap the JVM’s row-at-a-time processing for a vectorized C++ execution layer built on Meta’s Velox and Apache Gluten, get up to 6x faster query performance on compute-heavy workloads, change zero lines of code, pay nothing extra. Microsoft’s TPC-DS benchmarks at 1 TB scale show roughly 4x improvement over vanilla open-source Spark. Internal Fabric workloads have hit 6x.

Those are real numbers. But “flip the switch and go faster” is a marketing sentence, not an engineering plan. What follows is the checklist your team needs to move production Spark workloads onto the Native Execution Engine without discovering exciting new failure modes at 2 AM on a Tuesday.

Prerequisite Zero: Understand What You Are Opting Into

The Native Execution Engine does not replace Spark. It replaces Spark’s JVM-based physical execution operators — the actual computation — with native C++ equivalents for supported operations. Everything above the physical plan remains untouched: SQL parsing, logical optimization, cost-based rewrites, adaptive query execution, predicate pushdown, column pruning. None of that moves.

Here is the handoff in concrete terms. Spark produces its optimized physical plan as it always has. Apache Gluten intercepts that plan, identifies which operators have native C++ implementations in Velox, and swaps those nodes out. Velox executes them using columnar batches and SIMD instructions, processing 8, 16, or 32 values per CPU instruction instead of iterating row by row through JVM objects.

For operators Velox does not yet support, the engine falls back to standard Spark execution. The transition at the native/JVM boundary requires columnar-to-row and row-to-columnar conversions. These conversions cost real time. A workload that triggers frequent fallbacks can run slower with the engine enabled than without it.

That last sentence matters more than the benchmark numbers. The Native Execution Engine is a selective replacement of physical operators, not a uniform accelerator. Your performance outcome depends on how much of your workload stays in native territory.

Step 1: Confirm You Are on Runtime 1.3

The engine requires Fabric Runtime 1.3 (Apache Spark 3.5, Delta Lake 3.2). Runtime 1.2 support has been discontinued — and here is the dangerous part — silently. If you are still on 1.2, native acceleration is disabled without warning. You will not get an error. You will get no speedup. You will blame the engine rather than your runtime version. Check this first.

Action items:
– Open each Fabric workspace running production Spark workloads
– Navigate to Workspace Settings → Data Engineering/Science → Spark Settings
– Confirm Runtime 1.3 is selected
– If you are on Runtime 1.2, plan the runtime upgrade as a separate migration with its own validation cycle. Spark 3.4 to 3.5 brings behavioral changes unrelated to the native engine, and you do not want to debug two migrations at once

Step 2: Audit Your Workloads

Not every job benefits equally. The engine does its best work on compute-intensive analytical queries — aggregations, joins, filters, projections, complex expressions — over Parquet and Delta data. It adds less to I/O-bound workloads or jobs dominated by Python UDFs that run outside the Spark execution engine entirely.

Build a four-tier inventory:

  • Tier 1 — High-value candidates: Long-running batch ETL with heavy aggregations and joins over Delta tables. These are your biggest CU consumers and your biggest potential beneficiaries. Think: the nightly pipeline that computes vendor aggregates across three years of transaction data, currently consuming 45 minutes of a large cluster.
  • Tier 2 — Likely beneficiaries: Interactive notebooks running analytical queries. Data science feature engineering pipelines that stack transformations before model training.
  • Tier 3 — Uncertain: Workloads using exotic operators, deeply nested struct types, or heavy UDF logic. These need individual testing because you cannot predict fallback behavior from the code alone.
  • Tier 4 — Skip for now: Streaming workloads, jobs dominated by external API calls, or workloads where Python UDF processing accounts for most of the wall-clock time.

Migrate Tier 1 first. You need evidence that the engine delivers measurable wins on your actual workloads before you spend political capital rolling it out everywhere.

Step 3: Create a Non-Production Test Environment

Do not enable the engine on production and hope. Create a dedicated Fabric environment:

  1. In the Fabric portal, create a new Environment item
  2. Navigate to the Acceleration tab
  3. Check Enable native execution engine
  4. Save and Publish

Attach this environment to a non-production workspace. Run your Tier 1 workloads against it using production-scale data. This matters: performance characteristics at 10 GB do not predict behavior at 10 TB, because operator fallback patterns depend on data distributions, not just query structure.

For quick per-notebook testing without a full environment, drop this in your first cell:

%%configure
{
  "conf": {
    "spark.native.enabled": "true"
  }
}


This takes effect immediately — no session restart required — which makes A/B comparisons trivial.

Step 4: Measure Baselines

You cannot prove improvement without a baseline. For each Tier 1 workload, capture:

  • Wall-clock duration from the Spark UI (total job time, not stage time — stage time ignores scheduling and shuffle overhead)
  • CU consumption from Fabric monitoring (this is what your budget cares about)
  • Spark Advisor warnings in the current state, so you can distinguish new warnings from pre-existing noise after enabling native execution
  • Row counts and checksums on output tables — correctness verification requires a pre-migration snapshot

Store these in a Delta table. You will reference them for weeks.
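
A sketch of the capture step. The table name and metric fields are illustrative; the point is a durable pre-migration record, not this exact schema:

```python
# Sketch: record per-workload baselines in a Delta table before enabling
# native execution. Field names and ops.nee_baselines are hypothetical.
from datetime import datetime, timezone

def baseline_row(workload: str, wall_clock_s: float, cu_used: float,
                 row_count: int, checksum: str) -> dict:
    return {
        "workload": workload,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "wall_clock_s": wall_clock_s,   # total job time from the Spark UI
        "cu_used": cu_used,             # from Fabric capacity monitoring
        "row_count": row_count,
        "output_checksum": checksum,    # pre-migration correctness snapshot
    }

def save_baselines(spark, rows, table="ops.nee_baselines"):
    spark.createDataFrame(rows).write.format("delta").mode("append").saveAsTable(table)
```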

Step 5: Run Native and Watch for Fallbacks

Enable the engine on your test environment and run each Tier 1 workload. Then check two things.

Performance delta: Compare wall-clock time and CU consumption against your baselines. On a genuinely compute-heavy workload, you should see at least 1.5x improvement. If you do not, something is triggering fallbacks and you are paying the columnar-to-row conversion tax without getting the native execution benefit.

Fallback alerts: The Spark Advisor now reports real-time warnings during notebook execution when operators fall back from native to JVM execution. Each alert names the specific operator that could not run natively.

The most common fallback trigger, and the most easily fixed: .show(). This call invokes collectLimit and toprettystring, neither of which has a native implementation. Replace .show() with .collect() or .toPandas() in production code. In a notebook cell you run manually for debugging, it does not matter — but inside a scheduled pipeline, every fallback is a boundary crossing.

Other triggers to watch: unsupported expression types, complex nested struct operations, and certain window function variants. For each one, ask three questions:

  1. Can I rewrite the query to avoid it? Sometimes this is a one-line change. Sometimes it means restructuring a transformation.
  2. Is the fallback on a critical path? A fallback in a logging cell is noise. A fallback inside your core join-and-aggregate chain is a problem.
  3. Is the net performance still positive? If the workload runs 3x faster overall despite two fallback warnings on minor operations, accept the win and move on.

Step 6: Validate Data Correctness

Faster means nothing if the answers change. For each migrated workload:

  • Compare output row counts between native and non-native runs on identical input data
  • Run hash comparisons on key output columns
  • For financial or compliance-sensitive pipelines, do a full row-level diff on a representative partition

The Native Execution Engine preserves Spark semantics, but floating-point arithmetic at boundary conditions, null handling in edge cases, and row ordering in non-deterministic operations all deserve explicit verification on your actual data. Do not skip this step because the TPC-DS numbers looked good. TPC-DS does not have your data shapes.
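
One way to automate the row-count and checksum comparison. This is a sketch, not a complete validation harness: xxhash64 is a built-in Spark SQL function, but the table and key column names are illustrative:

```python
# Sketch: order-independent fingerprints for comparing native vs. JVM runs
# on identical input data. Names are placeholders.

def table_fingerprint(spark, table: str, key_cols):
    from pyspark.sql import functions as F  # lazy import; needs a live session anyway
    df = spark.table(table)
    hashed = df.select(F.xxhash64(*[F.col(c) for c in key_cols]).alias("h"))
    # Summing per-row hashes yields a checksum that ignores row ordering.
    agg = hashed.agg(F.count("*").alias("rows"), F.sum("h").alias("checksum")).first()
    return {"rows": agg["rows"], "checksum": agg["checksum"]}

def compare_runs(native_fp: dict, jvm_fp: dict) -> list:
    issues = []
    if native_fp["rows"] != jvm_fp["rows"]:
        issues.append("row count mismatch")
    if native_fp["checksum"] != jvm_fp["checksum"]:
        issues.append("checksum mismatch")
    return issues
```

For compliance-sensitive pipelines, treat a clean fingerprint as permission to do the full row-level diff, not a substitute for it.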

Step 7: Plan Your Rollback

The best operational property of the Native Execution Engine: it can be disabled per cell, per notebook, per environment, instantly. No restarts. No redeployments.

In PySpark:

spark.conf.set('spark.native.enabled', 'false')


In Spark SQL:

SET spark.native.enabled=FALSE;


Your rollback plan is one line of configuration. But that line only helps if your on-call engineers know it exists. Document it. Add it to your runbook. Add it to the incident response template. The worst production regression is one where the fix takes ten seconds but nobody knows about it for two hours.

Step 8: Roll Out Incrementally

With validation complete, enable the engine in production using one of three strategies, ordered from most cautious to broadest:

Option C — Per-job enablement: Add spark.native.enabled=true to individual Spark Job Definitions or notebook configure blocks. You control exactly which workloads get native execution.

Option A — Environment-level: Navigate to your production Environment → Acceleration tab → enable. All notebooks and Spark Job Definitions using this environment inherit the setting.

Option B — Workspace default: Set your native-enabled environment as the workspace default via Workspace Settings → Data Engineering/Science → Environment. Everything in the workspace picks it up.

Start with Option C on your validated Tier 1 workloads. After a week of stable production runs, graduate to Option A. Option B is for teams that have fully validated their workspace and want blanket coverage.

Step 9: Monitor the First Week

Post-migration monitoring matters because production data is not test data. In the first week:

  • Watch CU consumption trends in Fabric monitoring. Compute-heavy workloads should show measurable drops.
  • Check the Spark Advisor for fallback warnings that did not appear during testing. Different data distributions or code paths in production can trigger different operators.
  • Set alerts on job duration. A sudden increase means a new fallback or regression appeared.
  • Pay attention to any jobs that were borderline in testing. Production-scale data volume can push a workload from “mostly native” to “mostly fallback” if it exercises operators that were uncommon in test data.

Step 10: Optimize for Maximum Native Coverage

Once stable, push further:

  • Replace all .show() calls with .collect() or .display() in scheduled notebook workflows
  • Refactor deeply nested struct operations into flat columnar operations where the query logic allows it
  • Consult the Apache Gluten documentation for the current supported operator list and avoid unsupported expressions in hot paths
  • Keep data in Parquet or Delta format — the engine processes these natively, and other formats require conversion that erases the acceleration
  • For write-heavy workloads, leverage the GA-release native Delta write acceleration, which extends native execution into the output path rather than just the read and transform stages

What Does Not Change

Several things remain identical and need no migration planning:

  • Spark APIs: Your PySpark, Scala, and SQL code is unchanged. The engine operates below the API surface.
  • Delta Lake semantics: ACID transactions, time travel, schema enforcement — all handled by the same Delta Lake 3.2 layer on Runtime 1.3.
  • Cost model: No additional CU charges. Your jobs finish faster, so you consume fewer CUs for the same work. The pricing advantage is indirect but real.
  • Fault tolerance: Spark still manages task retries, stage recovery, and speculative execution. The native engine handles computation; Spark handles resilience.

The Bottom Line

The Native Execution Engine is GA. It runs on the standard Fabric runtime. The performance gains are backed by reproducible benchmarks — up to 4x on TPC-DS at 1 TB, with real-world analytical workloads frequently reaching 6x. It costs nothing to enable and one line of configuration to revert.

But there is a gap between “we turned it on and things got faster” and “we know exactly which workloads got faster, by how much, what fell back, and what to do when something breaks.” The checklist above bridges that gap.

Runtime 1.3. Audit. Baselines. Test. Fallbacks. Correctness. Rollback. Incremental rollout. Monitor. Optimize.

Ten steps. Zero heroics. Measurably faster Spark.

This post was written with help from anthropic/claude-opus-4-6

Open Mirroring + OneLake: Spark patterns that keep latency from eating your weekends

Dev is clean. Prod is chaos. In dev, your mirrored table has a cute little dataset and Spark tears through it. In prod, that same notebook starts wheezing like it ran a marathon in wet jeans.

If that sounds familiar, good. You’re not cursed. You’re running into architecture debt that Open Mirroring does not solve for you.

Open Mirroring in Microsoft Fabric does exactly what it says on the tin: it replicates data from external systems into OneLake as Delta tables, and schema changes in the source can flow through. That’s huge. It cuts out a pile of ingestion plumbing.

But mirroring only lands data. It does not guarantee your Spark reads will be fast, stable, or predictable. That’s your job.

This post is the practical playbook: what breaks, why it breaks, and the patterns that keep your Spark jobs from turning into slow-motion disasters.

first principle: mirrored is a landing zone, not a serving layer

Treat mirrored tables like an airport runway. Planes touch down there. People do not set up a picnic on the tarmac.

When teams read mirrored tables directly in hot-path jobs, they inherit whatever file layout the connector produced. Sometimes that layout is fine. Sometimes it is a junk drawer.

Spark is sensitive to this. Reading many tiny files creates scheduling and metadata overhead. Reading a few huge files kills parallelism. Either way, the cluster burns time doing the wrong work.

The fix is boring and absolutely worth it: add a curated read layer.

  1. Let Open Mirroring write into a dedicated mirror lakehouse.
  2. Run a post-mirror notebook that reshapes data for Spark (partitioning, compaction, cleanup).
  3. Have production notebooks read curated tables only.

One extra hop. Much better nights of sleep.

what actually causes the latency cliff

Two things usually punch you in the face at scale:

  • File layout drift
  • Schema drift

Let’s tackle them in order.

1) file layout drift (the silent killer)

Spark scheduling is roughly file-driven for Parquet/Delta scans. That means file shape becomes execution shape. If your table has wildly uneven files, your job speed is set by the stragglers.

Think of ten checkout lanes where nine customers have one item and one customer has a full garage sale cart. Everyone waits on that last lane.

Start by measuring file distribution, not just row counts.

from pyspark.sql import functions as F

# NOTE: inputFiles() returns a Python list of file paths
df = spark.read.format("delta").load("Tables/raw_mirrored_orders")
paths = df.inputFiles()

# Use Hadoop FS to get file sizes in bytes
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)

sizes = []
for p in paths:
    size = fs.getFileStatus(jvm.org.apache.hadoop.fs.Path(p)).getLen()
    sizes.append((p, size))

size_df = spark.createDataFrame(sizes, ["path", "size_bytes"])

size_df.select(
    F.count("*").alias("file_count"),
    F.round(F.avg("size_bytes")/1024/1024, 2).alias("avg_mb"),
    F.round(F.expr("percentile_approx(size_bytes, 0.5)")/1024/1024, 2).alias("p50_mb"),
    F.round(F.expr("percentile_approx(size_bytes, 0.9)")/1024/1024, 2).alias("p90_mb"),
    F.round(F.max("size_bytes")/1024/1024, 2).alias("max_mb")
).show(truncate=False)


You want a tight-ish band, not chaos. A common rule of thumb is targeting roughly 128 MB to 512 MB Parquet files for balanced throughput and parallelism. Rule of thumb, not religion. Your workload decides final tuning.

Then enforce a sane shape in curated tables:

raw = spark.read.format("delta").load("Tables/raw_mirrored_orders")

(raw.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")         # choose columns your queries actually filter on
    .option("maxRecordsPerFile", 500000)
    .save("Tables/curated_orders"))

spark.sql("OPTIMIZE delta.`Tables/curated_orders`")


If your queries filter by date and region, but you partition by customer_id because it “felt right,” you built a latency trap with your own hands.

2) schema drift (the 3 a.m. pager)

Open Mirroring can propagate source schema changes. That’s useful and dangerous.

Useful because your lake stays aligned. Dangerous because downstream logic often assumes a fixed shape.

A nullable column addition is usually fine. A type shift on a key metric column can quietly corrupt aggregations or explode at runtime.

Do not “notice this later.” Gate on it.

from pyspark.sql.types import StructType
import json

# Store baseline schema as JSON in Files/schemas/orders_baseline.json
with open("/lakehouse/default/Files/schemas/orders_baseline.json", "r") as f:
    baseline = StructType.fromJson(json.load(f))

current = spark.read.format("delta").load("Tables/raw_mirrored_orders").schema

base = {f.name: str(f.dataType) for f in baseline.fields}
curr = {f.name: str(f.dataType) for f in current.fields}

type_changes = [
    f"{name}: {base[name]} -> {curr[name]}"
    for name in curr
    if name in base and base[name] != curr[name]
]

new_cols = [name for name in curr if name not in base]

if type_changes:
    raise ValueError(f"Schema type changes detected: {type_changes}")

# Optional policy: allow new nullable columns but log them
if new_cols:
    print(f"New columns detected: {new_cols}")


Policy matters more than code here. Decide in advance what is auto-accepted versus what blocks the pipeline. Write it down. Enforce it every run.

lag is real, even when everything is healthy

Mirroring pipelines are replication systems, not teleportation devices. There is always some delay between source commit and mirrored availability. Sometimes tiny. Sometimes not.

If your job blindly processes “last hour” windows without checking mirror freshness, you’ll create holes and call them “data quality issues” three weeks later.

Add a freshness gate before transformations. The metadata source is connector-specific, but the pattern is universal:

from datetime import datetime, timedelta, timezone

# Example only: use the metadata table/view exposed by your mirroring setup
last_mirror_ts = spark.sql("""
  SELECT max(replication_commit_ts) as ts
  FROM mirror_metadata.orders_status
""").collect()[0]["ts"]

required_freshness = datetime.now(timezone.utc) - timedelta(minutes=15)

if last_mirror_ts is None or last_mirror_ts < required_freshness:
    raise RuntimeError(
        f"Mirror not fresh enough. Last commit: {last_mirror_ts}, required after: {required_freshness}"
    )


No freshness, no run. That one gate saves you from publishing confident nonsense.

the production checklist (use this before go-live)

Before promoting any mirrored-data Spark pipeline, run this checklist in the same capacity and schedule window as production:

  • File shape check
    – Measure file count and distribution (p50, p90, max).
    – If distribution is ugly, compact and rewrite in curated.
  • Partition sanity check
    – Confirm partitions match real filter predicates.
    – Use df.explain(True) and verify PartitionFilters is not empty for common queries.
  • Schema gate check
    – Compare current schema to baseline.
    – Fail on type changes unless explicitly approved.
  • Freshness gate check
    – Validate mirrored data is fresh enough for your downstream SLA.
    – Fail fast if not.
  • Throughput reality check
    – Time representative full and filtered scans from curated tables.
    – If runtime misses SLA, fix layout first, then tune compute.

If you only do one thing from this list, do the curated layer. Direct reads from mirrored tables are the root of most performance horror stories.
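
The partition sanity check from the list above can be scripted. The substring check below matches Spark 3.x formatted-plan output; the path and predicate are illustrative:

```python
# Sketch: confirm partition pruning by scanning the formatted plan for a
# non-empty PartitionFilters entry. Names and predicates are placeholders.
import contextlib
import io

def partition_filters_present(plan_text: str) -> bool:
    """True when the scan node lists at least one partition filter."""
    for line in plan_text.splitlines():
        if "PartitionFilters" in line:
            return "PartitionFilters: []" not in line
    return False

def check_pruning(spark, table_path: str, predicate: str) -> bool:
    df = spark.read.format("delta").load(table_path).where(predicate)
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain(True)  # prints parsed/analyzed/optimized/physical plans
    return partition_filters_present(buf.getvalue())
```

Run it for your most common filter predicates before go-live; an empty PartitionFilters on a date-filtered query means the layout and the queries disagree.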

architecture that holds up when volume gets ugly

Keep it simple:

  1. Mirror layer
    Open Mirroring lands source data in OneLake Delta tables. This is your raw replica.

  2. Curation job
    A scheduled Spark notebook validates schema, reshapes partitions, and compacts files.

  3. Curated layer
    Downstream Spark notebooks and SQL consumers read here, not from mirror tables.

  4. Freshness gate
    Every downstream run checks replication freshness before processing.

That’s it. No heroics. No mystery knobs. Just a clean boundary between “data landed” and “data is ready to serve.”

Open Mirroring is genuinely powerful, but it is not magic. If you treat mirrored tables as production-ready serving tables, latency will eventually kneecap you. If you treat them as a landing zone and curate aggressively, Spark behaves, stakeholders stay calm, and your weekends stay yours.

This post was written with help from anthropic/claude-opus-4-6

What “Execute Power Query Programmatically” Means for Fabric Spark Teams

Somewhere in a Fabric workspace right now, two teams are maintaining the same transformation twice.

The BI team owns it in Power Query. The Spark team rewrote it in PySpark so a notebook could run it on demand. Both versions work. Both versions drift. Both versions break at different times.

That was normal.

Microsoft’s new Execute Query API (preview) is the first real shot at ending that duplication. It lets you execute Power Query (M) through a public REST API from notebooks, pipelines, or any HTTP client, then stream results back in Apache Arrow format.

For Spark teams, this isn’t a minor feature. It changes where transformation logic can live.

What actually shipped

At a technical level, the API is simple:

  • Endpoint: POST /v1/workspaces/{workspaceId}/dataflows/{dataflowId}/executeQuery
  • Input: a queryName, with optional customMashupDocument (full M script)
  • Output: Arrow stream (application/vnd.apache.arrow.stream)

The execution context comes from a Dataflow Gen2 artifact in your workspace. Its configured connections determine what data sources the query can access and which credentials are used.

That single detail matters more than it looks. You’re not just “calling M from Spark.” You’re running M under dataflow-governed connectivity and permissions.

Why Spark engineers should care

Before this API, Spark teams usually had two options:

  • Rewrite M logic in PySpark
  • Or wait for a dataflow refresh and consume the output later

Neither is great. Rewrites create long-term maintenance debt. Refresh handoffs add latency and orchestration fragility.

Now you can execute the transformation inline and keep moving.

A minimal call path looks like this:

import requests
import pyarrow as pa

response = requests.post(url, headers=headers, json=request_body, stream=True)

with pa.ipc.open_stream(response.raw) as reader:
    pandas_df = reader.read_pandas()

spark_df = spark.createDataFrame(pandas_df)


No CSV hop. No JSON schema drift. No custom parsing layer.

The non-negotiable constraints

This feature is useful, but it is not magic. There are hard boundaries.

  1. 90-second timeout
    – Query evaluations must complete within 90 seconds.
    – This is ideal for fast lookups, enrichment, and reference joins—not heavy batch reshaping.

  2. Read-only execution
    – The API executes queries only. It doesn’t support write actions.
    – If your notebook flow assumes “query + write” in one API step, redesign it.

  3. Native query rule for custom mashups
    – customMashupDocument does not allow native database queries.
    – But if a query defined inside the dataflow itself uses native queries, that query can be executed.
    – This distinction will trip people up if they treat inline M and stored dataflow queries as equivalent.

  4. Performance depends on folding and query complexity
    – Bad folding or expensive transformations can burn your 90-second window quickly.
    – You need folding-aware query reviews before production rollout.

Practical rollout plan for Spark teams

If you lead a Fabric Spark team, do this in order.

1) Inventory duplication first

Build a short list of transformations currently duplicated between M and PySpark. Start with transformations that are stable, reused often, and mostly read-oriented.

2) Stand up a dedicated execution dataflow

Create one Dataflow Gen2 artifact specifically for API-backed execution contexts.

  • Keep connections explicit and reviewed
  • Restrict who can modify those connections
  • Treat the artifact like infrastructure, not ad hoc workspace clutter

3) Wrap Execute Query behind one notebook utility

Don’t let every notebook hand-roll HTTP logic. Create one shared helper that handles:

  • token acquisition
  • request construction
  • Arrow stream parsing
  • error handling
  • timeout/response logging

If the API returns 202 (long-running operation), honor Location and Retry-After instead of guessing polling behavior.

4) Add governance checks before scale

Because execution runs under dataflow connection scope, validate:

  • who can execute
  • what connections they indirectly inherit
  • which data sources become reachable through that path

If your governance model assumes notebook identity is the only control plane, this API changes that assumption.

5) Monitor capacity from day one

Microsoft surfaces this usage in Capacity Metrics as “Dataflows Gen2 Run Query API”, billed on the same meter family as Dataflow Gen2 refresh operations. Watch this early so you don’t discover new spend after adoption is already wide.

Where this fits (and where it doesn’t)

Use it when you need:

  • shared transformation logic between BI and engineering
  • fast, read-oriented query execution from Spark/pipelines/apps
  • connector and gateway reach already configured in dataflows

Avoid it when you need:

  • long-running transformations
  • write-heavy jobs
  • mission-critical production paths with zero preview risk tolerance

The REST API docs still mark this as preview and “not recommended for production use.” Treat that warning as real, not ceremonial.

The organizational shift hiding behind the API

The technical win is straightforward: fewer rewrites, faster integration, cleaner data handoffs.

The harder change is social.

When Spark notebooks can directly execute M, ownership lines between BI and data engineering need to be explicit. Who owns business logic? Who owns runtime reliability? Who approves connection scope?

Teams that answer those questions early will move fast.

Teams that don’t will just reinvent the same duplication problem with a new endpoint.


Source notes

This post was written with help from anthropic/claude-opus-4-6.

What the February 2026 Fabric Influencers Spotlight means for your Spark team

Microsoft published its February 2026 Fabric Influencers Spotlight last week. Twelve community posts. MVPs and Super Users. Most people skim the list. Maybe bookmark a link. Move on.

Don’t.

Three of those posts carry signals that should change how your Spark data-engineering team operates in production. Not next quarter. Now.

Signal 1: Get your production code out of notebooks

Matthias Falland’s Fabric Friday episode makes the case plainly: notebooks are great for development but risky in production. That framing resonates with a lot of production teams—and for good reason.

Here’s the nuance. Microsoft has said there’s no inherent difference in performance or monitoring capabilities between Spark Job Definitions and notebooks. Both produce Spark logs. Both run on the same compute. The gap isn’t in what the platform offers. It’s in what each artifact encourages.

Notebooks encourage improvisation. Someone edits a cell at 2 AM. Cell state carries between runs. An error gets swallowed inside an output cell and nobody notices until downstream tables go stale. That’s not a platform limitation. That’s a human-factors problem. And production environments are where human-factors problems become outages.

Spark Job Definitions push you toward cleaner habits. One file per job. No cell state. Explicit parameters. Better modularity. The execution boundary is sharper, and sharper boundaries make failures easier to diagnose.

If your team runs notebooks on a schedule through pipelines, here’s the migration:

  • Audit every notebook that runs on a schedule or gets triggered by a pipeline. Count them. You’ll be surprised.
  • Extract the transformation logic into standalone Python or Scala files. One file per job. No magic. No “run all cells.”
  • Create Spark Job Definitions for each. Map your existing notebook parameters to SJD parameters. They work the same way—just without the cell baggage.
  • Wire them into your pipeline activities. Replace the notebook activity with an SJD activity. The orchestration stays identical.
  • Keep the notebooks for development and ad-hoc exploration. That’s where they shine.

A team of three can typically convert a dozen notebooks in a week. The hard part isn’t the migration. It’s the decision to start.

Signal 2: Direct Lake changes how you write to your lakehouse

Pallavi Routaray’s post on Direct Lake architecture is the most consequential piece in the whole spotlight. Easy to miss because the title sounds like a Power BI topic.

It’s not. It’s a Spark topic.

Direct Lake mode reads Parquet files directly from OneLake. No import copy. No DirectQuery overhead. But it only works well if your Spark jobs write data in a way that Direct Lake can consume efficiently. Get the file layout wrong and your semantic model falls back to DirectQuery silently. Performance craters. Your BI team blames you. Nobody knows why.

Here’s the production checklist:

  • Enable V-Order optimization on your Delta tables. V-Order sorts and compresses Parquet files for Direct Lake’s columnar read path. Here’s the catch: V-Order is disabled by default in new Fabric workspaces, a default chosen to favor write-heavy data engineering workloads. If your workspace was created recently, you need to enable it explicitly. Check your workspace settings—or set it at the table property level. Don’t assume it’s on.
  • Control your file sizes. Microsoft’s guidance is clear: keep the number of Parquet files small and use large row groups. If your Spark jobs produce thousands of tiny files, Direct Lake will hit its file-count limits and fall back. Run OPTIMIZE on your Delta tables after write operations. Compact aggressively.
  • Partition deliberately. Over-partitioning creates too many small files. Under-partitioning creates files that are too large for efficient column pruning. Partition by the grain your BI team actually filters on. Ask them. Don’t guess.
  • Watch for schema drift. Direct Lake models bind to specific columns at creation time. If your Spark job adds or renames a column, the semantic model breaks. Coordinate schema changes explicitly. No silent ALTER TABLE commands on Friday afternoons.
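For the first two checklist items, the SQL involved is short. A sketch with a hypothetical table name; the exact table property (`delta.parquet.vorder.enabled` here) and `OPTIMIZE … VORDER` support have shifted across Fabric runtime versions, so verify both against your runtime’s documentation before relying on them:

```python
# Hypothetical table name standing in for your own
table = "sales_clean"

# Enable V-Order at the table level rather than assuming a workspace default
enable_vorder = (
    f"ALTER TABLE {table} "
    "SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')"
)

# Compact small files after writes so Direct Lake stays under file-count limits
compact = f"OPTIMIZE {table} VORDER"

# In a Fabric notebook you would run:
# spark.sql(enable_vorder)
# spark.sql(compact)
```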

The big risk here: most Spark teams don’t know their output feeds a Direct Lake model. The BI team built it after the fact. Start by mapping which of your Delta tables have Direct Lake semantic models sitting on top. If you don’t know, find out today.

Signal 3: CI/CD for Fabric just got real

Kevin Chant’s post covers the fabric-cicd tool reaching general availability for configuration-based deployments with Azure DevOps. That GA status is confirmed, and it matters more than it sounds.

Until now, deploying Fabric artifacts across environments—dev, test, prod—was either manual or held together with custom scripts that broke every time the API changed. The fabric-cicd tool gives you a supported, versioned path.

For Spark teams:

  • Your Spark Job Definitions, lakehouse configurations, and pipeline definitions can live in source control and deploy through a proper pipeline. No more “I’ll just update it in the portal.”
  • Configuration differences between environments—connection strings, capacity settings, lakehouse names—get handled through configuration files. Not by editing items in the portal after deployment.
  • You can roll back. You can diff. You can review before promoting to production. The basic hygiene that every other engineering discipline has had for decades.

Here’s the migration path:

  • Install fabric-cicd from the latest release. Follow Chant’s posts for the Azure DevOps YAML pipeline specifics.
  • Export your existing workspace items to a Git repository. Fabric’s Git integration handles this natively.
  • Build your environment-specific configuration files. One per environment. Map the items that differ: capacity, lakehouse, connections.
  • Set up your Azure DevOps pipeline to run fabric-cicd on merge to main. Start with dry-run mode until you trust it.
  • Remove portal-level edit access for production workspaces. This is the hard step. It’s also the one that prevents the next outage.

The deeper pattern

These three signals connect. Falland tells you to move your Spark code into artifacts built for production discipline. Routaray tells you how to write your output so downstream models don’t silently degrade. Chant tells you how to deploy the whole thing reliably across environments.

That’s a production pipeline. End to end. Code that runs cleanly, writes data correctly, and deploys safely.

The February spotlight also includes Open Mirroring hands-on guidance from Inturi Suparna Babu and a Fabric Data Agent walkthrough from Shubham Rai. Both are worth a read if you’re evaluating data replication strategies or AI-assisted query patterns over your lakehouse. But for Spark teams running production workloads, the three signals above are where the action is.

Your rollout checklist for March

  1. Inventory all scheduled notebooks. Tag them by risk: frequency, data volume, downstream dependencies.
  2. Convert the highest-risk notebook to a Spark Job Definition this week. Validate it runs identically.
  3. Audit Delta table write patterns for any table that feeds a Direct Lake model. Check that V-Order is enabled. Run OPTIMIZE to compact files.
  4. Install fabric-cicd. Connect your workspace to Git. Build your first environment config.
  5. Pick one pipeline to deploy through CI/CD end-to-end. Prove it works before scaling.

Five items. All concrete. All doable in March.

The community did the research. Your job is to act on it.

This post was written with help from anthropic/claude-opus-4-6.

Keeping Spark, OneLake, and Mirroring Reliable in Microsoft Fabric

The alert fired at 2:14 AM on a Tuesday. A downstream Power BI report had gone stale — the Direct Lake dataset hadn’t refreshed in six hours. The on-call engineer opened the Fabric monitoring hub and found a cascade: three Spark notebooks had completed without triggering downstream freshness checks, a mirrored database was five hours behind, and the OneLake shortcut connecting them was returning intermittent 403 errors. It went undetected until a VP’s morning dashboard showed yesterday’s numbers.

That scenario is stressful, but it’s also solvable. These issues are usually about observability gaps between services, not broken fundamentals. If you’re running Spark workloads against OneLake with mirroring in Microsoft Fabric, you’ll likely encounter some version of this under real load. The key is having an operational playbook before it happens.

What follows is that playbook — assembled from documented production incidents, community post-mortems, and repeatable operating patterns from teams running this architecture at scale.

How Spark, OneLake, and mirroring connect (and where they don’t)

The dependency chain matters because issues can cascade through it in non-obvious ways.

Your Spark notebooks write Delta tables to OneLake lakehouses. Those tables might feed Direct Lake datasets in Power BI. Separately, Mirroring can replicate data from external sources — Azure SQL Database, Cosmos DB, Snowflake, and others — into OneLake as Delta tables. Shortcuts bridge lakehouses or reference external storage.

What makes this operationally nuanced: each layer has its own retry logic, auth tokens, and completion semantics. A Spark job can succeed from its own perspective (exit code 0, no exceptions) while the data it wrote is temporarily unavailable to downstream consumers because of a metadata sync delay. Mirroring can pause during source throttling and may not raise an immediate alert unless you monitor freshness directly. Shortcuts can go stale when target workspace permissions change.

You can end up with green pipelines and incomplete data. The gap between “the job ran” and “the data arrived correctly” is where most reliability work lives.

Detection signals you actually need

The first mistake teams make is relying on Spark job status alone. A job that completes successfully but writes zero rows, hits an unmonitored schema drift, or writes to the wrong partition is still a data quality issue.

Here’s what to watch instead:

Row count deltas. After every notebook run, compare the target table’s row count against expected intake. It doesn’t need to be exact — a threshold works. If the delta table grew by less than 10% of its average daily volume, fire a warning. Three lines of Spark SQL at the end of your notebook. Five minutes to implement. It prevents empty-table surprises at 9 AM.
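The check itself is tiny. A sketch of the threshold logic, assuming `avg_daily_rows` comes from wherever you already track historical volume; the function name is illustrative:

```python
def volume_looks_healthy(rows_written, avg_daily_rows, min_fraction=0.10):
    """False when today's growth is under min_fraction of the daily average."""
    if avg_daily_rows <= 0:
        return True  # no baseline yet, so nothing to compare against
    return rows_written >= avg_daily_rows * min_fraction

# At the end of the notebook, rows_written might come from something like:
# rows_written = spark.table("sales_clean").count() - previous_count
if not volume_looks_healthy(rows_written=1_200, avg_daily_rows=100_000):
    print("WARNING: table grew by less than 10% of its average daily volume")
```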

OneLake file freshness. The _delta_log folder in your lakehouse tables contains JSON commit files with timestamps. If the most recent commit is older than your pipeline cadence plus a reasonable buffer, investigate. A lightweight monitoring notebook that scans these timestamps across key tables takes about twenty minutes to build.
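A sketch of that monitoring notebook’s core. It uses file modification times as a proxy for the commit timestamps inside the JSON files, and assumes the tables are reachable through a local mount path (for example a `/lakehouse/default/Tables/...` mount); `latest_commit_age` and `stale_tables` are hypothetical helper names:

```python
import os
from datetime import datetime, timedelta, timezone

def latest_commit_age(table_path):
    """Age of the newest _delta_log commit file, or None if the table has none."""
    log_dir = os.path.join(table_path, "_delta_log")
    commits = [f for f in os.listdir(log_dir) if f.endswith(".json")]
    if not commits:
        return None
    newest = max(os.path.getmtime(os.path.join(log_dir, f)) for f in commits)
    return datetime.now(timezone.utc) - datetime.fromtimestamp(newest, timezone.utc)

def stale_tables(table_paths, cadence=timedelta(hours=1),
                 buffer=timedelta(minutes=15)):
    """Tables whose newest commit is older than pipeline cadence plus a buffer."""
    limit = cadence + buffer
    return [p for p in table_paths
            if (age := latest_commit_age(p)) is None or age > limit]
```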

Mirroring lag via canary rows. The monitoring hub shows mirroring status, but the granularity is coarse. For external databases, set up a canary: a table in your source that gets a timestamp updated every five minutes. Check that timestamp on the OneLake side. If the gap exceeds your SLA, you know mirroring is stalled before your users do.
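The SLA comparison on the OneLake side reduces to one function. A sketch, with hypothetical names; the canary timestamp itself would be read from the mirrored table:

```python
from datetime import datetime, timedelta, timezone

def mirroring_lag_exceeded(canary_ts, sla=timedelta(minutes=15), now=None):
    """True when the canary timestamp seen on the OneLake side breaches the SLA."""
    now = now or datetime.now(timezone.utc)
    return (now - canary_ts) > sla

# canary_ts would typically come from the mirrored copy, for example:
# canary_ts = spark.table("mirrored_db.dbo_canary").agg({"updated_at": "max"}).first()[0]
```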

Shortcut health checks. Shortcuts can degrade quietly when no direct check exists. A daily job that reads a single row from each shortcut target and validates the response catches broken permissions, expired SAS tokens, and misconfigured workspace references before they cause real damage.
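That daily job can be factored so the read itself is injected, which keeps the health-check logic testable outside Fabric. A sketch with hypothetical names:

```python
def check_shortcuts(read_one_row, shortcut_names):
    """Try to read one row through each shortcut; return the ones that fail.

    read_one_row: callable taking a shortcut name; raises on access errors
    (403s, expired SAS tokens, missing targets) or returns None when empty.
    """
    broken = []
    for name in shortcut_names:
        try:
            row = read_one_row(name)
            if row is None:
                broken.append((name, "empty result"))
        except Exception as exc:
            broken.append((name, str(exc)))
    return broken

# In a Fabric notebook, read_one_row might be something like:
# lambda name: spark.read.load(f"Tables/{name}").limit(1).head()
```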

Failure mode 1: the Spark write that succeeds but isn’t queryable yet

You’ll see this in Fabric notebook logs as a clean run. The Spark job processed data, performed transformations, called df.write.format("delta").mode("overwrite").save(). Exit code 0. But the data isn’t queryable from the SQL analytics endpoint, and Direct Lake still shows stale numbers.

What happened: the SQL analytics endpoint runs a separate metadata sync process that detects changes committed to lakehouse Delta tables. According to Microsoft’s documentation, under normal conditions this lag is less than one minute. But it can occasionally fall behind — sometimes significantly. The Fabric community has documented sync delays stretching to hours, particularly during periods of high platform load or when tables have large numbers of partition files.

This is the gap that catches teams off guard. The Delta commit landed in storage, but the SQL endpoint hasn’t picked it up yet.

Triage sequence:

  1. Open the lakehouse in Fabric and check the table directly via the lakehouse explorer. If the data appears there but not in the SQL endpoint, you’ve confirmed a metadata sync lag.
  2. Check Fabric capacity metrics. If your capacity is throttled (visible in the admin portal under capacity management), metadata sync can be deprioritized. Burst workloads earlier in the day can surface as sync delays later.
  3. Force a manual sync. In the SQL analytics endpoint, select “Sync” from the table context menu. You can also trigger this programmatically — Microsoft released a Refresh SQL Analytics Endpoint Metadata REST API (preview as of mid-2025), and it’s also available through the semantic-link-labs Python package.

Remediation: Add a post-write validation step to your notebooks. After writing the Delta table, wait 30 seconds, then query the SQL analytics endpoint for the max timestamp or row count. If it doesn’t match what you wrote, log a warning and retry the sync. If after three retries it still diverges, fail the pipeline explicitly so your alerting catches it. Don’t let a successful Spark job mask a downstream data gap.
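That validation step is just a small retry loop. A sketch with `check` and `trigger_sync` injected as callables, so the same skeleton works whether you force the sync manually or through the preview REST API; the function name is illustrative:

```python
import time

def validate_sql_endpoint(check, trigger_sync, retries=3, wait_seconds=30):
    """Retry the endpoint check, re-triggering a metadata sync between attempts.

    check: callable returning True when the SQL endpoint matches what was written.
    trigger_sync: callable that requests a metadata sync.
    """
    for attempt in range(retries):
        if check():
            return True
        print(f"WARNING: SQL endpoint out of sync (attempt {attempt + 1})")
        trigger_sync()
        time.sleep(wait_seconds)
    # Fail loudly so orchestration alerting fires instead of masking a data gap
    raise RuntimeError("SQL analytics endpoint still diverges after retries")
```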

Failure mode 2: mirroring goes quiet

Mirroring is genuinely useful for getting external data into OneLake without building custom pipelines. But one common reliability pattern is that replication can stall when the source system throttles or times out, and the monitoring hub may still show “Running” while data freshness drifts.

This pattern is often observed with Azure SQL Database sources during heavy read periods. The mirroring process opens change tracking connections that compete with production queries. When the source database gets busy, it can throttle the mirroring connection, and Fabric’s retry logic may back off for extended periods without immediately surfacing a hard error.

Triage sequence:

  1. Check mirroring status in the monitoring hub, but prioritize the “Last synced” timestamp over the status icon. “Running” with a last-sync time of four hours ago still indicates a problem.
  2. Check the source database’s connection metrics. If you’re mirroring from Azure SQL, look at DTU consumption and connection counts around the time mirroring lag increased. There’s often a correlation with a batch job or reporting burst.
  3. Inspect table-level mirroring status. Individual tables can fall behind while others sync normally. The monitoring hub aggregates this, which can hide partial lag.

Remediation: The canary-row pattern is your early warning system. For prevention, stagger heavy source-database workloads away from mirroring windows. If your Azure SQL is Standard tier, increasing DTU capacity or moving to Hyperscale gives mirroring more room. On the Fabric side, stopping and restarting mirroring resets the connection and forces a re-sync when retry backoff has become too aggressive.

Failure mode 3: shortcut permissions drift

Shortcuts are the connective tissue of OneLake — references across lakehouses, workspaces, and external storage without copying data. They deliver huge flexibility, but they benefit from explicit permission and token hygiene.

A common failure pattern: a shortcut that worked for months suddenly returns 403 errors or empty results. Spark notebooks that read from the shortcut either fail with ADLS errors or complete with zero rows if downstream checks aren’t strict.

Root causes, ranked by observed frequency in the field:

  1. A workspace admin changed role assignments, and the identity the shortcut was created under lost access. Usually accidental.
  2. For ADLS Gen2 shortcuts: the SAS token expired, or storage account firewall rules changed.
  3. Cross-tenant shortcuts relying on Entra ID B2B guest access. If guest policy changes on either tenant, shortcuts can break without a prominent Fabric notification.

Triage sequence:

  1. Open the shortcut definition in the lakehouse — Fabric shows a warning icon on broken shortcuts, but only in the lakehouse explorer.
  2. Test the shortcut target independently. Can you access the target lakehouse or storage account directly with the same identity? If not, it’s a permissions issue.
  3. For ADLS shortcuts, check storage account access logs in Azure Monitor. Look for 403 responses from Fabric service IP ranges.

Remediation: Use service principals with dedicated Fabric permissions rather than user identities for shortcuts. Set up a token rotation calendar with 30-day overlap between old and new tokens so you’re never caught by a hard expiration. Then keep a daily shortcut health-check job that reads one row from each shortcut target and validates expected row counts.

Failure mode 4: capacity throttling disguised as five different problems

This one is tricky because it can look like unrelated issues at once. Spark jobs slow down. Metadata syncs lag. Mirroring falls behind. SQL endpoint queries time out. Power BI reports go stale. Troubleshoot each symptom in isolation and you’ll end up looping.

The common thread: your Fabric capacity hit its compute limits and started throttling. Fabric uses a bursting and smoothing model — you can temporarily exceed your purchased capacity units, but that overuse gets smoothed across future time windows. The system recovers by throttling subsequent operations. A heavy Spark job at 10 AM can degrade Power BI performance at 3 PM unless capacity planning accounts for that delayed impact.

Triage sequence:

  1. Open the capacity admin portal and look at the CU consumption graph. Sustained usage above 100% followed by throttling bands is your signal.
  2. Identify top CU consumers. Spark notebooks and materialization operations (Direct Lake refreshes, semantic model processing) tend to be the heaviest. Capacity metrics break this down by workload type.
  3. Check the throttling policy and current throttling state. Fabric throttles interactive workloads first when background usage exceeds limits — meaning end users feel pain from batch jobs they never see.

Remediation: Separate workloads by time window. Push heavy Spark processing to off-peak hours. If you can’t shift the schedule, split workloads across multiple capacities — batch on one, interactive analytics on another. Set CU consumption alerts at 80% of capacity so you get warning before throttling starts.

For bursty Spark demand, also evaluate Spark Autoscale Billing. In the current Fabric model, Autoscale Billing is opt-in per capacity and runs Spark on pay-as-you-go serverless compute, so Spark jobs don’t consume your fixed Fabric CU pool. That makes it a strong option for ad-hoc spikes or unpredictable processing windows where manual SKU up/down management is too slow.

If your workload is predictable, pre-scaling SKU windows (for example, F32 to F64 before a known processing block) can still be effective — just manage cost guardrails and rollback timing tightly.

Assembling the runbook

A playbook works only if it’s accessible and actionable when the alert fires at 2 AM. Here’s how to structure it:

Tier 1 — automated checks (every pipeline cycle):
– Post-write row count validation in every Spark notebook
– Canary row freshness for every mirrored source
– _delta_log timestamp scan across key tables

Tier 2 — daily health checks (scheduled monitoring job):
– Shortcut validation: read one row from every shortcut target
– Capacity CU trending: alert if 7-day rolling average exceeds 70%
– Mirroring table-level lag report (not just aggregate status)

Tier 3 — incident response (when alerts fire):
– Start with capacity metrics. If throttling is active, it’s often the shared root cause behind multi-symptom incidents.
– Check mirroring “Last synced” timestamps. Don’t rely on status icons alone.
– For Spark write issues, verify SQL endpoint sync state independently from the Delta table itself.
– For shortcut errors, test target identity access directly outside of Fabric.

Fabric gives you powerful primitives: Spark at scale, OneLake as a unified data layer, and mirroring that removes a lot of custom ingestion plumbing. With cross-service monitoring and a practical runbook, these patterns become manageable operational events instead of recurring surprises.

This post was written with help from anthropic/claude-opus-4-6.