ExtractLabel just changed how your Spark pipelines should handle unstructured data

Every data engineer eventually inherits the same cursed pipeline.

Upstream sends you a blob of human text. Somewhere in that blob are the exact facts your downstream systems need: product name, issue category, requested resolution, timeline, who did what, and when. The facts are there. They are just buried in prose written by sleep-deprived humans, copied from emails, and occasionally typed from a phone in an airport parking lot.

For years, we handled this with a pile of hacks:

Regex that works until one user adds a comma
Hand-rolled NER that drifts quietly into uselessness
LLM prompts that return valid JSON on Monday, improv theater on Tuesday

Then we pretend this is fine by writing 300 lines of “normalization” code downstream, plus defensive checks, plus retry logic, plus enough if statements to make your future self hate your past self.

That is the old world.

ExtractLabel is the first Fabric AI Functions primitive that treats extraction like a contract instead of a vibe. You define the shape once in JSON Schema. The extraction step returns that shape. Your pipeline gets predictable structure instead of model improv.

If you run Spark workloads in Fabric, this matters immediately.

What AI Functions already gave you (and where it fell short)

Before ExtractLabel, the quick path looked like this:

df["text"].ai.extract("name", "profession", "city")

For exploration, that is great. For production, it is a trap.

Prototype extraction asks, “Can the model find useful fields?”
Production extraction asks, “Can every downstream consumer trust type, shape, and vocabulary every single run?”

Those are different questions.

The basic label call is lightweight and convenient, but it leaves the hardest part unsolved: schema discipline. If your routing logic expects one of four categories, free-form output creates entropy. If your analytics expects arrays, and extraction returns comma-separated strings, you are writing cleanup code forever. If optional fields are not explicitly nullable, models tend to fill blanks with plausible nonsense.

The model understanding was never the bottleneck. Contract reliability was.

ExtractLabel: the schema contract your pipeline needs

ExtractLabel gives you an explicit schema boundary between unstructured input and structured output. In pandas you import from synapse.ml.aifunc; in PySpark you import from synapse.ml.spark.aifunc. The core pattern is the same: define one label with object properties, requirements, and constraints.

Concrete example, using warranty claims:

from synapse.ml.aifunc import ExtractLabel  claim_schema = ExtractLabel(     label="claim",     max_items=1,     type="object",     description="Extract structured warranty claim information",     properties={         "type": "object",         "properties": {             "product_name": {"type": "string"},             "problem_category": {                 "type": "string",                 "enum": ["defect", "damage_in_transit", "missing_part", "other"],                 "description": "defect=stopped working or malfunctioning, damage_in_transit=arrived damaged, missing_part=something not included"             },             "problem_summary": {                 "type": "string",                 "description": "Max 20 words. Summarize the core issue."             },             "time_owned": {"type": ["string", "null"]},             "troubleshooting_tried": {                 "type": "array",                 "items": {"type": "string"}             },             "requested_resolution": {                 "type": "string",                 "enum": ["replacement", "refund", "repair", "other"]             }         },         "required": ["product_name", "problem_category", "problem_summary",                      "time_owned", "troubleshooting_tried", "requested_resolution"],         "additionalProperties": False     } )  df[["claim"]] = df["text"].ai.extract(claim_schema)

Input text:

“The smart thermostat stopped turning on after 12 days. I tried a reset and new batteries. Please replace it.”

Structured output:

{     "product_name": "smart thermostat",     "problem_category": "defect",     "problem_summary": "Thermostat stopped turning on after 12 days",     "time_owned": "12 days",     "troubleshooting_tried": ["reset", "new batteries"],     "requested_resolution": "replacement" }

That is the difference: you are no longer extracting “some fields.” You are producing an object your systems can rely on.

The five schema features that actually matter

Most teams will over-focus on “LLM extraction” and under-focus on schema design. That is backwards. The model is only half the system. The schema is what makes it production-safe.

1) Nullable types

Use explicit nullable definitions for fields that may not exist in the source text:

"time_owned": {"type": ["string", "null"]}

If you do not allow null, the model is pressured to invent. Nullable fields reduce that pressure.

2) Enums for category control

When downstream logic expects bounded values, enforce them with enum.

That turns category assignment from fuzzy language output into controlled vocabulary. If your pipeline routes by problem_category, this is non-negotiable.

3) Arrays for true multi-value extraction

If a claim can include multiple troubleshooting actions, represent it as an array. Do not accept packed strings and split later.

Array semantics belong in extraction, not in cleanup jobs.

4) Descriptions as extraction instructions

Descriptions are not decorative comments. They are guidance for the extraction step.

Use them to define edge behavior, clarify enum intent, and enforce concise summaries. Most quality gains come from this field, not from prompt wording elsewhere.

5) Nested objects for real-world structure

Complex payloads are rarely flat. If your domain includes sub-entities, model them as nested objects now. Flattening everything into top-level strings feels easier in week one and becomes technical debt by week six.

What this means for your Spark pipelines right now

If your team already runs text extraction in Fabric pipelines, ExtractLabel gives you a clean migration path with immediate payback in reliability.

Practical rollout plan:

Find the pain first. Audit extraction steps where downstream code spends time repairing output shape, casing, and categories. Those are your highest-ROI migrations.
Version schemas like code. Store schema definitions in source control with explicit version tags. Treat schema changes as contract changes, not casual edits.
Use one extraction contract per domain task. Do not build one giant universal schema. Warranty claims, support tickets, and contract clauses deserve separate schemas with domain-specific enums and guidance.
Prefer model-based schema authoring as complexity grows. Once schemas get large, hand-editing JSON gets brittle. Define structures in typed Python models and generate JSON Schema from there. You get stronger review discipline and fewer silent mistakes.
Build an evaluation harness before broad rollout. ExtractLabel enforces structure; it does not guarantee semantic correctness. Keep a labeled sample set, score extraction quality regularly, and review drift.
Tune operational settings with real workload telemetry. Concurrency, retry behavior, and throughput limits should be validated in your environment, not assumed from defaults. Measure error columns and latency under realistic load before declaring victory.

Verify runtime, capacity, and governance prerequisites against current Fabric documentation in your tenant before rollout. Platform details move. Your production runbooks should not rely on stale assumptions.

Migration risks worth thinking about

ExtractLabel is strong, but this is still LLM-powered extraction. You need grown-up operating discipline.

Model behavior drift

Even with stable schema shape, semantic interpretation can shift over time. A phrase that mapped to defect last month might map to other after a model update.

Mitigation: maintain a regression set and run periodic quality checks. Contract shape is necessary. Accuracy monitoring is mandatory.

Cost surprises at volume

Row-wise AI extraction scales linearly with data volume. Teams underestimate this, then panic when ingestion spikes.

Mitigation: test on representative daily volume, not a toy sample. Budget for peak days, not median days.

Schema evolution pain

You will add fields. You will split categories. You will regret one enum name. That is normal.

Mitigation: include schema version metadata in outputs and plan how downstream consumers handle mixed historical versions.

False confidence from “valid JSON”

Teams see valid typed output and stop questioning semantics. That is how bad extractions get into trusted dashboards.

Mitigation: sample manually, review periodically, and keep humans in the QA loop for high-impact fields.

When to use ExtractLabel vs. other approaches

Use ExtractLabel when all of these are true:

Input is unstructured text
Output must be typed and schema-conforming
You need extraction embedded in Fabric data workflows

Keep regex when the task is deterministic and mechanical (IDs, fixed-format dates, known token patterns).

Keep specialized NER pipelines when domain vocabulary is unusual, latency requirements are strict, or inference cost constraints are severe.

Use document-native extraction tools when layout matters (forms, scans, tables in images/PDFs). Text-column extraction will not recover geometry it never saw.

If your instinct is “we can just prompt harder,” stop. That is how you build a fragile system that passes demos and fails operations.

The bottom line

ExtractLabel moves Fabric extraction from improvisation to contracts.

The shiny part is one line of code:

df[["claim"]] = df["text"].ai.extract(claim_schema)

The valuable part is everything you encode in the schema: allowed values, nullability, nested structure, and descriptive guidance for edge cases.

Do that work once, and your downstream pipeline stops behaving like a cleanup crew.

Less duct tape, more reliable data.

This post was written with help from anthropic/claude-opus-4-6

What “Recent data” in Fabric means for Spark teams when time is the real bottleneck

At 8:07 a.m., nobody on a data engineering team is debating architecture purity. You’re trying to get back to the exact source you were fixing yesterday before another downstream notebook fails and somebody asks for an ETA.

That’s the problem Microsoft Fabric’s Recent data feature targets.

The feature landed in the February 2026 Fabric update and is currently in preview. It sounds small: Dataflow Gen2 remembers the specific items you used recently — tables, files, folders, databases, and sheets — and lets you load them directly into the editing canvas. For Spark-heavy teams, though, this is less of a UX tweak and more of a way to stop bleeding time in the first mile of work.

And yes, it’s still a preview feature. Treat it like a mountain route in unstable weather: useful, fast, and not something you trust blindly.

Why Spark teams should care about a Dataflow feature

A lot of Spark teams still frame Dataflow Gen2 as somebody else’s tool. That framing is outdated.

Dataflow Gen2 automatically creates staging Lakehouse and Warehouse items in your workspace. If your team’s workflow includes Dataflow-based ingestion and Spark-based transformation, the handoff between those steps is real. It’s your daily route.

Here’s the hard lesson: if your ingestion layer touches Dataflow Gen2, then UI friction inside Dataflow is your Spark team’s problem too.

What to do about it:

Write down your ingestion handoffs in plain language: source to Dataflow Gen2 to staging Lakehouse/Warehouse to Spark notebooks.
Mark where engineers repeatedly reconnect to the same sources. That’s where Recent data pays off first.

What Recent data changes under pressure

Recent data does one thing that matters: it remembers specific assets, not just abstract connections.

When you return to a fix, you’re not restarting the expedition from base camp. You get dropped closer to the problem. You can pull the item directly into the editing canvas and keep moving.

For teams, this changes the rhythm of incident response and iteration:

You get back to source-level corrections faster.
You reduce the chance that someone reconnects to the wrong similarly-named object while moving too fast.
You spend less team energy on navigation and more on data correctness.

None of this is glamorous. It’s also exactly where engineering throughput gets won.

Try this: during your next defect cycle, track one metric for a week — time from “issue found” to “source query/table reopened in Dataflow Gen2.” If that number drops after using Recent data, keep leaning in. If it doesn’t, your bottleneck is elsewhere.

What this feature doesn’t rescue you from

Teams love to over-credit new features. Recent data is a navigation accelerator. It’s not governance. It’s not validation. It’s not a replacement for naming discipline. And because it’s in preview, it’s not a foundation for critical operational assumptions.

If your source naming is chaotic, Recent data will surface chaos faster.

If your validation is weak, Recent data will help you ship mistakes sooner.

If your runbooks are vague, Recent data won’t magically teach new engineers what “correct” looks like.

Pair it with a minimum Spark validation pass after ingestion updates: schema check, null expectation, row-count sanity check. Keep this lightweight and repeatable. The point is fast feedback, not ceremony.

Preview discipline: run this like a survival checklist

Because Recent data is in preview, your team should operate with explicit guardrails.

Test in development first. Don’t roll workflow assumptions into production muscle memory before your team has used the feature in real edits.

Keep a source-of-truth map. Recent data is convenience. Your documented source map is control. Keep both.

Standardize names now. If a human can confuse two source objects at a glance, they will. Fix names before speed amplifies mistakes.

Define a fallback path. If the recent list doesn’t have what you need, nobody should improvise. Document the manual reconnect path and keep it current.

Review preview behavior monthly. If the feature behavior shifts while in preview, your team should notice fast and adjust intentionally. Assign one owner for “preview watch” each month. Their job: test the core flow, confirm assumptions still hold, alert the team if anything drifts.

The operating model for Spark leads

If you lead a Spark data engineering team, the decision is straightforward.

Use Recent data. Absolutely use it. But use it like a rope, not like wings.

A rope gets you through rough terrain faster when the team is clipped in, communicating, and following route discipline. Wings are what people imagine they have right before they step into empty air.

In practice:

Adopt the feature for speed.
Keep your documentation for continuity.
Keep naming conventions strict for safety.
Keep Spark-side validation for quality.
Treat preview status as a real risk signal, not legal fine print.

That combination is where this feature becomes meaningful. Not because it’s flashy. Because it removes repeated friction at exactly the point where your team loses focus, burns time, and compounds small mistakes.

In data engineering, the catastrophic failures usually start as tiny oversights repeated at scale. Recent data removes one class of those oversights — the constant re-navigation tax — but only if you wrap it in disciplined operating habits.

One less avoidable stumble on steep ground, so your team can spend its strength on the parts of the climb that actually require judgment.

This post was written with help from anthropic/claude-opus-4-6

From CDC to Lakehouse: Making Fabric Eventstreams SQL Survive Contact with Production Spark

Every data team eventually has the same bright idea: “Let’s do CDC so everything is real time.”

What follows is usually less bright.

Somebody wires up connectors, somebody else stands up Kafka, somebody definitely provisions a VM that nobody can later identify, and before long your “modern architecture” has one person who understands it, one person who is afraid of it, and one person who is on call for it. Usually the same person.

So yes, Fabric Eventstreams supporting native CDC connectors for Azure SQL, PostgreSQL, MySQL, and SQL Server sources matters. It removes a lot of plumbing work that used to be mandatory. More importantly, Eventstreams SQL can give you a place to interpret CDC events before they hit your lakehouse and Spark jobs.

That changes the shape of the problem. Not the existence of the problem. Just the shape.

And if you want this to run cleanly at 2:00 AM, the operational details matter more than the architecture diagram.

What Eventstreams SQL actually fixes

Raw CDC events are not analyst-friendly data. They are little envelopes full of intent and drama: insert, update, delete, before image, after image, metadata about the source transaction, and enough ambiguity to start arguments in code review.

If you ship those raw events downstream, every Spark notebook has to interpret them. That means duplicate merge logic and subtle differences between implementations. Two teams can read the same feed and produce slightly different answers. That is how trust in a data platform dies quietly.

Eventstreams SQL can resolve some of those semantics earlier. You can translate event-level changes into cleaner, ready-to-consume records before data lands in destinations.

That can be useful, but it is also where teams start sneaking business logic into the stream layer and then regretting it later.

The bigger question is not just where true merge logic belongs. It is where CDC interpretation belongs at all.

The merge logic decision you cannot avoid

You have two broad options:

Push CDC interpretation upstream into Eventstreams SQL before landing.
Treat Eventstream primarily as a transport layer, land raw or minimally altered CDC into staging, and resolve table semantics in the target engine.

I think option 2 is the better default.

Why? Because once you start doing meaningful CDC interpretation in the stream layer, you now have business logic living in the place that is hardest to reason about, hardest to test, and easiest to forget. You also make it much easier for different downstream systems to drift away from each other.

A cleaner pattern is:

use Eventstream for ingestion, routing, and maybe very light filtering
land into a staging layer
let the target system own merge semantics

That means Azure SQL should own MERGE logic for Azure SQL targets. Lakehouse targets should use Spark or Delta MERGE INTO. The compute engine that owns the table should own the table semantics too.

Trying to make the stream layer do more than that is how teams end up with hidden logic, debugging hell, and architecture diagrams that look cleaner than the actual system.

One important caveat: Eventstreams SQL is not a substitute for Delta MERGE INTO on a Lakehouse table.

Checkpoints: boring, critical, and often broken by accident

Spark Structured Streaming checkpointing is one of those things everybody “knows” until a restart fails and nobody remembers how it works.

Checkpoint locations track stream progress. They are state, not decoration. They are tied to your query plan, and when you change schema or query structure, old checkpoint state may no longer be valid.

This is not an edge case. It is normal lifecycle behavior in evolving pipelines.

Three rules keep you out of trouble:

Use distinct checkpoint paths per stream and per target table.
Version checkpoint paths when query shape or schema changes.
Watch lag between source offsets and committed checkpoint progress.

If you use one checkpoint path for multiple sinks, you are building future pain on purpose. If you change query shape without checkpoint versioning, restart failures are only a matter of timing.

Treat checkpoint migration as a cutover process. Track where old progress stopped, cut to a new checkpoint path intentionally, then retire the previous one once the new job is stable.

The small files problem is not glamorous, but it will hurt you

Most CDC pipelines do not fail dramatically. They fail by becoming slower each week until everyone pretends that 90 seconds is “pretty fast.”

Small files are often the culprit.

CDC streams produce frequent, small increments. Structured Streaming writes micro-batches. Direct lakehouse writes can also produce many tiny files depending on event cadence. Over time, table reads pay the cost in file listing and metadata overhead.

People love to ignore this because compaction feels like janitorial work. It is not. It is core performance engineering.

What works in practice:

Repartition before write based on available Spark cores.
Partition on-disk by ingestion date, and only add other partition keys when query patterns justify it.
Do not partition by operation type. That creates tiny partitions and extra noise.
Run regular OPTIMIZE jobs on high-volume CDC tables.

If you are writing through Spark, control file behavior with repartitioning and trigger cadence. A trigger(processingTime='30 seconds') or trigger(processingTime='2 minutes') can reduce file explosion compared with ultra-frequent batches.

If you are using direct Eventstreams-to-Lakehouse writes, accept that you are trading simplicity for less control and schedule compaction accordingly.

The exact maintenance workflow matters less than having one. One-off cleanup is fine when you are exploring, but scheduled maintenance is what keeps tables healthy over time.

Deletes: decide your philosophy before compliance decides for you

In CDC, inserts and updates are easy to reason about. Deletes are where architecture gets emotional.

For analytics, soft deletes are often the sane default: keep the row, mark is_deleted, add deleted_at, preserve history. This keeps downstream trend analysis and audit trails intact.

Hard deletes are different. If compliance requires physical removal, handle that intentionally, usually with batch logic that applies delete events against target Delta tables after landing.

A reliable pattern is:

Stream all CDC events, including deletes, into staging.
Run scheduled jobs that apply physical deletion rules to curated tables.

That keeps streaming simple and pushes irreversible operations into auditable, controllable execution windows.

Could you do something fancier? Probably. Should you, before you need to? Probably not.

Monitoring: minimum viable or maximum regret

A CDC pipeline with no alerting is just a suspense novel written in production.

Your baseline should cover four things:

Stream health: is each Structured Streaming query active or terminated?
Processing lag: how far are committed offsets behind source offsets?
File accumulation: are table file counts growing faster than compaction can handle?
Source silence: are you receiving events at all from CDC sources?

That last one matters because “no errors” does not mean “healthy.” If CDC gets disabled during maintenance, your pipeline can fail by producing nothing, which looks calm unless you explicitly monitor for inactivity windows.

Fabric Activator-based alerts can be useful for surfacing threshold breaches. Tie thresholds to actual SLAs, not vibes.

A practical starting playbook

If you are standing this up now, keep it simple:

Enable CDC at the source (sys.sp_cdc_enable_db and sys.sp_cdc_enable_table where applicable).
Validate flow end to end with one real table before scaling breadth.
Segment tables early: simple merge logic in Eventstreams SQL, complex logic in Spark.
Define checkpoint path standards before the first production deploy.
Pick trigger intervals that balance latency with file quality.
Schedule OPTIMIZE from day one, not after performance complaints.
Document merge ownership per table so changes do not become archaeology.

None of this is exotic. That is exactly the point.

Good CDC architecture is usually not a story about cleverness. It is a story about disciplined boring decisions made early, then repeated consistently.

Final take

Fabric Eventstreams plus Spark can give teams a cleaner CDC path than the old connector-plus-consumer patchwork. Native CDC connectors can reduce integration grind. But I would still keep meaningful CDC interpretation and merge behavior in the target compute engine whenever possible. Spark Structured Streaming remains a practical choice for controlled writes and advanced merge behavior.

But the real success criteria are operational.

If you manage checkpoints like real state, control file growth before it controls you, choose a deliberate delete strategy, and wire up monitoring that catches silence as well as failure, this architecture can work well in production.

If you skip those details, it still works right up until the exact moment it doesn’t, which usually happens late, loud, and at the least convenient hour in human history.

That is less a Fabric problem than a production engineering problem. Fabric can simplify parts of the workflow, but it does not remove the need for operational discipline.

This post was written with help from anthropic/claude-opus-4-6

What “Execute Power Query Programmatically” Means for Fabric Spark Teams

Somewhere in a Fabric workspace right now, two teams are maintaining the same transformation twice.

The BI team owns it in Power Query. The Spark team rewrote it in PySpark so a notebook could run it on demand. Both versions work. Both versions drift. Both versions break at different times.

That was normal.

Microsoft’s new Execute Query API (preview) is the first real shot at ending that duplication. It lets you execute Power Query (M) through a public REST API from notebooks, pipelines, or any HTTP client, then stream results back in Apache Arrow format.

For Spark teams, this isn’t a minor feature. It changes where transformation logic can live.

What actually shipped

At a technical level, the API is simple:

Endpoint: POST /v1/workspaces/{workspaceId}/dataflows/{dataflowId}/executeQuery
Input: a queryName, with optional customMashupDocument (full M script)
Output: Arrow stream (application/vnd.apache.arrow.stream)

The execution context comes from a Dataflow Gen2 artifact in your workspace. Its configured connections determine what data sources the query can access and which credentials are used.

That single detail matters more than it looks. You’re not just “calling M from Spark.” You’re running M under dataflow-governed connectivity and permissions.

Why Spark engineers should care

Before this API, Spark teams usually had two options:

Rewrite M logic in PySpark
Or wait for a dataflow refresh and consume the output later

Neither is great. Rewrites create long-term maintenance debt. Refresh handoffs add latency and orchestration fragility.

Now you can execute the transformation inline and keep moving.

A minimal call path looks like this:

import requests import pyarrow as pa  response = requests.post(url, headers=headers, json=request_body, stream=True)  with pa.ipc.open_stream(response.raw) as reader:     pandas_df = reader.read_pandas()  spark_df = spark.createDataFrame(pandas_df)

No CSV hop. No JSON schema drift. No custom parsing layer.

The non-negotiable constraints

This feature is useful, but it is not magic. There are hard boundaries.

90-second timeout
– Query evaluations must complete within 90 seconds.
– This is ideal for fast lookups, enrichment, and reference joins—not heavy batch reshaping.
Read-only execution
– The API executes queries only. It doesn’t support write actions.
– If your notebook flow assumes “query + write” in one API step, redesign it.
Native query rule for custom mashups
– customMashupDocument does not allow native database queries.
– But if a query defined inside the dataflow itself uses native queries, that query can be executed.
– This distinction will trip people if they treat inline M and stored dataflow queries as equivalent.
Performance depends on folding and query complexity
– Bad folding or expensive transformations can burn your 90-second window quickly.
– You need folding-aware query reviews before production rollout.

Practical rollout plan for Spark teams

If you lead a Fabric Spark team, do this in order.

1) Inventory duplication first

Build a short list of transformations currently duplicated between M and PySpark. Start with transformations that are stable, reused often, and mostly read-oriented.

2) Stand up a dedicated execution dataflow

Create one Dataflow Gen2 artifact specifically for API-backed execution contexts.

Keep connections explicit and reviewed
Restrict who can modify those connections
Treat the artifact like infrastructure, not ad hoc workspace clutter

3) Wrap Execute Query behind one notebook utility

Don’t let every notebook hand-roll HTTP logic. Create one shared helper that handles:

token acquisition
request construction
Arrow stream parsing
error handling
timeout/response logging

If the API returns 202 (long-running operation), honor Location and Retry-After instead of guessing polling behavior.

4) Add governance checks before scale

Because execution runs under dataflow connection scope, validate:

who can execute
what connections they indirectly inherit
which data sources become reachable through that path

If your governance model assumes notebook identity is the only control plane, this API changes that assumption.

5) Monitor capacity from day one

Microsoft surfaces this usage in Capacity Metrics as “Dataflows Gen2 Run Query API”, billed on the same meter family as Dataflow Gen2 refresh operations. Watch this early so you don’t discover new spend after adoption is already wide.

Where this fits (and where it doesn’t)

Use it when you need:

shared transformation logic between BI and engineering
fast, read-oriented query execution from Spark/pipelines/apps
connector and gateway reach already configured in dataflows

Avoid it when you need:

long-running transformations
write-heavy jobs
mission-critical production paths with zero preview risk tolerance

The REST API docs still mark this as preview and “not recommended for production use.” Treat that warning as real, not ceremonial.

The organizational shift hiding behind the API

The technical win is straightforward: fewer rewrites, faster integration, cleaner data handoffs.

The harder change is social.

When Spark notebooks can directly execute M, ownership lines between BI and data engineering need to be explicit. Who owns business logic? Who owns runtime reliability? Who approves connection scope?

Teams that answer those questions early will move fast.

Teams that don’t will just reinvent the same duplication problem with a new endpoint.

Source notes

Microsoft Fabric Blog: Evaluate Power Query Programmatically in Microsoft Fabric (Preview)
https://blog.fabric.microsoft.com/en-US/blog/execute-power-query-programmatically-in-microsoft-fabric/
Microsoft Learn REST API: Query Execution – Execute Query
https://learn.microsoft.com/en-us/rest/api/fabric/dataflow/query-execution/execute-query
Microsoft Learn: Pricing for Dataflow Gen2
https://learn.microsoft.com/en-us/fabric/data-factory/pricing-dataflows-gen2

This post was written with help from anthropic/claude-opus-4-6.

What the February 2026 Fabric Influencers Spotlight means for your Spark team

Microsoft published its February 2026 Fabric Influencers Spotlight last week. Twelve community posts. MVPs and Super Users. Most people skim the list. Maybe bookmark a link. Move on.

Don’t.

Three of those posts carry signals that should change how your Spark data-engineering team operates in production. Not next quarter. Now.

Signal 1: Get your production code out of notebooks

Matthias Falland’s Fabric Friday episode makes the case plainly: notebooks are great for development but risky in production. That framing resonates with a lot of production teams—and for good reason.

Here’s the nuance. Microsoft has said there’s no inherent difference in performance or monitoring capabilities between Spark Job Definitions and notebooks. Both produce Spark logs. Both run on the same compute. The gap isn’t in what the platform offers. It’s in what each artifact encourages.

Notebooks encourage improvisation. Someone edits a cell at 2 AM. Cell state carries between runs. An error gets swallowed inside an output cell and nobody notices until downstream tables go stale. That’s not a platform limitation. That’s a human-factors problem. And production environments are where human-factors problems become outages.

Spark Job Definitions push you toward cleaner habits. One file per job. No cell state. Explicit parameters. Better modularity. The execution boundary is sharper, and sharper boundaries make failures easier to diagnose.

If your team runs notebooks on a schedule through pipelines, here’s the migration:

Audit every notebook that runs on a schedule or gets triggered by a pipeline. Count them. You’ll be surprised.
Extract the transformation logic into standalone Python or Scala files. One file per job. No magic. No “run all cells.”
Create Spark Job Definitions for each. Map your existing notebook parameters to SJD parameters. They work the same way—just without the cell baggage.
Wire them into your pipeline activities. Replace the notebook activity with an SJD activity. The orchestration stays identical.
Keep the notebooks for development and ad-hoc exploration. That’s where they shine.

A team of three can typically convert a dozen notebooks in a week. The hard part isn’t the migration. It’s the decision to start.

Signal 2: Direct Lake changes how you write to your lakehouse

Pallavi Routaray’s post on Direct Lake architecture is the most consequential piece in the whole spotlight. Easy to miss because the title sounds like a Power BI topic.

It’s not. It’s a Spark topic.

Direct Lake mode reads Parquet files directly from OneLake. No import copy. No DirectQuery overhead. But it only works well if your Spark jobs write data in a way that Direct Lake can consume efficiently. Get the file layout wrong and your semantic model falls back to DirectQuery silently. Performance craters. Your BI team blames you. Nobody knows why.

Here’s the production checklist:

Enable V-Order optimization on your Delta tables. V-Order sorts and compresses Parquet files for Direct Lake’s columnar read path. Here’s the catch: V-Order is disabled by default in new Fabric workspaces, optimized for write-heavy data engineering workloads. If your workspace was created recently, you need to enable it explicitly. Check your workspace settings—or set it at the table property level. Don’t assume it’s on.
Control your file sizes. Microsoft’s guidance is clear: keep the number of Parquet files small and use large row groups. If your Spark jobs produce thousands of tiny files, Direct Lake will hit its file-count limits and fall back. Run OPTIMIZE on your Delta tables after write operations. Compact aggressively.
Partition deliberately. Over-partitioning creates too many small files. Under-partitioning creates files that are too large for efficient column pruning. Partition by the grain your BI team actually filters on. Ask them. Don’t guess.
Watch for schema drift. Direct Lake models bind to specific columns at creation time. If your Spark job adds or renames a column, the semantic model breaks. Coordinate schema changes explicitly. No silent ALTER TABLE commands on Friday afternoons.

The big risk here: most Spark teams don’t know their output feeds a Direct Lake model. The BI team built it after the fact. Start by mapping which of your Delta tables have Direct Lake semantic models sitting on top. If you don’t know, find out today.

Signal 3: CI/CD for Fabric just got real

Kevin Chant’s post covers the fabric-cicd tool reaching general availability for configuration-based deployments with Azure DevOps. This is verified and it matters more than it sounds.

Until now, deploying Fabric artifacts across environments—dev, test, prod—was either manual or held together with custom scripts that broke every time the API changed. The fabric-cicd tool gives you a supported, versioned path.

For Spark teams:

Your Spark Job Definitions, lakehouse configurations, and pipeline definitions can live in source control and deploy through a proper pipeline. No more “I’ll just update it in the portal.”
Configuration differences between environments—connection strings, capacity settings, lakehouse names—get handled through configuration files. Not by editing items in the portal after deployment.
You can roll back. You can diff. You can review before promoting to production. The basic hygiene that every other engineering discipline has had for decades.

Here’s the migration path:

Install fabric-cicd from the latest release. Follow Chant’s posts for the Azure DevOps YAML pipeline specifics.
Export your existing workspace items to a Git repository. Fabric’s Git integration handles this natively.
Build your environment-specific configuration files. One per environment. Map the items that differ: capacity, lakehouse, connections.
Set up your Azure DevOps pipeline to run fabric-cicd on merge to main. Start with dry-run mode until you trust it.
Remove portal-level edit access for production workspaces. This is the hard step. It’s also the one that prevents the next outage.

The deeper pattern

These three signals connect. Falland tells you to move your Spark code into artifacts built for production discipline. Routaray tells you how to write your output so downstream models don’t silently degrade. Chant tells you how to deploy the whole thing reliably across environments.

That’s a production pipeline. End to end. Code that runs cleanly, writes data correctly, and deploys safely.

The February spotlight also includes Open Mirroring hands-on guidance from Inturi Suparna Babu and a Fabric Data Agent walkthrough from Shubham Rai. Both are worth a read if you’re evaluating data replication strategies or AI-assisted query patterns over your lakehouse. But for Spark teams running production workloads, the three signals above are where the action is.

Your rollout checklist for March

Inventory all scheduled notebooks. Tag them by risk: frequency, data volume, downstream dependencies.
Convert the highest-risk notebook to a Spark Job Definition this week. Validate it runs identically.
Audit Delta table write patterns for any table that feeds a Direct Lake model. Check that V-Order is enabled. Run OPTIMIZE to compact files.
Install fabric-cicd. Connect your workspace to Git. Build your first environment config.
Pick one pipeline to deploy through CI/CD end-to-end. Prove it works before scaling.

Five items. All concrete. All doable in March.

The community did the research. Your job is to act on it.

This post was written with help from anthropic/claude-opus-4-6

The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

There’s a connector that ships with every Fabric Spark runtime. It’s pre-installed. It requires no setup. And it lets your Spark notebooks read from—and write to—Fabric Data Warehouse tables as naturally as they read Delta tables from a Lakehouse.

Most Fabric Spark users don’t know it exists. The ones who do often run into the same three or four surprises. Let’s fix both problems.

What the connector actually is

The Spark connector for Fabric Data Warehouse (synapsesql) is a built-in extension to the Spark DataFrame API. It uses the TDS protocol to talk directly to the SQL engine behind your Warehouse (or the SQL analytics endpoint of a Lakehouse). You get read and write access to Warehouse tables from PySpark, Scala Spark, or Spark SQL.

One line of code to read:

from com.microsoft.spark.fabric.Constants import Constants  df = spark.read.synapsesql("my_warehouse.dbo.sales_fact")

One line to write:

df.write.mode("append").synapsesql("my_warehouse.dbo.sales_fact")

No connection strings. No passwords. No JDBC driver management. Authentication flows through Microsoft Entra—same identity you’re logged into your Fabric workspace with. The connector resolves the SQL endpoint automatically based on workspace context.

That’s the happy path. Now let’s talk about what actually happens when you use it.

Reading: the part that mostly just works

Reading from a Warehouse table into a Spark DataFrame is the connector’s strength. The synapsesql() call supports the full three-part naming convention: warehouse_name.schema_name.table_or_view_name. It works for tables and views, including views with joins across schemas.

A few things that are genuinely useful:

Predicate pushdown works. When you chain .filter() or .limit() onto your DataFrame, the connector pushes those constraints to the SQL engine. You’re not pulling the full table into Spark memory and then filtering—the SQL engine handles the filter and sends back the subset. This matters when your Warehouse tables have hundreds of millions of rows and you only need a time-sliced sample.

df = spark.read.synapsesql("my_warehouse.dbo.sales_fact") \     .filter("order_date >= '2026-01-01'") \     .select("order_id", "customer_id", "amount")

Cross-workspace reads work. If your Warehouse lives in a different workspace than your notebook’s attached Lakehouse, you pass the workspace ID:

df = spark.read \     .option(Constants.WorkspaceId, "<target-workspace-id>") \     .option(Constants.DatawarehouseId, "<warehouse-item-id>") \     .synapsesql("my_warehouse.dbo.sales_fact")

This is genuinely powerful for hub-and-spoke architectures where your curated Warehouse sits in a production workspace and your data science notebooks live in a sandbox workspace.

Parallel reads are available. For large tables, you can partition the read across multiple Spark tasks, similar to spark.read.jdbc:

df = spark.read \     .option("partitionColumn", "order_id") \     .option("lowerBound", 1) \     .option("upperBound", 10000000) \     .option("numPartitions", 8) \     .synapsesql("my_warehouse.dbo.sales_fact")

This splits the query into eight parallel reads, each fetching a range of order_id. Without this, you get a single-threaded read that will bottleneck on large tables.

Security models pass through. If your Warehouse has object-level security (OLS), row-level security (RLS), or column-level security (CLS), those policies are enforced when Spark reads the data. Your notebook sees exactly what your identity is authorized to see. This is a meaningful difference from reading Delta files directly via OneLake, where security operates at the workspace or folder level.

Custom T-SQL queries work too. You’re not limited to reading tables—you can pass arbitrary T-SQL:

df = spark.read \     .option(Constants.DatabaseName, "my_warehouse") \     .synapsesql("SELECT TOP 1000 * FROM dbo.sales_fact WHERE region = 'WEST'")

This is handy for complex aggregations or when you want the SQL engine to do the heavy lifting before data enters Spark.

Writing: the part with surprises

Write support for the Spark-to-Warehouse connector became generally available with Runtime 1.3. It works, and it solves a real architectural problem—but it has mechanics you need to understand.

How writes actually work under the hood

The connector uses a two-phase process:

Stage: Spark writes your DataFrame to intermediate Parquet files in a staging location.
Load: The connector issues a COPY INTO command, telling the Warehouse SQL engine to ingest from the staged files.

This is the same COPY INTO pattern that powers bulk ingestion into Fabric Data Warehouse generally. It’s optimized for throughput. It is not optimized for latency on small writes.

If you’re writing a DataFrame with 50 rows, the overhead of staging files and issuing COPY INTO means the write takes materially longer than you’d expect. For small, frequent writes, this connector is not the right tool. Use T-SQL INSERT statements through a SQL connection instead.

For batch writes of thousands to millions of rows, the connector performs well. The COPY INTO path is what the Warehouse was designed for.

Save modes

The connector supports four save modes:

errorifexists (default): Fails if the table already exists.
ignore: Silently skips the write if the table exists.
overwrite: Drops and recreates the table with new data.
append: Adds rows to the existing table.

df.write.mode("overwrite").synapsesql("my_warehouse.dbo.daily_aggregates")

A common pattern: Spark computes daily aggregations from Lakehouse Delta tables, then writes the results to a Warehouse table that Power BI reports connect to. The Warehouse’s result set caching (now generally available as of January 2026) means subsequent Power BI refreshes hit cache instead of recomputing.

The timestamp_ntz gotcha

This is the single most common error people hit when writing to a Warehouse from Spark.

If your DataFrame contains timestamp_ntz (timestamp without time zone) columns, the write will fail. Fabric Data Warehouse expects time-zone-aware timestamps. The fix is a cast before you write:

from pyspark.sql.functions import col  for c in df.columns:     if dict(df.dtypes)[c] == "timestamp_ntz":         df = df.withColumn(c, col(c).cast("timestamp"))  df.write.mode("append").synapsesql("my_warehouse.dbo.target_table")

This is not documented prominently enough. If you see a Py4JJavaError during write that mentions type conversion, timestamps are the first thing to check.

What you can’t write to

The connector writes to Warehouse tables only. You cannot write to the SQL analytics endpoint of a Lakehouse—it’s read-only. If you try, you’ll get an error. This seems obvious but trips people up because the same synapsesql() method handles both reads from Warehouses and Lakehouse SQL endpoints.

Private Link limitations

If Private Link is enabled at the workspace level, both read and write operations through the connector are unsupported. If Private Link is enabled at the tenant level only, writes are unsupported but reads still work. This is a significant limitation for security-conscious deployments. Check your network configuration before building pipelines that depend on this connector.

Time Travel is not supported

Fabric Data Warehouse now supports Time Travel queries. However, the Spark connector does not pass through Time Travel syntax. If you need to query a table as of a specific point in time, you’ll need to use a T-SQL connection directly rather than the synapsesql() method.

When to use Warehouse vs. Lakehouse as your serving layer

This is the architectural question that the connector’s existence forces you to answer. You’ve got data in your Lakehouse. Spark has transformed it. Now where does it go?

Use Lakehouse Delta tables when:

Your consumers are other Spark notebooks or Spark-based ML pipelines.
You need schema evolution flexibility (Delta’s schema merge).
You want to use OPTIMIZE, VACUUM, and Z-ORDER for table maintenance.
Your data scientists need direct file access through OneLake APIs.

Use Warehouse tables when:

Your primary consumers are Power BI reports or T-SQL analysts.
You need the Warehouse’s result set caching for repeated query patterns.
You need fine-grained security (RLS, CLS, OLS) that passes through to all consumers.
You want to use T-SQL stored procedures, views, and MERGE statements for downstream transformations.
You need cross-database queries that join Warehouse tables with Lakehouse tables or other Warehouse tables.

Use both when:

Spark processes and stores data in the Lakehouse (bronze → silver → gold medallion layers), then the connector writes final aggregations or serving tables to the Warehouse.
The Warehouse serves as the “last mile” between your data engineering work and your business intelligence layer.

The January 2026 GA of MERGE in Fabric Data Warehouse makes the “write to Warehouse” pattern significantly more useful. You can now do incremental upserts: Spark writes a staging table, then a T-SQL MERGE reconciles it with the target. This is a common pattern in data warehousing that was previously awkward in Fabric.

A concrete pattern: Spark ETL → Warehouse serving layer

Here’s the pattern I see working well in production:

# 1. Read from Lakehouse Delta tables (Spark native) bronze = spark.read.format("delta").load("Tables/raw_orders")  # 2. Transform in Spark silver = bronze.filter(col("status") != "cancelled") \     .withColumn("order_date", col("order_ts").cast("date")) \     .withColumn("amount_usd", col("amount") * col("fx_rate"))  gold = silver.groupBy("region", "order_date") \     .agg(         count("order_id").alias("order_count"),         sum("amount_usd").alias("total_revenue")     )  # 3. Write to Warehouse for Power BI consumption gold.write.mode("overwrite").synapsesql("analytics_warehouse.dbo.daily_revenue")

The Lakehouse owns the raw and transformed data. Spark does the heavy compute. The Warehouse serves the final tables to downstream consumers with T-SQL access, caching, and fine-grained security.

The alternative—writing gold tables to the Lakehouse and having Power BI connect via the SQL analytics endpoint—also works. But the SQL analytics endpoint has a metadata sync delay after Spark writes new data. The Warehouse table is immediately consistent after the COPY INTO completes. If your reporting needs to reflect the latest pipeline run without a sync lag, the Warehouse path is more reliable.

Cross-database queries: the glue between them

Once you have data in both a Lakehouse and a Warehouse in the same workspace, you can query across them using T-SQL cross-database queries from the Warehouse:

SELECT w.customer_id, w.total_revenue, l.customer_segment FROM analytics_warehouse.dbo.daily_revenue AS w JOIN my_lakehouse.dbo.customer_dim AS l     ON w.customer_id = l.customer_id

This means your Warehouse doesn’t need to contain all the data. It can hold the curated aggregations while joining against dimension tables that live in the Lakehouse. No data movement. No duplication. The SQL engine resolves both sources through OneLake.

Performance notes from the field

A few observations from real workloads:

Reads are faster than you expect. The TDS protocol connection to the Warehouse SQL engine is efficient. For typical analytical queries returning thousands to low millions of rows, the synapsesql() read is competitive with reading Delta files directly, especially when the Warehouse has statistics and result set caching enabled.

Writes are slower than Lakehouse writes. The two-phase staging + COPY INTO process adds overhead versus a direct df.write.format("delta").save() to Lakehouse tables. For a DataFrame with 10 million rows, expect the Warehouse write to take 2-5x longer than an equivalent Lakehouse Delta write. This is the tradeoff for getting immediate T-SQL access with full Warehouse capabilities.

Use parallel reads for large tables. The default single-partition read will bottleneck. Set numPartitions to match your Spark cluster’s available cores for large reads. The performance improvement is often 4-8x.

Proactive and incremental statistics refresh. As of January 2026, Fabric Data Warehouse supports proactive statistics refresh and incremental statistics. This means the query optimizer keeps statistics up to date automatically. Your synapsesql() reads benefit from better query plans without manual UPDATE STATISTICS calls.

The honest summary

The Spark connector for Fabric Data Warehouse is a well-designed bridge between two systems that many teams use side by side. It makes the read path simple and the write path possible without leaving your Spark notebook.

It is not a replacement for writing to Lakehouse Delta tables. It is an additional output path for when your downstream consumers need T-SQL, fine-grained security, result set caching, or immediate consistency. Use it when the Warehouse is the right serving layer. Don’t use it when Lakehouse is sufficient.

The biggest wins come from combining both: Spark for compute, Lakehouse for storage, Warehouse for serving. The connector is the plumbing that makes that architecture work without data pipelines in between.

If you’re heading to FabCon Atlanta (March 16-20, 2026), both the Data Warehouse and Data Engineering teams will be there. It’s a good place to pressure-test your architecture and see what’s coming next.

This post was written with help from anthropic/claude-opus-4-6

What SQL database in Fabric actually means for your Spark pipelines

There is a particular kind of excitement that sweeps through data engineering teams when Microsoft announces a new database option. It is the same mixture of curiosity and low-grade dread you might feel upon learning that your neighborhood is getting a new highway interchange. Useful, probably. Disruptive, definitely. Someone is going to have to figure out the on-ramps.

SQL database in Fabric went generally available in November 2025. Built on the same SQL Database Engine that powers Azure SQL Database, it is the first fully SaaS-native operational database living inside Microsoft Fabric. More than 50,000 SQL databases were created during the preview period alone. If you spend your days writing Spark notebooks, building lakehouses, and tending ETL pipelines, this thing will change how you work whether you plan for it or not.

Here is what you need to know, what you should actually do about it, and where the potholes are hiding.

Your operational data now lands in OneLake automatically

The headline feature for Spark teams is automatic replication to OneLake. When data gets written to a SQL database in Fabric, it mirrors to OneLake as Delta tables in near real-time. No pipelines. No connectors. No orchestration jobs that fail silently at 2 AM and ruin your Monday.

This sounds almost too convenient, and in some ways it is. The mirrored Delta tables arrive in an open format your Spark notebooks can read directly. You point a DataFrame at the mirrored location, run your transformations, and push results to your gold layer without ever having written an ingestion pipeline for that source.

If your team currently runs nightly batch loads from Azure SQL or SQL Server into a lakehouse, this is a real shift. That entire category of extract-and-load work can shrink or vanish. But “can” is doing heavy lifting in that sentence, and we need to talk about why.

How this changes daily Spark development

The practical impact shows up in a few specific places.

Reading operational data gets simpler. Instead of maintaining JDBC connections, managing credential rotation, and writing Spark code to pull from SQL, you read Delta tables from OneLake. The data is already there. Your Spark cluster does not need network access to the SQL database itself. One fewer firewall rule, one fewer connection string in your key vault, one fewer thing that breaks when someone rotates a password on a Friday afternoon.

Schema changes arrive faster than you can react. With batch ETL, you had a buffer. The pipeline would fail, someone would get an alert, and you had time to adapt your downstream notebooks. Near real-time mirroring removes that cushion. A column rename or type change in the operational database shows up in your Delta tables within seconds to minutes. If your Spark jobs reference columns by name (they do), you need schema evolution handling that most teams have not built yet.

Think about what happens when a developer on the application side renames customer_id to cust_id on a Wednesday afternoon. Your batch pipeline would have failed that night, you would have caught it Thursday morning, and the fix would be a one-line column alias. With mirroring, your running Spark job gets a AnalysisException: cannot resolve 'customer_id' mid-stream. The fix is the same, but the timing is worse.

SQL users can now query your lakehouse data directly. SQL database in Fabric supports OPENROWSET and External Tables for querying OneLake data in CSV, Parquet, and JSON formats. Your SQL-writing colleagues can query lakehouse data without Spark. That sounds like a collaboration win until a SQL user runs a full table scan on your carefully partitioned Parquet files and you both learn something new about capacity throttling.

Establish clear ownership of shared datasets early. Document which OneLake paths are read-safe for SQL access and which ones carry performance risk.

The SQL Analytics Endpoint changes reporting paths. Every SQL database in Fabric gets a SQL Analytics Endpoint that sits on top of the mirrored data. Power BI can hit this endpoint with Direct Lake, which means your Spark team might no longer be in the critical path for building reporting datasets. If you have spent months building and maintaining a medallion architecture primarily to serve Power BI, parts of that effort become optional. Whether that feels like relief or irrelevance depends on your org chart.

Migration risks worth planning for

Before you start ripping out pipelines, here are the things that deserve a red flag on your project board.

Capacity billing is shared, and the math is unforgiving. SQL database in Fabric consumes the same Fabric capacity as your Spark jobs, warehouses, and Power BI refreshes. If someone provisions a heavily used SQL database on the same capacity where your Spark notebooks run, you will feel it. Fabric capacity is a zero-sum game. The new player at the table did not bring extra chips.

Run a two-week trial on a dedicated capacity before mixing SQL database workloads with existing Spark production. Use the Microsoft Fabric Capacity Metrics App to understand exactly how many CUs the database consumes at rest and under load.

Near real-time is not real-time, and the gap varies. The mirroring latency depends on transaction volume and capacity pressure. Under light load, changes appear in seconds. Under heavy load on a congested capacity, you might see minutes of lag. If your Spark pipelines assume data completeness at a specific watermark, you need to measure actual replication lag under realistic conditions. A simple row-count comparison between the SQL database and the mirrored Delta table, run every five minutes for a week, will tell you more than any documentation.

Security boundaries do not mirror perfectly. SQL database in Fabric supports Microsoft Entra authentication, row-level security, customer-managed keys, and SQL auditing (in preview). Your lakehouse uses OneLake RBAC, workspace roles, and Spark-level access controls. The mirrored data inherits some but not all of these boundaries. Row-level security in the SQL database, for instance, does not automatically apply to the mirrored Delta table in OneLake. If you have sensitive columns, verify the access controls on the mirror before your entire data team has read access.

Vendor lock-in compounds quietly. Every pipeline you remove and every JDBC connector you delete makes you more dependent on Fabric-internal mechanisms. If you later need to run Spark on Databricks, on EMR, or on bare-metal clusters, your data ingestion path disappears. This is not a reason to avoid the feature, but it is a reason to document what you replaced and keep a migration playbook somewhere that is not a Confluence page nobody remembers exists.

A rollout checklist for Spark teams

If you are ready to start integrating SQL database in Fabric into your data engineering stack, here is a practical sequence.

Inventory your SQL-sourced pipelines. List every Spark job that reads from Azure SQL, SQL Server, or any SQL-based source via JDBC, linked services, or copy activities. Note the refresh frequency, data volume, and downstream dependencies. If you cannot produce this list in under an hour, that is itself a useful finding.
Provision a SQL database in Fabric on a non-production capacity. Do not test this on production. Capacity contention is real, and you want to understand billing impact before it appears on someone else’s finance report.
Mirror a single non-critical table and validate. Pick a reference table, something small and stable. Confirm the Delta table lands in OneLake, check the schema, verify column types, and read it from a Spark notebook. Compare row counts and checksums against the source.
Measure replication lag under real load. Insert, update, and delete rows in the SQL database and time how quickly those changes appear in the mirrored Delta table. Run this test during your normal capacity utilization window, not during off-hours when capacity is idle and results are misleadingly fast.
Test schema evolution deliberately. Add a column. Rename a column. Change a data type. Observe what happens to the mirrored Delta table and to any Spark jobs reading it. Build your error handling before this surprises you in production.
Audit security boundaries on the mirror. Check whether row-level security, column masking, or other access controls in the SQL database are reflected in the mirrored OneLake data. Document gaps and decide whether they are acceptable for your data classification. If they are not, add a data masking step between the mirror and your Spark consumers.
Run a cost comparison over two weeks. Compare the Fabric capacity consumption of the SQL database plus mirroring against your current pipeline compute costs. Include the engineering time saved, but be honest. “We saved two hours a month of pipeline maintenance” is a real number. “We saved countless engineering hours” is not.
Deprecate one pipeline as a pilot. Pick your simplest SQL-sourced pipeline, redirect the downstream Spark job to read from the mirrored Delta table, and run both paths in parallel for at least two sprints. When you are confident, decommission the old pipeline and update your runbooks.

Vector search: a side door into AI workloads

SQL database in Fabric supports the native vector data type and vector indexing. This opens up retrieval-augmented generation (RAG) patterns directly inside the database, without adding a separate vector store to your architecture.

For Spark teams building ML pipelines or feeding large language models, the value is in co-location. You can store embeddings alongside your operational data, run similarity searches in SQL, and then access the same data from Spark for model training or batch inference. A product catalog with embeddings stored as vectors in SQL can serve both a real-time search API and a nightly Spark training job without data duplication.

This will not replace Pinecone or Weaviate for teams running high-throughput similarity search at scale. But for teams running modest-scale RAG or semantic search against operational data, it removes one service from the architecture and one deployment from the on-call rotation. That is not nothing.

What to expect next

Microsoft has made it clear that SQL database in Fabric is part of a longer play to bring operational data fully into the Fabric ecosystem. The integration with Copilot in the Query Editor, support for Terraform and Fabric CLI automation, and the first-ever SQLCon conference co-located with FabCon Atlanta in March 2026 all point the same direction: the wall between transactional and analytical workloads is getting thinner.

For Spark data engineering teams, the right move is not to panic and rewrite everything. It is to understand the mechanics, run a controlled test, and make deliberate decisions about which pipelines to retire and which to keep. The highway interchange is open. You just need to figure out your on-ramp.

This post was written with help from Opus 4.6

The Best Thing That Ever Happened to Your Spark Pipeline Is a SQL Database

Here’s a counterintuitive claim: the most important announcement for Fabric Spark teams in early 2026 has nothing to do with Spark.

It’s a SQL database.

Specifically, it’s the rapid adoption of SQL database in Microsoft Fabric—a fully managed, SaaS-native transactional database that went GA in November 2025 and has been quietly reshaping how production data flows into lakehouse architectures ever since. If you’re a data engineer running Spark workloads in Fabric, this changes more than you think.

The ETL Pipeline You Can Delete

Most Spark data engineers have a familiar pain point: getting operational data from transactional systems into the lakehouse. You build ingestion pipelines. You schedule nightly batch loads. You wrestle with CDC (change data capture) configurations, watermark columns, and retry logic. You maintain all of it, forever.

SQL database in Fabric eliminates that entire layer.

When data lands in a Fabric SQL database, it’s automatically replicated to OneLake as Delta tables in near real-time. No pipelines. No Spark ingestion jobs. No orchestration. The data just appears, already in the open Delta format your notebooks and Spark jobs expect.

This isn’t a minor convenience—it’s an architectural shift. Every ingestion pipeline you don’t write is a pipeline you don’t debug at 2 AM.

What This Actually Looks Like in Practice

Let’s say you’re building an analytics layer on top of an operational SaaS application. Today, your architecture probably looks something like this:

Application writes to Azure SQL or Cosmos DB
ADF or Spark job pulls data on a schedule
Data lands in a lakehouse as Delta tables
Downstream Spark jobs transform and aggregate

With SQL database in Fabric, steps 2 and 3 vanish. Your application writes directly to the Fabric SQL database, and the mirrored Delta tables are immediately available for Spark processing. Here’s what your downstream notebook looks like now:

# Read operational data directly — no ingestion pipeline needed # The SQL database auto-mirrors to OneLake as Delta tables orders_df = spark.read.format("delta").load(     "abfss://your-workspace@onelake.dfs.fabric.microsoft.com/your-sqldb.SQLDatabase/dbo.Orders" )  # Your transformation logic stays the same from pyspark.sql import functions as F  daily_revenue = (     orders_df     .filter(F.col("order_date") >= F.date_sub(F.current_date(), 7))     .groupBy("product_category")     .agg(         F.sum("total_amount").alias("revenue"),         F.countDistinct("customer_id").alias("unique_customers")     )     .orderBy(F.desc("revenue")) )  daily_revenue.write.format("delta").mode("overwrite").saveAsTable("gold.weekly_revenue_by_category")

The Spark code doesn’t change. What changes is everything upstream of it.

The Migration Risk Nobody’s Talking About

Here’s where it gets interesting—and where Malcolm Gladwell would lean forward in his chair. The biggest risk of SQL database in Fabric isn’t technical. It’s organizational.

Teams that have invested heavily in ingestion infrastructure will face a classic innovator’s dilemma: the new path is simpler, but the old path already works. The temptation is to keep running your existing ADF pipelines alongside the new auto-mirroring capability, creating a hybrid architecture that’s worse than either approach alone.

My recommendation: don’t hybrid. Pick a workload, migrate it end-to-end, and measure. Here’s a concrete rollout checklist:

Identify a candidate workload — Look for Spark jobs whose primary purpose is pulling data from a SQL source into Delta tables. These are your highest-value migration targets.
Provision a Fabric SQL database — It takes seconds. You provide a name; Fabric handles the rest. Autoscaling and auto-pause are built in.
Redirect your application writes — Point your operational application to the new Fabric SQL database. The engine is the same SQL Database Engine as Azure SQL, so T-SQL compatibility is high.
Validate the Delta mirror — Confirm that your data is appearing in OneLake. Check schema fidelity, latency, and row counts:

# In your Spark notebook, validate the mirrored data spark.sql("""     SELECT COUNT(*) as row_count,            MAX(modified_date) as latest_record,            MIN(modified_date) as earliest_record     FROM your_sqldb.dbo.Orders """).show()

Decommission the ingestion pipeline — Once validated, turn off the ADF or Spark ingestion job. Don’t just disable it—delete it. Zombie pipelines are how technical debt accumulates.
Update your monitoring — Your existing data quality checks should still work since the Delta tables have the same schema. But update your alerting to watch for mirror latency instead of pipeline run failures.

The AI Angle Matters for Spark Teams Too

There’s a second dimension to this announcement that Spark engineers should pay attention to: the native vector data type in SQL database supports semantic search and RAG patterns directly in the transactional layer.

Why does that matter for Spark teams? Because it means your embedding pipelines can write vectors back to the same database your application reads from—closing the loop between batch ML processing in Spark and real-time serving in SQL. Instead of maintaining a separate vector store (Pinecone, Qdrant, etc.), you use the same SQL database that’s already mirrored into your lakehouse.

This is the kind of architectural simplification that compounds over time. Fewer systems means fewer failure modes, fewer credentials to manage, and fewer things to explain to your successor.

The Rollout Checklist

This week: Inventory your existing ingestion pipelines. How many just move data from SQL sources to Delta?
This sprint: Provision a Fabric SQL database and test the auto-mirror with a non-critical workload.
This quarter: Migrate your highest-volume ingestion pipeline and measure CU savings.
Track: Mirror latency, CU consumption before/after, and pipeline maintenance hours eliminated.

SQL database in Fabric went GA in November 2025 with enterprise features including row-level security, customer-managed keys, and private endpoints. For the full list of GA capabilities, see the official announcement. To understand how this fits into the broader Microsoft database + Fabric integration strategy, read Microsoft Databases and Microsoft Fabric: Your unified and AI-powered data estate. For Spark-specific Delta Lake concepts, the Delta Lake documentation remains the authoritative reference.

The best thing about this announcement isn’t any single feature. It’s that it makes your Spark architecture simpler by removing the parts that shouldn’t have been there in the first place.

This post was written with help from Claude Opus 4.6

What AI Functions already gave you (and where it fell short)

ExtractLabel: the schema contract your pipeline needs

The five schema features that actually matter

What this means for your Spark pipelines right now

Migration risks worth thinking about

When to use ExtractLabel vs. other approaches

The bottom line

Share this:

Why Spark teams should care about a Dataflow feature

What Recent data changes under pressure

What this feature doesn’t rescue you from

Preview discipline: run this like a survival checklist

The operating model for Spark leads

Share this:

What Eventstreams SQL actually fixes

The merge logic decision you cannot avoid

Checkpoints: boring, critical, and often broken by accident

The small files problem is not glamorous, but it will hurt you

Deletes: decide your philosophy before compliance decides for you

Monitoring: minimum viable or maximum regret

A practical starting playbook

Final take

Share this:

What actually shipped

Why Spark engineers should care

The non-negotiable constraints

Practical rollout plan for Spark teams

1) Inventory duplication first

2) Stand up a dedicated execution dataflow

3) Wrap Execute Query behind one notebook utility

4) Add governance checks before scale

5) Monitor capacity from day one

Where this fits (and where it doesn’t)

The organizational shift hiding behind the API

Source notes

Share this:

Signal 1: Get your production code out of notebooks

Signal 2: Direct Lake changes how you write to your lakehouse

Signal 3: CI/CD for Fabric just got real

The deeper pattern

Your rollout checklist for March

Share this:

The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

What the connector actually is

Reading: the part that mostly just works

Writing: the part with surprises

How writes actually work under the hood

Save modes

The timestamp_ntz gotcha

What you can’t write to

Private Link limitations

Time Travel is not supported

When to use Warehouse vs. Lakehouse as your serving layer

A concrete pattern: Spark ETL → Warehouse serving layer

Cross-database queries: the glue between them

Performance notes from the field

The honest summary

Share this:

Your operational data now lands in OneLake automatically

How this changes daily Spark development

Migration risks worth planning for

A rollout checklist for Spark teams

Vector search: a side door into AI workloads

What to expect next

Share this:

The ETL Pipeline You Can Delete

What This Actually Looks Like in Practice

The Migration Risk Nobody’s Talking About

The AI Angle Matters for Spark Teams Too

The Rollout Checklist

Share this: