The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

There’s a connector that ships with every Fabric Spark runtime. It’s pre-installed. It requires no setup. And it lets your Spark notebooks read from—and write to—Fabric Data Warehouse tables as naturally as they read Delta tables from a Lakehouse.

Most Fabric Spark users don’t know it exists. The ones who do often run into the same three or four surprises. Let’s fix both problems.

What the connector actually is

The Spark connector for Fabric Data Warehouse (synapsesql) is a built-in extension to the Spark DataFrame API. It uses the TDS protocol to talk directly to the SQL engine behind your Warehouse (or the SQL analytics endpoint of a Lakehouse). You get read and write access to Warehouse tables from PySpark, Scala Spark, or Spark SQL.

One line of code to read:

import com.microsoft.spark.fabric
from com.microsoft.spark.fabric.Constants import Constants

df = spark.read.synapsesql("my_warehouse.dbo.sales_fact")

One line to write:

df.write.mode("append").synapsesql("my_warehouse.dbo.sales_fact")

No connection strings. No passwords. No JDBC driver management. Authentication flows through Microsoft Entra—same identity you’re logged into your Fabric workspace with. The connector resolves the SQL endpoint automatically based on workspace context.

That’s the happy path. Now let’s talk about what actually happens when you use it.

Reading: the part that mostly just works

Reading from a Warehouse table into a Spark DataFrame is the connector’s strength. The synapsesql() call supports the full three-part naming convention: warehouse_name.schema_name.table_or_view_name. It works for tables and views, including views with joins across schemas.

A few things that are genuinely useful:

Predicate pushdown works. When you chain .filter() or .limit() onto your DataFrame, the connector pushes those constraints to the SQL engine. You’re not pulling the full table into Spark memory and then filtering—the SQL engine handles the filter and sends back the subset. This matters when your Warehouse tables have hundreds of millions of rows and you only need a time-sliced sample.

df = spark.read.synapsesql("my_warehouse.dbo.sales_fact") \
    .filter("order_date >= '2026-01-01'") \
    .select("order_id", "customer_id", "amount")

Cross-workspace reads work. If your Warehouse lives in a different workspace than your notebook’s attached Lakehouse, you pass the workspace ID:

df = spark.read \
    .option(Constants.WorkspaceId, "<target-workspace-id>") \
    .option(Constants.DatawarehouseId, "<warehouse-item-id>") \
    .synapsesql("my_warehouse.dbo.sales_fact")

This is genuinely powerful for hub-and-spoke architectures where your curated Warehouse sits in a production workspace and your data science notebooks live in a sandbox workspace.

Parallel reads are available. For large tables, you can partition the read across multiple Spark tasks, similar to spark.read.jdbc:

df = spark.read \
    .option("partitionColumn", "order_id") \
    .option("lowerBound", 1) \
    .option("upperBound", 10000000) \
    .option("numPartitions", 8) \
    .synapsesql("my_warehouse.dbo.sales_fact")

This splits the query into eight parallel reads, each fetching a range of order_id. Without this, you get a single-threaded read that will bottleneck on large tables.
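To make the range split concrete, here is a minimal sketch of how a numeric range could be divided into contiguous slices, one per Spark task. It follows the stride approach that spark.read.jdbc is known for. Whether the connector computes its ranges the same way is an assumption, and partition_bounds is a hypothetical helper, not part of any Fabric API.

```python
def partition_bounds(lower: int, upper: int, num_partitions: int) -> list[tuple[int, int]]:
    """Split [lower, upper) into contiguous half-open ranges, one per partition."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")

    stride = (upper - lower) // num_partitions
    if stride == 0:
        # Fewer distinct values than partitions: one range is all we can do.
        return [(lower, upper)]

    bounds = []
    start = lower
    for i in range(num_partitions):
        # The last slice absorbs the remainder so the ranges cover everything.
        end = upper if i == num_partitions - 1 else start + stride
        bounds.append((start, end))
        start = end
    return bounds


# Same numbers as the read above: order_id from 1 to 10,000,000 in 8 slices.
ranges = partition_bounds(1, 10_000_000, 8)
for start, end in ranges:
    print(f"order_id >= {start} AND order_id < {end}")
```

Even slices only help if order_id is spread evenly. With a skewed key, one task ends up doing most of the work, so pick a partition column whose values are reasonably uniform.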

Security models pass through. If your Warehouse has object-level security (OLS), row-level security (RLS), or column-level security (CLS), those policies are enforced when Spark reads the data. Your notebook sees exactly what your identity is authorized to see. This is a meaningful difference from reading Delta files directly via OneLake, where security operates at the workspace or folder level.

Custom T-SQL queries work too. You’re not limited to reading tables—you can pass arbitrary T-SQL:

df = spark.read \
    .option(Constants.DatabaseName, "my_warehouse") \
    .synapsesql("SELECT TOP 1000 * FROM dbo.sales_fact WHERE region = 'WEST'")

This is handy for complex aggregations or when you want the SQL engine to do the heavy lifting before data enters Spark.

Writing: the part with surprises

Write support for the Spark-to-Warehouse connector became generally available with Runtime 1.3. It works, and it solves a real architectural problem—but it has mechanics you need to understand.

How writes actually work under the hood

The connector uses a two-phase process:

Stage: Spark writes your DataFrame to intermediate Parquet files in a staging location.
Load: The connector issues a COPY INTO command, telling the Warehouse SQL engine to ingest from the staged files.

This is the same COPY INTO pattern that powers bulk ingestion into Fabric Data Warehouse generally. It’s optimized for throughput. It is not optimized for latency on small writes.

If you’re writing a DataFrame with 50 rows, the overhead of staging files and issuing COPY INTO means the write takes materially longer than you’d expect. For small, frequent writes, this connector is not the right tool. Use T-SQL INSERT statements through a SQL connection instead.

For batch writes of thousands to millions of rows, the connector performs well. The COPY INTO path is what the Warehouse was designed for.

Save modes

The connector supports four save modes:

errorifexists (default): Fails if the table already exists.
ignore: Silently skips the write if the table exists.
overwrite: Drops and recreates the table with new data.
append: Adds rows to the existing table.

df.write.mode("overwrite").synapsesql("my_warehouse.dbo.daily_aggregates")

A common pattern: Spark computes daily aggregations from Lakehouse Delta tables, then writes the results to a Warehouse table that Power BI reports connect to. The Warehouse’s result set caching (now generally available as of January 2026) means subsequent Power BI refreshes hit cache instead of recomputing.

The timestamp_ntz gotcha

This is the single most common error people hit when writing to a Warehouse from Spark.

If your DataFrame contains timestamp_ntz (timestamp without time zone) columns, the write will fail. Fabric Data Warehouse expects time-zone-aware timestamps. The fix is a cast before you write:

from pyspark.sql.functions import col

# Cast every timestamp_ntz column to a regular timestamp before writing.
for name, dtype in df.dtypes:
    if dtype == "timestamp_ntz":
        df = df.withColumn(name, col(name).cast("timestamp"))

df.write.mode("append").synapsesql("my_warehouse.dbo.target_table")

This is not documented prominently enough. If you see a Py4JJavaError during write that mentions type conversion, timestamps are the first thing to check.

What you can’t write to

The connector writes to Warehouse tables only. You cannot write to the SQL analytics endpoint of a Lakehouse—it’s read-only. If you try, you’ll get an error. This seems obvious but trips people up because the same synapsesql() method handles both reads from Warehouses and Lakehouse SQL endpoints.

Private Link limitations

If Private Link is enabled at the workspace level, both read and write operations through the connector are unsupported. If Private Link is enabled at the tenant level only, writes are unsupported but reads still work. This is a significant limitation for security-conscious deployments. Check your network configuration before building pipelines that depend on this connector.

Time Travel is not supported

Fabric Data Warehouse now supports Time Travel queries. However, the Spark connector does not pass through Time Travel syntax. If you need to query a table as of a specific point in time, you’ll need to use a T-SQL connection directly rather than the synapsesql() method.

When to use Warehouse vs. Lakehouse as your serving layer

This is the architectural question that the connector’s existence forces you to answer. You’ve got data in your Lakehouse. Spark has transformed it. Now where does it go?

Use Lakehouse Delta tables when:

Your consumers are other Spark notebooks or Spark-based ML pipelines.
You need schema evolution flexibility (Delta’s schema merge).
You want to use OPTIMIZE, VACUUM, and Z-ORDER for table maintenance.
Your data scientists need direct file access through OneLake APIs.

Use Warehouse tables when:

Your primary consumers are Power BI reports or T-SQL analysts.
You need the Warehouse’s result set caching for repeated query patterns.
You need fine-grained security (RLS, CLS, OLS) that passes through to all consumers.
You want to use T-SQL stored procedures, views, and MERGE statements for downstream transformations.
You need cross-database queries that join Warehouse tables with Lakehouse tables or other Warehouse tables.

Use both when:

Spark processes and stores data in the Lakehouse (bronze → silver → gold medallion layers), then the connector writes final aggregations or serving tables to the Warehouse.
The Warehouse serves as the “last mile” between your data engineering work and your business intelligence layer.

The January 2026 GA of MERGE in Fabric Data Warehouse makes the “write to Warehouse” pattern significantly more useful. You can now do incremental upserts: Spark writes a staging table, then a T-SQL MERGE reconciles it with the target. This is a common pattern in data warehousing that was previously awkward in Fabric.
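The staging-then-MERGE step can be sketched as a small Python helper that builds the T-SQL for you. This is a sketch under assumptions: build_merge_sql is a hypothetical function, the staging table name is made up, and the statement uses generic T-SQL MERGE syntax, so check the Fabric Data Warehouse documentation for which clauses are supported before relying on it.

```python
def build_merge_sql(target: str, staging: str, keys: list[str], columns: list[str]) -> str:
    """Return a T-SQL MERGE that upserts rows from a staging table into a target."""
    if not keys:
        raise ValueError("at least one key column is required")
    update_cols = [c for c in columns if c not in keys]
    if not update_cols:
        raise ValueError("at least one non-key column is required")

    on_clause = " AND ".join(f"t.[{k}] = s.[{k}]" for k in keys)
    set_clause = ", ".join(f"t.[{c}] = s.[{c}]" for c in update_cols)
    insert_cols = ", ".join(f"[{c}]" for c in columns)
    insert_vals = ", ".join(f"s.[{c}]" for c in columns)

    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {staging} AS s\n"
        f"    ON {on_clause}\n"
        f"WHEN MATCHED THEN\n"
        f"    UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN\n"
        f"    INSERT ({insert_cols}) VALUES ({insert_vals});"
    )


# Hypothetical table names; Spark would have written the staging table first.
merge_sql = build_merge_sql(
    target="dbo.daily_revenue",
    staging="dbo.daily_revenue_staging",
    keys=["region", "order_date"],
    columns=["region", "order_date", "order_count", "total_revenue"],
)
print(merge_sql)
```

The generated statement runs in the Warehouse, for example from a pipeline step or a T-SQL connection, after the Spark write to the staging table completes.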

A concrete pattern: Spark ETL → Warehouse serving layer

Here’s the pattern I see working well in production:

import com.microsoft.spark.fabric  # needed for synapsesql() in PySpark
from pyspark.sql.functions import col, count, sum as spark_sum

# 1. Read from Lakehouse Delta tables (Spark native)
bronze = spark.read.format("delta").load("Tables/raw_orders")

# 2. Transform in Spark
silver = bronze.filter(col("status") != "cancelled") \
    .withColumn("order_date", col("order_ts").cast("date")) \
    .withColumn("amount_usd", col("amount") * col("fx_rate"))

gold = silver.groupBy("region", "order_date") \
    .agg(
        count("order_id").alias("order_count"),
        spark_sum("amount_usd").alias("total_revenue")
    )

# 3. Write to Warehouse for Power BI consumption
gold.write.mode("overwrite").synapsesql("analytics_warehouse.dbo.daily_revenue")

The Lakehouse owns the raw and transformed data. Spark does the heavy compute. The Warehouse serves the final tables to downstream consumers with T-SQL access, caching, and fine-grained security.

The alternative—writing gold tables to the Lakehouse and having Power BI connect via the SQL analytics endpoint—also works. But the SQL analytics endpoint has a metadata sync delay after Spark writes new data. The Warehouse table is immediately consistent after the COPY INTO completes. If your reporting needs to reflect the latest pipeline run without a sync lag, the Warehouse path is more reliable.

Cross-database queries: the glue between them

Once you have data in both a Lakehouse and a Warehouse in the same workspace, you can query across them using T-SQL cross-database queries from the Warehouse:

SELECT w.customer_id, w.total_revenue, l.customer_segment
FROM analytics_warehouse.dbo.daily_revenue AS w
JOIN my_lakehouse.dbo.customer_dim AS l
    ON w.customer_id = l.customer_id

This means your Warehouse doesn’t need to contain all the data. It can hold the curated aggregations while joining against dimension tables that live in the Lakehouse. No data movement. No duplication. The SQL engine resolves both sources through OneLake.

Performance notes from the field

A few observations from real workloads:

Reads are faster than you expect. The TDS protocol connection to the Warehouse SQL engine is efficient. For typical analytical queries returning thousands to low millions of rows, the synapsesql() read is competitive with reading Delta files directly, especially when the Warehouse has statistics and result set caching enabled.

Writes are slower than Lakehouse writes. The two-phase staging + COPY INTO process adds overhead versus a direct df.write.format("delta").save() to Lakehouse tables. For a DataFrame with 10 million rows, expect the Warehouse write to take 2-5x longer than an equivalent Lakehouse Delta write. This is the tradeoff for getting immediate T-SQL access with full Warehouse capabilities.

Use parallel reads for large tables. The default single-partition read will bottleneck. Set numPartitions to match your Spark cluster’s available cores for large reads. The performance improvement is often 4-8x.

Proactive and incremental statistics refresh. As of January 2026, Fabric Data Warehouse supports proactive statistics refresh and incremental statistics. This means the query optimizer keeps statistics up to date automatically. Your synapsesql() reads benefit from better query plans without manual UPDATE STATISTICS calls.

The honest summary

The Spark connector for Fabric Data Warehouse is a well-designed bridge between two systems that many teams use side by side. It makes the read path simple and the write path possible without leaving your Spark notebook.

It is not a replacement for writing to Lakehouse Delta tables. It is an additional output path for when your downstream consumers need T-SQL, fine-grained security, result set caching, or immediate consistency. Use it when the Warehouse is the right serving layer. Don’t use it when Lakehouse is sufficient.

The biggest wins come from combining both: Spark for compute, Lakehouse for storage, Warehouse for serving. The connector is the plumbing that makes that architecture work without data pipelines in between.

If you’re heading to FabCon Atlanta (March 16-20, 2026), both the Data Warehouse and Data Engineering teams will be there. It’s a good place to pressure-test your architecture and see what’s coming next.

This post was written with help from anthropic/claude-opus-4-6

Fabric Spark billing just got clearer. Here’s how to make the most of it.

Somewhere in a shared Teams channel, a Fabric capacity admin is looking at the Capacity Metrics app and noticing Spark consumption is down 15% overnight. Same notebooks. Same schedules. Same engineers shipping code with the same amount of caffeine.

A quick thread later, the answer is clear: nothing is wrong. Microsoft introduced new billing operations, and AI usage is now visible in its own category.

That’s not a cost increase. That’s better instrumentation.

What actually changed

On February 13, 2026, Microsoft announced two new billing operations for Fabric: AI Functions and AI Services.

Previously, AI-related usage in notebooks was grouped under Spark operations. Calls made through fabric.functions, Azure OpenAI REST API, the Python SDK, and SynapseML were all reported in Spark. Text Analytics and Azure AI Translator calls from notebooks were also reflected there.

Now those costs are separated:

AI Functions covers Fabric AI function calls and Azure OpenAI Service usage in notebooks and Dataflows Gen2.
AI Services covers Text Analytics and Azure AI Translator usage from notebooks.

Both are billed under the Copilot and AI Capacity Usage CU meter.

Important: consumption rates did not change. You pay the same for the same work. What changed is visibility.

Why this reporting update is a win for operators

If you’ve ever tried to explain Spark trends that include hidden AI consumption, this update helps immediately.

Picture an F64 capacity. You historically allocated 70% of CU budget to Spark because that’s what Capacity Metrics showed. But Spark previously included AI consumption, so the category was doing two jobs at once.

Now Spark and AI can each tell their own story. That’s useful for:

more accurate workload attribution
cleaner alerting by operation type
better planning conversations with finance and platform teams

In other words: same total spend, sharper signal.

The migration checklist

There’s nothing to deploy and no code changes required. The opportunity is operational: update your monitoring and planning so you can benefit from the new detail right away.

1. Audit your AI function usage

Before the new operations appear in your Metrics app, find AI calls in your codebase. Search notebooks for:

fabric.functions calls
Azure OpenAI REST API calls (look for /openai/deployments/)
openai Python SDK usage within Fabric notebooks
SynapseML OpenAI transformers
Text Analytics API calls
Azure AI Translator calls

If there are no hits, this billing split likely won’t affect your current workloads. If there are many hits (common in mature notebook estates), estimate volume now so your post-change analysis is faster.
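One way to run that search is a small script over exported notebook source. This is a rough sketch: the regular expressions are starting assumptions, not an official list, and count_ai_calls is a hypothetical helper. Text Analytics and Azure AI Translator are left out because the right search strings depend on how your notebooks call them.

```python
import re

# Assumed search patterns; extend them to match how your team calls these services.
AI_PATTERNS = {
    "fabric.functions": re.compile(r"fabric\.functions"),
    "Azure OpenAI REST": re.compile(r"/openai/deployments/"),
    "openai SDK": re.compile(r"^\s*(?:import openai\b|from openai\b)", re.MULTILINE),
    "SynapseML OpenAI": re.compile(r"synapse\.ml\.\w+\.openai", re.IGNORECASE),
}


def count_ai_calls(source: str) -> dict[str, int]:
    """Count pattern hits in one notebook's source text."""
    return {name: len(pattern.findall(source)) for name, pattern in AI_PATTERNS.items()}


# A made-up notebook snippet, used only to show the output shape.
sample = '''
import openai
from synapse.ml.services.openai import OpenAIChatCompletion
url = f"{endpoint}/openai/deployments/{deployment}/chat/completions"
'''

hits = count_ai_calls(sample)
for name, n in hits.items():
    print(f"{name}: {n}")
```

Hit counts tell you where AI calls live, not how much they cost. Pair them with run frequency to get a usable volume estimate.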

2. Baseline your current Spark consumption

Export the last 30 days of Capacity Metrics data for Spark operations and save it.

This is your before-state. After rollout, validate that total consumption (Spark + new AI operations) aligns with historical Spark totals. If it aligns, you’ve confirmed a reporting change. If not, you have a clear starting point for investigation.

3. Adjust your alerting thresholds

If you monitor Spark CU consumption via Capacity Metrics, Azure Monitor, or custom API polling, update thresholds after the split.

Recommended approach:

take your current Spark threshold
subtract estimated AI consumption from step 1
set that as the revised Spark threshold
add a separate alert for the Copilot and AI meter

If AI estimates are still rough, start with a conservative threshold and tune after a few weeks of separated data.
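The arithmetic above fits in a few lines. In this sketch the CU figures are invented, and the headroom factor on the AI alert is an assumption standing in for "conservative"; tune both against your own data.

```python
def revised_spark_threshold(current_threshold_cu: float, estimated_ai_cu: float) -> float:
    """Subtract the estimated AI share from the existing Spark alert threshold."""
    if not 0 <= estimated_ai_cu <= current_threshold_cu:
        raise ValueError("estimated AI consumption must be between 0 and the current threshold")
    return current_threshold_cu - estimated_ai_cu


def ai_alert_threshold(estimated_ai_cu: float, headroom: float = 0.5) -> float:
    """Alert level for the Copilot and AI meter, padded while estimates are rough."""
    if estimated_ai_cu < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    return estimated_ai_cu * (1 + headroom)


# Invented numbers: a 1,000 CU Spark alert, with roughly 150 CU of that being AI calls.
spark_threshold = revised_spark_threshold(1000.0, 150.0)
ai_threshold = ai_alert_threshold(150.0)
print(spark_threshold, ai_threshold)
```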

4. Update your capacity planning models

Add a dedicated row for AI consumption in any spreadsheet, Power BI report, or planning document that allocates CU budget by operation type.

The Copilot and AI Capacity Usage CU meter already existed for Copilot scenarios, but this may be the first time many Spark-first teams see meaningful workload usage there. Adding it now makes future reviews easier.

5. Set up a validation window

Choose a date after March 17 (when the new operations start appearing) and compare pre/post totals:

pre-change: Spark total
post-change: Spark + AI Functions + AI Services

Expect close alignment (allowing for normal workload variation and rounding). If variance is more than a few percent, open a support ticket. Microsoft described this as a reporting-only change with no rate modifications.
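The comparison can be sketched as follows. The 3% tolerance is an assumption standing in for "a few percent", the function names are hypothetical, and the CU totals are made up. Use totals from equivalent workload windows.

```python
def split_variance_pct(
    pre_spark_cu: float,
    post_spark_cu: float,
    post_ai_functions_cu: float,
    post_ai_services_cu: float,
) -> float:
    """Percent difference between the post-change total and the pre-change Spark total."""
    if pre_spark_cu <= 0:
        raise ValueError("pre-change Spark total must be positive")
    post_total = post_spark_cu + post_ai_functions_cu + post_ai_services_cu
    return 100.0 * (post_total - pre_spark_cu) / pre_spark_cu


def is_reporting_only(variance_pct: float, tolerance_pct: float = 3.0) -> bool:
    """True when the totals line up closely enough to look like a pure reporting change."""
    return abs(variance_pct) <= tolerance_pct


# Made-up totals for two comparable weeks.
variance = split_variance_pct(10_000.0, 8_500.0, 1_200.0, 250.0)
ok = is_reporting_only(variance)
print(f"variance: {variance:.2f}%, within tolerance: {ok}")
```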

6. Share a quick team note before questions start

One short update prevents a lot of confusion:

“Microsoft is separating AI consumption from Spark billing into dedicated operations. Total cost is unchanged. Spark will appear lower, and Copilot and AI will appear higher. This improves visibility and tracking.”

That gives engineers context and helps finance teams interpret new categories correctly on day one.

Post-rollout checks that keep things clean

Consumption variance check. If post-change totals (Spark + AI Functions + AI Services) differ significantly from pre-change Spark trends, compare equivalent workload windows and rule out schedule, code, or capacity changes.

Expected operation visibility. If you confirmed AI usage in step 1 but AI Functions shows zero, check regional rollout timing from the Fabric blog before escalating.

Why separated AI spend is valuable

This platform-side categorization update gives teams a better lens on where capacity is being used.

Once AI usage is measurable independently, you can answer higher-quality questions:

Which AI workflows are creating the most value per CU?
Which calls are production-critical versus experimental leftovers?
Where should you optimize first for performance and cost?

That is exactly the kind of visibility mature platform teams want.

What this signals about Fabric billing

As Fabric workloads evolve, billing categories will continue to become more descriptive. That’s a good thing. Better category design means better operational decisions.

The admin in that Teams thread got clarity quickly: Spark wasn’t shrinking, observability was improving. Once the team updated dashboards and alerts, they had a more useful capacity model than they had the week before.

That’s the real upgrade here.

This post was written with help from anthropic/claude-opus-4-6

From Demo to Production: ML-Enriched Power BI in Microsoft Fabric

Microsoft published a new end-to-end pattern last week. Train a model inside Fabric. Score it against a governed semantic model. Push predictions straight into Power BI. No data exports. No credential juggling.

The blog post walks through a churn-prediction scenario. Semantic Link pulls data from a governed Power BI semantic model. MLflow tracks experiments and registers models. The PREDICT function runs batch inference in Spark. Real-time endpoints serve predictions through Dataflow Gen2. Everything lives in one workspace, one security context, one OneLake.

It reads well. It demos well.

But demo code is not production code. The gap between “it runs in my notebook” and “it runs every Tuesday at 4 AM without paging anyone” is exactly where Fabric Spark teams bleed time.

This is the checklist for crossing that gap.

Prerequisites that actually matter

The official blog assumes a Fabric-enabled workspace and a published semantic model. That is the starting line. Production is a different race.

Capacity planning comes first. Fabric Spark clusters consume capacity units. A batch scoring job running on an F64 during peak BI refresh hours competes for the same CUs your report viewers need. Run scoring in off-peak windows, or provision a separate capacity for data science workloads. Either way, know your CU ceiling before your first experiment. Discovering your scoring job throttles the CFO’s dashboard refresh is not a conversation you want to have.

Workspace isolation is not optional. Dev, test, prod. Semantic models promoted through deployment pipelines. ML experiments pinned to dev. Registered models promoted to prod only after validation passes. If your team trains models in the same workspace where finance runs their quarterly close dashboard, you are one accidental publish away from explaining why the revenue numbers just changed.

MLflow model signatures must be populated from day one. The PREDICT function requires them. No signature, no batch scoring. This constraint is easy to forget during prototyping and expensive to fix later. Make it a rule: every mlflow.sklearn.log_model call includes an infer_signature output. No exceptions. Write a pre-commit hook if you have to.

Semantic Link: the part most teams underestimate

Semantic Link connects your Power BI semantic model to your Spark notebooks. Call fabric.read_table() and you get governed data. Same measures and definitions your business users see in their reports. The data in your model’s training set matches what shows up in Power BI.

This matters more than it sounds.

Every analytics team that has been around long enough has a story about metric inconsistency. “Active customer” means one thing in the DAX model, another thing in the SQL pipeline, and a third thing in the data scientist’s Python notebook. The numbers diverge. Somebody notices. A week of forensic reconciliation follows.

Semantic Link kills that problem at the root. But only if you use it deliberately.

Start with fabric.list_measures(). Audit what DAX measures exist. Understand which ones your model depends on. Then pull data with fabric.read_table() rather than querying lakehouse tables directly. When you need to engineer features beyond what the semantic model provides, document every derivation in a version-controlled notebook. Written down and committed. Not living in someone’s memory or buried in a thread.

Training guardrails worth building

The Fabric blog shows a clean LightGBM training flow with MLflow autologging. That is the happy path. Production needs the unhappy path covered too.

Validate data before training. Check row counts against expected baselines. Check for null spikes in key columns. Check that the class distribution has not shifted beyond your predefined threshold. A model trained on corrupted or stale data produces confident garbage. Confident garbage is worse than no model at all, because people act on it.
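A minimal sketch of those three checks, written against plain numbers so it can sit in front of any training step. The thresholds and column names are placeholders, and validate_training_data is a hypothetical helper; you would compute the inputs from your DataFrame first.

```python
def validate_training_data(
    row_count: int,
    expected_rows: int,
    null_rates: dict[str, float],
    positive_rate: float,
    baseline_positive_rate: float,
    *,
    row_tolerance: float = 0.2,
    max_null_rate: float = 0.05,
    max_class_shift: float = 0.1,
) -> list[str]:
    """Return a list of problems; an empty list means the data passed every check."""
    problems = []

    # Row count should stay within a relative band around the expected baseline.
    if abs(row_count - expected_rows) > row_tolerance * expected_rows:
        problems.append(f"row count {row_count} deviates from expected {expected_rows}")

    # Null spikes in key columns.
    for column, rate in null_rates.items():
        if rate > max_null_rate:
            problems.append(f"null rate {rate:.0%} in column '{column}'")

    # Class distribution shift, measured on the positive class rate.
    if abs(positive_rate - baseline_positive_rate) > max_class_shift:
        problems.append("class distribution shifted beyond the allowed threshold")

    return problems


# Invented snapshot: too few rows and a null spike, but a stable churn rate.
problems = validate_training_data(
    row_count=60_000,
    expected_rows=100_000,
    null_rates={"tenure": 0.01, "monthly_spend": 0.30},
    positive_rate=0.12,
    baseline_positive_rate=0.10,
)
for p in problems:
    print(p)
```

If the list is non-empty, fail the notebook before training starts rather than logging a warning that nobody reads.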

Tag every experiment run. MLflow in Fabric supports custom tags. Use them aggressively. Tag each run with the semantic model version it pulled from, the notebook commit hash, and the data snapshot date. Three months from now, when a stakeholder asks why the model flagged 200 customers as high churn risk and zero of them actually left, you need to reconstruct exactly what happened. Without tags, you are guessing.

Build a champion-challenger gate. Before any new model version reaches production, it must beat the current model on a holdout set from the most recent data. Not any holdout set. The most recent one. Automate this comparison in a validation notebook that runs as a pipeline step before model registration. If the challenger fails to clear the margin you defined upfront, the pipeline halts. No override button. No “let’s just push it and see.” The gate exists to prevent optimism from substituting for evidence.
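The gate itself is small. This sketch assumes a metric where higher is better (AUC, for example) and a margin you fixed before looking at results; the function names and scores are illustrative only.

```python
def challenger_passes(champion_score: float, challenger_score: float, min_margin: float) -> bool:
    """True only if the challenger beats the champion by at least the agreed margin."""
    if min_margin < 0:
        raise ValueError("min_margin must be non-negative")
    return challenger_score - champion_score >= min_margin


def enforce_gate(champion_score: float, challenger_score: float, min_margin: float) -> None:
    """Raise to halt the pipeline step when the challenger does not clear the margin."""
    if not challenger_passes(champion_score, challenger_score, min_margin):
        raise RuntimeError(
            f"challenger {challenger_score:.3f} did not beat champion "
            f"{champion_score:.3f} by {min_margin:.3f}; registration blocked"
        )


# Both scores come from the same holdout set built from the most recent data.
narrow_win = challenger_passes(champion_score=0.82, challenger_score=0.83, min_margin=0.02)
clear_win = challenger_passes(champion_score=0.82, challenger_score=0.86, min_margin=0.02)
print(narrow_win, clear_win)
```

Raising an exception, rather than returning a flag, is what makes the pipeline stop on its own when the validation notebook runs as a step before registration.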

Batch scoring: the PREDICT function in production

Fabric’s PREDICT function is straightforward. Pass a registered MLflow model and a Spark DataFrame. Get predictions back. It supports scikit-learn, LightGBM, XGBoost, CatBoost, ONNX, PyTorch, TensorFlow, Keras, Spark, Statsmodels, and Prophet.

The production requirements are few but absolute.

Write predictions to a Delta table in OneLake. Not to a temporary DataFrame that dies with the session. Partition that table by scoring date. Add a column for the model version that generated each row. This is your audit trail. When someone asks “why did customer 4471 show as high risk last Tuesday?”, you pull the partition, check the model version, and have an answer in minutes. Without that structure, the same question costs you a day.

Chain your scoring job to run after your semantic model refresh. Sequence matters. If the model scores data from the prior refresh cycle, your predictions are one step behind reality. Use Fabric pipelines to enforce the dependency explicitly. Refresh completes, scoring starts.

Real-time endpoints: know exactly what you are signing up for

Fabric now offers ML model endpoints in preview. Activate one from the model registry. Fabric spins up managed containers and gives you a REST API. Dataflow Gen2 can call the endpoint during data ingestion, enriching rows with predictions in flight.

The capability is real. The constraints are also real.

Real-time endpoints support a limited set of model flavors: Keras, LightGBM, scikit-learn, XGBoost, and (since January 2026) AutoML-trained models. PyTorch, TensorFlow, and ONNX are not supported for real-time serving. If your production model uses one of those frameworks, batch scoring is your only path.

The auto-sleep feature deserves attention. Endpoints scale capacity to zero after five minutes without traffic. The first request after sleep incurs a cold-start delay while containers spin back up. For use cases that need consistent sub-second latency, you have two options: disable auto-sleep and accept the continuous capacity cost, or send periodic synthetic requests to keep the endpoint warm.

The word “preview” is load-bearing here. Preview means the API can change between updates. Preview means SLAs are limited. Preview means you need a batch-scoring fallback in place before you route any production workflow through a real-time endpoint. Build the fallback first. Test it. Then add the real-time path as an optimization on top.

The rollback plan you need to write before you ship

Most teams build forward. They write the training pipeline, the scoring job, the endpoint, the Power BI report that consumes predictions. Then they ship.

Nobody writes the backward path. Until something goes wrong.

Your rollback plan has three parts.

First, keep at least two prior model versions in the registry. If the current version starts producing bad predictions, you roll back by updating the model alias. One API call. The scoring pipeline picks up the previous version on its next run.

Second, partition prediction tables by date and model version. Rolling back a model means nothing if downstream reports still display the bad predictions. With partitioned tables, you can filter or drop the scoring run from the misbehaving version and revert to the prior run’s output.

Third, a kill switch for real-time endpoints. One API call to deactivate the endpoint. Traffic falls back to the latest batch-scored Delta table. Your Power BI report keeps working, just without real-time enrichment, while you figure out what went wrong.

Test this plan. Not on paper. Run the rollback end to end in your dev environment. Time it. If reverting to a stable state takes longer than fifteen minutes, your plan is too complicated. Simplify it until the timer clears.

Ship it

The architecture Microsoft described is sound. Semantic Link for governed data access. MLflow for experiment tracking and model registration. PREDICT for batch scoring to OneLake. Real-time endpoints for low-latency enrichment. Power BI consuming prediction tables through Direct Lake or import.

But architecture alone does not keep a system running at 4 AM. The capacity plan does. The workspace isolation does. The data validation gate, the champion-challenger check, the scoring sequence, the endpoint fallback, the rollback drill. Those are what separate a demo from a service.

Do the checklist. Test the failure modes. Then ship.

This post was written with help from anthropic/claude-opus-4-6

Microsoft Fabric Warehouse + Spark: Interoperability Patterns That Actually Work

If you’ve spent any time in a Fabric workspace with both Data Engineering (Spark) and Data Warehouse, you’ve probably had this moment:

Spark is great for big transformations, complex parsing, and “just let me code it.”
The Warehouse is great for a curated SQL model, concurrency, and giving the BI world a stable contract.
And yet… teams still end up copying data around like they’re paid by the duplicate.

The good news: Fabric’s architectural bet is that OneLake + Delta is the contract surface across engines. That means you can design a pipeline where Spark and Warehouse cooperate instead of competing.

This post is a practical field guide to the integration patterns that work well in real projects:

3-part naming over the SQL endpoint (zero-copy default) – query Lakehouse Delta tables directly from Warehouse SQL without moving data.
Spark → Warehouse (file-based ingest) using COPY INTO and OPENROWSET over OneLake paths – when workload evidence calls for materialization.
Spark → Warehouse (table-to-table ingest) using cross-database queries / CTAS / INSERT…SELECT – same trigger.
Warehouse → Spark (read-only consumption) by reading the Warehouse table’s published Delta logs from Spark.

Along the way, I’ll call out the trade-offs, the gotchas, and the operational guardrails that keep teams out of trouble.

Mental model: OneLake is the handshake

In Fabric, multiple experiences can produce and consume Delta Lake tables. Microsoft Learn describes Delta Lake as the standard analytics table format in Fabric, and notes that Delta tables produced by one engine (including Fabric Data Warehouse and Spark) can be consumed by other engines.

So instead of thinking “Spark output” and “Warehouse tables” as two unrelated worlds, treat them as:

A shared storage plane (OneLake)
An open table format (Delta + Parquet)
Two compute engines with different strengths

The rest is just choosing where to materialize — or whether to materialize at all.

Start here: 3-Part Naming over the SQL Endpoint

Before you copy anything, ask: do I actually need a separate materialized table?

Fabric’s SQL analytics endpoint automatically exposes every Lakehouse Delta table as a queryable SQL object. From the Warehouse, you can reference those tables directly using 3-part naming:

SELECT * FROM MyLakehouse.dbo.clean_sales WHERE OrderDate >= '2026-01-01';

No COPY INTO. No CTAS. No duplicate storage. The query runs against the Lakehouse’s Delta files through the SQL endpoint — zero-copy interoperability out of the box.

When this is enough (and it often is)

Ad-hoc analytics and exploration across Spark-produced datasets.
Lightweight joins between Warehouse dimensions and Lakehouse facts.
BI semantic models that don’t need sub-second concurrency at scale.
Early-stage projects where the workload profile isn’t settled yet.

When to materialize instead

Materialize into dedicated Warehouse tables (COPY INTO, CTAS, INSERT…SELECT) when workload evidence justifies it:

High concurrency: many concurrent queries hitting the same dataset consistently.
Recurring heavy joins/aggregations: repeated complex queries where pre-materialized tables measurably reduce compute.
Stricter SLA / CU predictability: when you need tighter control over query performance and capacity consumption.
Governance boundaries: when the Warehouse should own and version the serving-layer schema independently from the Lakehouse.

If none of those conditions apply, 3-part naming is the right default. You can always materialize later when the numbers say you should.

The CU tradeoff

Virtualization (3-part naming) shifts cost to query-time: every read traverses the SQL endpoint and pays CU at execution. Materialization (COPY INTO / CTAS) pays an ingestion and storage cost once, so repeated reads are faster and more predictable in CU terms. Neither is universally better — the right call depends on query frequency, data volume, and your capacity budget.
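To make the trade-off concrete, here is a minimal sketch of the break-even arithmetic. The function name and every CU figure are hypothetical placeholders, not measured Fabric numbers; substitute values you observe in your own capacity metrics.

```python
import math


def break_even_reads(ingest_cu: float,
                     virtual_cu_per_read: float,
                     materialized_cu_per_read: float) -> float:
    """Reads per period after which materialization costs less total CU.

    All inputs are estimates you measure yourself (hypothetical units).
    Returns math.inf when a materialized read is not cheaper than a
    virtualized one, because materialization then never pays off.
    """
    saving_per_read = virtual_cu_per_read - materialized_cu_per_read
    if saving_per_read <= 0:
        return math.inf
    return ingest_cu / saving_per_read


# Made-up example: one load costs 600 CU-seconds, a virtualized read
# costs 50 and a materialized read costs 20.
reads_needed = break_even_reads(600.0, 50.0, 20.0)
print(reads_needed)  # 20.0
```

With these illustrative numbers, materialization only wins if the dataset is read more than 20 times per load cycle. Below that, 3-part naming stays cheaper.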

Pattern 1 – Spark → Warehouse via OneLake files (COPY INTO + OPENROWSET)

When to use it

Start with 3-part naming. Reach for COPY INTO / OPENROWSET file-based ingest only when workload evidence (sustained concurrency pressure, SLA requirements, or CU unpredictability) tells you virtualization isn’t enough. This pattern fits when:

Your Spark pipeline already produces files (Parquet/CSV/JSONL) under a Lakehouse Files path.
You need faster or more predictable query performance than the SQL endpoint provides for this dataset.
You want a clean separation: Spark writes files; Warehouse owns the serving tables.

Step 1: Write a “handoff” dataset from Spark

In Spark, write a handoff dataset into the Lakehouse Files area (not Tables). Conceptually:

(
    df
    .write
    .mode("overwrite")
    .format("parquet")
    .save("Files/handoff/sales_daily/")
)

Why Files? Because the Warehouse can point COPY INTO / OPENROWSET at file paths, and the Files area is designed to hold arbitrary file layouts.

Step 2: Inspect the file shape from the Warehouse (OPENROWSET)

Before you ingest, use OPENROWSET to browse a file (or a set of files) and confirm the schema is what you think it is.

Microsoft Learn documents that Fabric Warehouse OPENROWSET can read Parquet/CSV files, and that the files can be stored in Azure Blob Storage, ADLS, or Fabric OneLake (with OneLake reads called out as preview).

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://onelake.dfs.fabric.microsoft.com/<workspaceId>/<lakehouseId>/Files/handoff/sales_daily/*.parquet'
) AS rows;

Step 3: Ingest into a Warehouse table (COPY INTO)

The Fabric blog announcement for OneLake as a source for COPY INTO and OPENROWSET highlights the point of this feature: load and query Lakehouse file folders without external staging storage or SAS tokens.

COPY INTO dbo.SalesDaily
FROM 'https://onelake.dfs.fabric.microsoft.com/<workspaceId>/<lakehouseId>/Files/handoff/sales_daily/'
WITH (
    FILE_TYPE = 'PARQUET'
);
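If you generate these statements from an orchestration notebook rather than typing them by hand, a small builder keeps the URL shape consistent. This is a sketch only: the helper names are hypothetical, the IDs are placeholders, and it simply reproduces the URL and COPY INTO shape shown above without validating anything against the Warehouse.

```python
def onelake_files_url(workspace_id: str, lakehouse_id: str, relative_path: str) -> str:
    """Build the OneLake HTTPS URL for a folder under the Lakehouse Files area."""
    clean = relative_path.strip("/")
    return (
        "https://onelake.dfs.fabric.microsoft.com/"
        f"{workspace_id}/{lakehouse_id}/Files/{clean}/"
    )


def copy_into_statement(target_table: str, source_url: str,
                        file_type: str = "PARQUET") -> str:
    """Render a COPY INTO statement for the given target table and source folder."""
    return (
        f"COPY INTO {target_table}\n"
        f"FROM '{source_url}'\n"
        f"WITH (FILE_TYPE = '{file_type}');"
    )


# Placeholder IDs; use your real workspace and lakehouse GUIDs.
url = onelake_files_url("<workspaceId>", "<lakehouseId>", "handoff/sales_daily")
print(copy_into_statement("dbo.SalesDaily", url))
```

The builder only formats text. Running the statement still happens wherever you normally execute Warehouse T-SQL.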

Operational guardrails

Treat the Files path as a handoff contract: version it, keep it predictable, and don’t “just drop random stuff in there.”
If you’ll query the same external data repeatedly, ingest it into a dedicated Warehouse table (Microsoft Learn notes repeated OPENROWSET access can be slower than querying a table).
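One way to keep the handoff path predictable is to derive it from a single function instead of hard-coding strings in each notebook. The layout below (a schema version segment plus a load_date folder) is an assumed convention for illustration, not something Fabric prescribes, and the function name is hypothetical.

```python
import re
from datetime import date

# Assumed naming rule for dataset folders: lowercase letters, digits, underscores.
_DATASET_NAME = re.compile(r"^[a-z][a-z0-9_]*$")


def handoff_path(dataset: str, schema_version: int, load_date: date) -> str:
    """Return a versioned Files path for one handoff load (hypothetical layout)."""
    if not _DATASET_NAME.match(dataset):
        raise ValueError(f"unexpected dataset name: {dataset!r}")
    if schema_version < 1:
        raise ValueError("schema_version must be 1 or greater")
    return (
        f"Files/handoff/{dataset}/v{schema_version}/"
        f"load_date={load_date.isoformat()}/"
    )


print(handoff_path("sales_daily", 1, date(2026, 1, 1)))
# Files/handoff/sales_daily/v1/load_date=2026-01-01/
```

Bumping the version segment on a breaking schema change lets the Warehouse keep loading the old layout until its table definition catches up.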

Pattern 2 – Spark → Warehouse via in-workspace tables (CTAS / INSERT…SELECT)

When to use it

As with Pattern 1, start with 3-part naming and materialize via CTAS / INSERT…SELECT only when workload metrics confirm you need it. This pattern fits when:

Your Spark output is naturally a Delta table (Lakehouse Tables area) and 3-part naming queries against it hit concurrency or performance limits.
You want the Warehouse to own a curated serving-layer model (joins, dimensional modeling, computed columns) with predictable CU spend.
You prefer SQL-native table-to-table pipelines over file-level ingestion.

Step 1: Produce a curated Delta table with Spark

(
    df_clean
    .write
    .mode("overwrite")
    .format("delta")
    .save("Tables/clean_sales")
)

Step 2: Materialize a Warehouse table from the Lakehouse table

Microsoft Learn notes that for T-SQL ingestion, you can use patterns like INSERT…SELECT, SELECT INTO, or CREATE TABLE AS SELECT (CTAS) to create or update tables from other items in the same workspace (including lakehouses).

CREATE TABLE dbo.FactSales
AS
SELECT
    OrderDate,
    StoreId,
    ProductId,
    Quantity,
    NetAmount
FROM MyLakehouse.dbo.clean_sales;

For incremental loads you’ll often end up with a staging + merge strategy, but the key idea stays the same: Spark produces the curated dataset; the Warehouse owns the serving tables.
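As a simpler starting point than a full staging and merge flow, an append-only incremental load can be expressed as INSERT…SELECT with a watermark filter. The sketch below only renders the statement. The function name is hypothetical, the watermark column is assumed to be a date, and it does not handle updates or deletes, which is where a staging table comes in.

```python
from datetime import date


def incremental_insert_sql(target: str, source: str, columns: list,
                           watermark_column: str, last_loaded: date) -> str:
    """Render an append-only INSERT...SELECT for rows newer than the watermark."""
    col_list = ", ".join(columns)
    return (
        f"INSERT INTO {target} ({col_list})\n"
        f"SELECT {col_list}\n"
        f"FROM {source}\n"
        f"WHERE {watermark_column} > '{last_loaded.isoformat()}';"
    )


sql = incremental_insert_sql(
    target="dbo.FactSales",
    source="MyLakehouse.dbo.clean_sales",
    columns=["OrderDate", "StoreId", "ProductId", "Quantity", "NetAmount"],
    watermark_column="OrderDate",
    last_loaded=date(2026, 1, 1),
)
print(sql)
```

Store the watermark somewhere the Warehouse owns, so a rerun after a failure picks up from the last successful load rather than from whatever Spark wrote most recently.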

Pattern 3 – Warehouse → Spark via published Delta logs (read-only)

This is the pattern that surprises people (in a good way): the Warehouse isn’t a closed box.

Microsoft Learn documents that Warehouse user tables are stored in Parquet, and that Delta Lake logs are published for all user tables. The key consequence is that any engine that can read Delta tables can get direct access to Warehouse tables – read-only.

Step 1: Get the OneLake path for a Warehouse table

In the Warehouse UI, table Properties exposes the table’s URL / ABFS URI (Learn walks through the steps).

Step 2: Read the Warehouse table from Spark (read-only)

warehouse_table_path = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<warehouseId>/Tables/dbo/FactSales"
fact_sales_df = spark.read.format("delta").load(warehouse_table_path)

This access is read-only from Spark. Writes must go through the Warehouse to maintain ACID compliance.
Delta log publishing is a background process after commits, so treat cross-engine visibility as “near real-time,” not “every millisecond.”

Bonus control: pause Delta log publishing

The same Learn doc describes an operational lever you can use when you need stability during a large set of changes:

ALTER DATABASE CURRENT SET DATA_LAKE_LOG_PUBLISHING = PAUSED;
-- ... bulk updates ...
ALTER DATABASE CURRENT SET DATA_LAKE_LOG_PUBLISHING = AUTO;

When publishing is paused, other engines see the pre-pause snapshot; Warehouse queries still see the latest.

Choosing an ownership model (so you don’t end up with two sources of truth)

The integration is easy. The contract is the hard part.

A simple rule that prevents a lot of pain:

If Spark is writing it: Warehouse can ingest it, but Spark owns the dataset.
If Warehouse is writing it: Spark can read it, but Warehouse owns the dataset.

In other words: pick one writer.

For most analytics teams, a good default is:

Spark owns bronze/silver (raw + cleaned Delta in the Lakehouse)
Warehouse owns gold (facts/dimensions, KPI-ready serving tables) — but “owns” doesn’t always mean “physically copies.” A cross-database query via 3-part naming can serve gold-layer reads without materialization.

Start with 3-part naming for cross-engine reads. Materialize across the boundary only when workload metrics — not assumptions — tell you to. Remember: virtualization shifts CU cost to query-time; materialization front-loads ingestion and storage so repeated reads are cheaper and more predictable. Let your actual usage patterns decide.

Quick checklist: production-hardening the Spark ↔ Warehouse boundary

Make the handoff explicit (a specific Files path or a specific Lakehouse table).
Version your schema (breaking changes should be intentional and tested).
Avoid singleton inserts into Warehouse; prefer bulk patterns (CTAS, INSERT…SELECT).
Validate row counts and freshness after each load (and alert on drift).
Treat Delta log publishing as eventual across engines; design your BI/ML expectations accordingly.
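The row-count and freshness check in that list can be a few lines of plain Python once you have the numbers in hand. This is a sketch under assumptions: the function and class names are hypothetical, the counts and timestamps come from queries you run yourself, and the default thresholds are placeholders to tune.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class LoadCheck:
    ok: bool
    problems: list = field(default_factory=list)


def validate_load(source_rows: int, target_rows: int,
                  last_loaded_at: datetime, now: datetime,
                  max_staleness: timedelta = timedelta(hours=24),
                  tolerance: float = 0.0) -> LoadCheck:
    """Flag row-count drift and stale loads; thresholds are illustrative."""
    problems = []
    if source_rows == 0:
        drift = 0.0 if target_rows == 0 else 1.0
    else:
        drift = abs(source_rows - target_rows) / source_rows
    if drift > tolerance:
        problems.append(f"row count drift {drift:.2%} exceeds {tolerance:.2%}")
    if now - last_loaded_at > max_staleness:
        problems.append(f"last load at {last_loaded_at.isoformat()} is stale")
    return LoadCheck(ok=not problems, problems=problems)


check = validate_load(1000, 900, datetime(2026, 1, 1), datetime(2026, 1, 2, 12))
print(check.ok, check.problems)
```

Wire the problems list into whatever alerting you already use. The point is that a failed check blocks downstream consumers instead of silently serving a partial load.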

Summary

Fabric is at its best when you let each engine do what it’s good at:

Spark for transformation, enrichment, and complex data engineering logic.
Warehouse for the curated serving model and SQL-first consumers.

OneLake + Delta is the glue. Start with 3-part naming for zero-copy interoperability across engines, and materialize only when workload evidence justifies the extra storage and ingestion cost. That way you get the simplicity of one logical data layer without paying for copies you don’t need.

This post was written with help from Opus 4.6

References

Delta Lake table format interoperability (Microsoft Learn)
Delta Lake Logs in Warehouse (Microsoft Learn)
SQL analytics endpoint of the Lakehouse (Microsoft Learn)
Cross-database querying in Fabric Warehouse (Microsoft Learn)
Ingest data into the Warehouse (Microsoft Learn)
Browse file content with OPENROWSET (Microsoft Learn)
OneLake as a source for COPY INTO and OPENROWSET (Preview) (Microsoft Fabric Blog)

What SQL database in Fabric actually means for your Spark pipelines

There is a particular kind of excitement that sweeps through data engineering teams when Microsoft announces a new database option. It is the same mixture of curiosity and low-grade dread you might feel upon learning that your neighborhood is getting a new highway interchange. Useful, probably. Disruptive, definitely. Someone is going to have to figure out the on-ramps.

SQL database in Fabric went generally available in November 2025. Built on the same SQL Database Engine that powers Azure SQL Database, it is the first fully SaaS-native operational database living inside Microsoft Fabric. More than 50,000 SQL databases were created during the preview period alone. If you spend your days writing Spark notebooks, building lakehouses, and tending ETL pipelines, this thing will change how you work whether you plan for it or not.

Here is what you need to know, what you should actually do about it, and where the potholes are hiding.

Your operational data now lands in OneLake automatically

The headline feature for Spark teams is automatic replication to OneLake. When data gets written to a SQL database in Fabric, it mirrors to OneLake as Delta tables in near real-time. No pipelines. No connectors. No orchestration jobs that fail silently at 2 AM and ruin your Monday.

This sounds almost too convenient, and in some ways it is. The mirrored Delta tables arrive in an open format your Spark notebooks can read directly. You point a DataFrame at the mirrored location, run your transformations, and push results to your gold layer without ever having written an ingestion pipeline for that source.

If your team currently runs nightly batch loads from Azure SQL or SQL Server into a lakehouse, this is a real shift. That entire category of extract-and-load work can shrink or vanish. But “can” is doing heavy lifting in that sentence, and we need to talk about why.

How this changes daily Spark development

The practical impact shows up in a few specific places.

Reading operational data gets simpler. Instead of maintaining JDBC connections, managing credential rotation, and writing Spark code to pull from SQL, you read Delta tables from OneLake. The data is already there. Your Spark cluster does not need network access to the SQL database itself. One fewer firewall rule, one fewer connection string in your key vault, one fewer thing that breaks when someone rotates a password on a Friday afternoon.

Schema changes arrive faster than you can react. With batch ETL, you had a buffer. The pipeline would fail, someone would get an alert, and you had time to adapt your downstream notebooks. Near real-time mirroring removes that cushion. A column rename or type change in the operational database shows up in your Delta tables within seconds to minutes. If your Spark jobs reference columns by name (they do), you need schema evolution handling that most teams have not built yet.

Think about what happens when a developer on the application side renames customer_id to cust_id on a Wednesday afternoon. Your batch pipeline would have failed that night, you would have caught it Thursday morning, and the fix would be a one-line column alias. With mirroring, your running Spark job gets an AnalysisException: cannot resolve 'customer_id' mid-stream. The fix is the same, but the timing is worse.
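A cheap defence is to resolve column names up front against a list of accepted aliases, so a rename fails fast with a clear message (or is absorbed) before any transformation runs. The helper below is a hypothetical sketch that works on plain lists of column names. In PySpark you would pass it df.columns.

```python
def resolve_columns(actual_columns, candidates):
    """Map each canonical column name to the name actually present.

    candidates: dict of canonical name -> accepted source names, in priority
    order. Raises KeyError listing every canonical column that has no match.
    """
    present = set(actual_columns)
    resolved, unresolved = {}, []
    for canonical, names in candidates.items():
        match = next((n for n in names if n in present), None)
        if match is None:
            unresolved.append(canonical)
        else:
            resolved[canonical] = match
    if unresolved:
        raise KeyError(f"unresolved columns: {unresolved}")
    return resolved


# The mirrored table now exposes cust_id instead of customer_id.
mapping = resolve_columns(
    ["cust_id", "amount"],
    {"customer_id": ["customer_id", "cust_id"], "amount": ["amount"]},
)
print(mapping)  # {'customer_id': 'cust_id', 'amount': 'amount'}
```

With the mapping in hand, the job can select each source column under its canonical alias, and a rename nobody anticipated stops the run at the first step instead of mid-stream.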

SQL users can now query your lakehouse data directly. SQL database in Fabric supports OPENROWSET and External Tables for querying OneLake data in CSV, Parquet, and JSON formats. Your SQL-writing colleagues can query lakehouse data without Spark. That sounds like a collaboration win until a SQL user runs a full table scan on your carefully partitioned Parquet files and you both learn something new about capacity throttling.

Establish clear ownership of shared datasets early. Document which OneLake paths are read-safe for SQL access and which ones carry performance risk.

The SQL Analytics Endpoint changes reporting paths. Every SQL database in Fabric gets a SQL Analytics Endpoint that sits on top of the mirrored data. Power BI can hit this endpoint with Direct Lake, which means your Spark team might no longer be in the critical path for building reporting datasets. If you have spent months building and maintaining a medallion architecture primarily to serve Power BI, parts of that effort become optional. Whether that feels like relief or irrelevance depends on your org chart.

Migration risks worth planning for

Before you start ripping out pipelines, here are the things that deserve a red flag on your project board.

Capacity billing is shared, and the math is unforgiving. SQL database in Fabric consumes the same Fabric capacity as your Spark jobs, warehouses, and Power BI refreshes. If someone provisions a heavily used SQL database on the same capacity where your Spark notebooks run, you will feel it. Fabric capacity is a zero-sum game. The new player at the table did not bring extra chips.

Run a two-week trial on a dedicated capacity before mixing SQL database workloads with existing Spark production. Use the Microsoft Fabric Capacity Metrics App to understand exactly how many CUs the database consumes at rest and under load.

Near real-time is not real-time, and the gap varies. The mirroring latency depends on transaction volume and capacity pressure. Under light load, changes appear in seconds. Under heavy load on a congested capacity, you might see minutes of lag. If your Spark pipelines assume data completeness at a specific watermark, you need to measure actual replication lag under realistic conditions. A simple row-count comparison between the SQL database and the mirrored Delta table, run every five minutes for a week, will tell you more than any documentation.
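Once you have a week of those five-minute samples, summarizing them is straightforward. The sketch below assumes each sample is a (source_rows, mirror_rows) pair you collected yourself, and the function name is hypothetical. Note that a row-count gap measures backlog, not elapsed time, so treat it as a proxy for lag.

```python
from statistics import median


def summarize_gap(samples):
    """Summarize how far the mirror trailed the source across samples.

    samples: list of (source_rows, mirror_rows) pairs taken at intervals.
    """
    if not samples:
        raise ValueError("no samples collected")
    gaps = [max(src - mir, 0) for src, mir in samples]
    return {
        "max_gap": max(gaps),
        "median_gap": median(gaps),
        "in_sync_ratio": sum(g == 0 for g in gaps) / len(gaps),
    }


print(summarize_gap([(100, 100), (120, 110), (150, 150), (200, 170)]))
# {'max_gap': 30, 'median_gap': 5.0, 'in_sync_ratio': 0.5}
```

If the maximum gap clusters around your busiest capacity windows, that tells you the watermark logic in downstream Spark jobs needs a safety margin at exactly those times.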

Security boundaries do not mirror perfectly. SQL database in Fabric supports Microsoft Entra authentication, row-level security, customer-managed keys, and SQL auditing (in preview). Your lakehouse uses OneLake RBAC, workspace roles, and Spark-level access controls. The mirrored data inherits some but not all of these boundaries. Row-level security in the SQL database, for instance, does not automatically apply to the mirrored Delta table in OneLake. If you have sensitive columns, verify the access controls on the mirror before your entire data team has read access.

Vendor lock-in compounds quietly. Every pipeline you remove and every JDBC connector you delete makes you more dependent on Fabric-internal mechanisms. If you later need to run Spark on Databricks, on EMR, or on bare-metal clusters, your data ingestion path disappears. This is not a reason to avoid the feature, but it is a reason to document what you replaced and keep a migration playbook somewhere that is not a Confluence page nobody remembers exists.

A rollout checklist for Spark teams

If you are ready to start integrating SQL database in Fabric into your data engineering stack, here is a practical sequence.

Inventory your SQL-sourced pipelines. List every Spark job that reads from Azure SQL, SQL Server, or any SQL-based source via JDBC, linked services, or copy activities. Note the refresh frequency, data volume, and downstream dependencies. If you cannot produce this list in under an hour, that is itself a useful finding.
Provision a SQL database in Fabric on a non-production capacity. Do not test this on production. Capacity contention is real, and you want to understand billing impact before it appears on someone else’s finance report.
Mirror a single non-critical table and validate. Pick a reference table, something small and stable. Confirm the Delta table lands in OneLake, check the schema, verify column types, and read it from a Spark notebook. Compare row counts and checksums against the source.
Measure replication lag under real load. Insert, update, and delete rows in the SQL database and time how quickly those changes appear in the mirrored Delta table. Run this test during your normal capacity utilization window, not during off-hours when capacity is idle and results are misleadingly fast.
Test schema evolution deliberately. Add a column. Rename a column. Change a data type. Observe what happens to the mirrored Delta table and to any Spark jobs reading it. Build your error handling before this surprises you in production.
Audit security boundaries on the mirror. Check whether row-level security, column masking, or other access controls in the SQL database are reflected in the mirrored OneLake data. Document gaps and decide whether they are acceptable for your data classification. If they are not, add a data masking step between the mirror and your Spark consumers.
Run a cost comparison over two weeks. Compare the Fabric capacity consumption of the SQL database plus mirroring against your current pipeline compute costs. Include the engineering time saved, but be honest. “We saved two hours a month of pipeline maintenance” is a real number. “We saved countless engineering hours” is not.
Deprecate one pipeline as a pilot. Pick your simplest SQL-sourced pipeline, redirect the downstream Spark job to read from the mirrored Delta table, and run both paths in parallel for at least two sprints. When you are confident, decommission the old pipeline and update your runbooks.
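For the row-count and checksum comparison in the validation step, an order-insensitive fingerprint is enough for a small reference table. This is a sketch for modest row counts collected to the driver. The function name is hypothetical, and it relies on both sides producing identical Python types for each value (1 and 1.0 hash differently), so normalize types first.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, digest) that ignores row order but not duplicates."""
    row_digests = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()
    return len(row_digests), combined


source_rows = [(1, "a"), (2, "b")]
mirror_rows = [(2, "b"), (1, "a")]  # same content, different order
print(table_fingerprint(source_rows) == table_fingerprint(mirror_rows))  # True
```

For large tables, compute an aggregate hash inside each engine instead of collecting rows. The comparison logic stays the same.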

Vector search: a side door into AI workloads

SQL database in Fabric supports the native vector data type and vector indexing. This opens up retrieval-augmented generation (RAG) patterns directly inside the database, without adding a separate vector store to your architecture.

For Spark teams building ML pipelines or feeding large language models, the value is in co-location. You can store embeddings alongside your operational data, run similarity searches in SQL, and then access the same data from Spark for model training or batch inference. A product catalog with embeddings stored as vectors in SQL can serve both a real-time search API and a nightly Spark training job without data duplication.
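If similarity search is new to your team, the underlying idea is small enough to show in plain Python. The sketch below illustrates cosine similarity and a brute-force top-k lookup over a made-up catalog. It says nothing about how the database implements vector indexing, and the names and vectors are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def top_k(query, catalog, k=2):
    """Brute-force nearest neighbours: score every item, keep the best k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


catalog = {"kettle": [1.0, 0.0], "novel": [0.0, 1.0], "teapot": [1.0, 1.0]}
print([name for name, _ in top_k([1.0, 0.0], catalog)])  # ['kettle', 'teapot']
```

A vector index exists to avoid scoring every row like this. The nightly Spark job and the search API can still agree on the same embeddings and the same distance measure.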

This will not replace Pinecone or Weaviate for teams running high-throughput similarity search at scale. But for teams running modest-scale RAG or semantic search against operational data, it removes one service from the architecture and one deployment from the on-call rotation. That is not nothing.

What to expect next

Microsoft has made it clear that SQL database in Fabric is part of a longer play to bring operational data fully into the Fabric ecosystem. The integration with Copilot in the Query Editor, support for Terraform and Fabric CLI automation, and the first-ever SQLCon conference co-located with FabCon Atlanta in March 2026 all point the same direction: the wall between transactional and analytical workloads is getting thinner.

For Spark data engineering teams, the right move is not to panic and rewrite everything. It is to understand the mechanics, run a controlled test, and make deliberate decisions about which pipelines to retire and which to keep. The highway interchange is open. You just need to figure out your on-ramp.

This post was written with help from Opus 4.6

Microsoft Fabric Table Maintenance Optimization: A Cross-Workload Survival Guide

Your Delta tables are drowning. Thousands of tiny Parquet files pile up after every streaming microbatch. Power BI dashboards stall on cold-cache queries. SQL analytics endpoints grind through fragmented row groups. And somewhere in the middle of the medallion architecture, a Spark job is rewriting perfectly good files because nobody told it they were already compacted.

This is the small-file problem at scale — and in Microsoft Fabric, where a single Delta table can serve Spark, SQL analytics endpoint, Power BI Direct Lake, and Warehouse simultaneously, it becomes a cross-workload survival situation. Microsoft recently published a comprehensive cross-workload table maintenance guide that provides a clear map out. Here’s how to use it.

Every Engine Wants Something Different

The core challenge is that each consumption engine has a different idea of what an “optimally sized” file looks like. Get this wrong and you optimize for one consumer while punishing another.

Here’s the terrain:

Spark reads efficiently across a wide range — 128 MB to 1 GB depending on table size. V-Order isn’t required and adds 15–33% write overhead. Spark cares about parallelism, not VertiPaq encoding.
SQL analytics endpoint and Warehouse want files around 400 MB with roughly 2 million rows per row group, plus V-Order enabled for an approximate 10% read improvement.
Power BI Direct Lake is the most demanding consumer. It needs V-Order (delivering 40–60% cold-cache improvement), row groups of 8 million+ rows, and minimal file count to reduce transcoding overhead.

If you serve all three from the same Gold table, you need to make deliberate tradeoffs — or maintain multiple copies optimized for different patterns. Storage is cheap relative to compute. Compute wasted on bad file layouts is not.

The Three Commands That Keep You Alive

Table maintenance in Fabric boils down to three operations: OPTIMIZE, VACUUM, and the configuration pair of auto-compaction and optimize write. Each one addresses a different failure mode.

OPTIMIZE: Bin Compaction

OPTIMIZE consolidates small files into larger ones. It is your primary weapon against file fragmentation:

-- Basic compaction
OPTIMIZE schema_name.table_name

-- With V-Order for Power BI consumers
OPTIMIZE schema_name.table_name VORDER

-- With Z-Order for selective filter queries
OPTIMIZE schema_name.table_name ZORDER BY (region, event_date)

A critical detail: OPTIMIZE is a Spark SQL command. It runs in notebooks, Spark job definitions, and the Lakehouse Maintenance UI. You cannot run it from the SQL analytics endpoint or Warehouse SQL editor.

Before you optimize blindly, use the dry-run option to assess scope:

OPTIMIZE schema_name.table_name DRY RUN

This returns the files eligible for rewriting without touching the table — essential for estimating cost before committing compute.

VACUUM: Dead File Cleanup

After OPTIMIZE rewrites files, the old versions remain on disk for time travel. VACUUM removes files the Delta log no longer references:

-- Default 7-day retention
VACUUM schema_name.table_name

-- Explicit retention
VACUUM schema_name.table_name RETAIN 168 HOURS

The default seven-day retention exists for good reason: concurrent readers and writers may still reference those files. Drop below seven days and you risk reader failures or table corruption. If you must shorten retention, set spark.databricks.delta.retentionDurationCheck.enabled to false — but think carefully before you do.
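If maintenance statements are generated by a scheduler, it helps to make the short-retention case an explicit opt-in rather than something a typo can trigger. The wrapper below is a hypothetical sketch that only builds the VACUUM text shown above. It does not touch the retention check setting itself.

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # 168 hours, the seven-day default


def vacuum_statement(table: str,
                     retain_hours: int = DEFAULT_RETENTION_HOURS,
                     allow_short_retention: bool = False) -> str:
    """Render a VACUUM statement, refusing sub-default retention by default."""
    if retain_hours < DEFAULT_RETENTION_HOURS and not allow_short_retention:
        raise ValueError(
            f"retention of {retain_hours}h is below the "
            f"{DEFAULT_RETENTION_HOURS}h default; pass "
            "allow_short_retention=True to confirm"
        )
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"


print(vacuum_statement("schema_name.table_name"))
# VACUUM schema_name.table_name RETAIN 168 HOURS
```

The explicit flag leaves a trail in code review whenever someone decides that shorter retention is worth the risk.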

Auto-Compaction + Optimize Write: Prevention Over Cure

Rather than waiting for file fragmentation to become a problem, these two features prevent it during ingestion:

Optimize write performs pre-write compaction, generating fewer, larger files at write time:

spark.conf.set('spark.databricks.delta.optimizeWrite.enabled', 'true')

Auto-compaction evaluates partition health after each write and triggers synchronous compaction when fragmentation is detected:

spark.conf.set('spark.databricks.delta.autoCompact.enabled', 'true')

Auto-compaction is broadly beneficial and recommended for most ingestion pipelines. Microsoft’s documentation recommends auto-compaction over manually scheduled OPTIMIZE jobs for most workloads, noting it “generally outperforms scheduled compaction jobs at maximizing read/write performance.”

Optimize write, however, is workload-dependent. It adds overhead at write time to coalesce small output files into larger ones. This is valuable for write patterns that naturally produce many small files — streaming microbatch jobs, high-frequency small appends, and similar patterns. For workloads that already produce reasonably sized files (e.g., large batch ETL writing well-partitioned data), optimize write adds overhead without meaningful benefit. Do not enable it by default — evaluate your write pattern first.

The Medallion Layer Checklist

The right maintenance strategy depends on where the table sits in your medallion architecture. Here is a concrete, layer-by-layer breakdown:

Bronze (Landing Zone)

Priority: Ingestion speed
Auto-compaction: Enable by default (optional at this layer; can be turned off when raw ingestion speed matters most)
Optimize write: Workload-dependent — enable only for write patterns that produce many small files (e.g., streaming microbatch, high-frequency small appends). Do not enable by default.
V-Order: No (unnecessary write overhead)
Liquid Clustering: No
Target file size: Use Adaptive Target File Size (ATFS), which dynamically calculates the ideal target. No manual tuning needed for most workloads.
Scheduled OPTIMIZE: Optional
Rule: Never serve Bronze tables directly to SQL analytics endpoint or Power BI Direct Lake.

Silver (Curated Zone)

Priority: Balance ingestion and query performance
Auto-compaction: Enable
Optimize write: Workload-dependent — enable for streaming or small-write ingestion patterns; skip for batch ETL that already produces well-sized files.
V-Order: Optional (enable if SQL or Power BI consumers query this layer)
Liquid Clustering or Z-Order: Recommended
Target file size: Use Adaptive Target File Size (ATFS) as the default. ATFS dynamically calculates the ideal file size, eliminating the need to manually specify a target. Only consider a user-defined target file size (e.g., 128–256 MB) in advanced hyper-tuning scenarios — the vast majority of workloads should not go this route.
Scheduled OPTIMIZE: Generally unnecessary when both auto-compaction and ATFS are enabled. With ATFS, auto-compaction and OPTIMIZE operate on the same dynamic target — so auto-compaction already handles what a scheduled OPTIMIZE would do. A separate OPTIMIZE schedule only matters when ATFS is not used, since auto-compaction defaults to a 128 MB target while OPTIMIZE defaults to 1 GB, creating a compaction gap. With ATFS, this discrepancy goes away. Reserve scheduled OPTIMIZE for edge cases or tables where auto-compaction is disabled.

Gold (Serving Zone)

Priority: Read performance for analytics
Auto-compaction: Enable
Optimize write: Workload-dependent — enable for streaming or small-write ingestion into Gold tables; not required for batch loads that already produce appropriately sized files.
V-Order: Required for Power BI Direct Lake; beneficial for SQL
Liquid Clustering: Evaluate the tradeoff — provides flexibility but has high compaction cost in Runtime 1.3 (see LC section). Partitioning is often the better choice until Runtime 2.0. Use LC only when you need to evolve clustering keys or query patterns are unpredictable.
Target file size: Use Adaptive Target File Size (ATFS) as the default. ATFS dynamically selects the right file size based on your table and workload characteristics. Only deviate to a user-defined target in advanced hyper-tuning scenarios — the overwhelming majority of customers should use ATFS.
Scheduled OPTIMIZE: Generally unnecessary when both auto-compaction and ATFS are enabled — auto-compaction already targets the same dynamic size that OPTIMIZE would. Without ATFS, a scheduled OPTIMIZE may still be needed because auto-compaction (128 MB default target) leaves files smaller than OPTIMIZE’s 1 GB default target. With ATFS enabled, both operations converge on the same target, making separate scheduling redundant for most workloads.

For Gold tables serving multiple consumers, the target characteristics to keep in mind (when hyper-tuning beyond ATFS):

Consumer	V-Order	Target File Size	Row Group Size
SQL analytics endpoint	Yes	400 MB	2M rows
Power BI Direct Lake	Yes	400 MB–1 GB	8M+ rows
Spark	Optional	128 MB–1 GB	1–2M rows

Note: For most workloads, Adaptive Target File Size (ATFS) will dynamically select an appropriate target across these consumers. The table above is reference for advanced tuning only.

V-Order: Know When to Pay the Tax

V-Order applies VertiPaq-compatible sorting, encoding, and compression at write time. The performance gains for Power BI Direct Lake — 40–60% on cold-cache queries — make it indispensable for Gold-layer tables feeding dashboards. But V-Order adds 15–33% to write time and provides no inherent benefit for Spark-to-Spark pipelines.

The decision framework:

Gold tables → Power BI or SQL consumers: V-Order on.
Bronze/Silver tables → Spark pipelines only: V-Order off.
Mixed consumers: Maintain separate copies — a Spark-optimized Silver table and a V-Ordered Gold table.

Set V-Order at the table level for consistency across sessions and jobs:

ALTER TABLE schema_name.gold_table
SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')

Liquid Clustering vs. Partitioning vs. Z-Order

Liquid Clustering (LC) provides flexibility where partitioning is rigid. With LC, you can change clustering keys without rewriting the entire table, and it can deliver better file skipping for queries that don’t align neatly with partition boundaries. Define it at table creation:

CREATE TABLE schema_name.events (
  id INT,
  category STRING,
  event_date DATE
) CLUSTER BY (category)

But that flexibility comes at a significant cost in Fabric Runtime 1.3. The underlying Delta 3.2 LC implementation reclusters all data every time you run OPTIMIZE, until groups of clustered files exceed 100 GB. For most tables, nothing crosses that threshold, so every OPTIMIZE pass rewrites the same data again. Compaction duration therefore grows linearly with data volume across hundreds of OPTIMIZE iterations, and there is no way around it in the current runtime.

For most scenarios in Runtime 1.3, partitioning remains the better choice. If your query patterns are well-understood and stable — which covers the majority of production analytics workloads — static partitioning gives you equivalent or better file skipping at a fraction of the maintenance cost. LC makes sense when you genuinely need the flexibility to evolve clustering keys over time, or when your query patterns are unpredictable — but understand that you are paying for that flexibility with linearly growing compaction overhead on every OPTIMIZE run.
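A back-of-the-envelope model shows why that overhead adds up. The sketch assumes a deliberately simplified picture: the table grows by a fixed increment between OPTIMIZE runs, stays under the 100 GB clustered-group threshold, and each run rewrites everything written so far. The function name and numbers are illustrative, not measurements.

```python
def rewritten_per_pass(increment_gb: float, passes: int) -> list:
    """GB rewritten by each OPTIMIZE pass when every pass reclusters all data."""
    return [increment_gb * i for i in range(1, passes + 1)]


# 1 GB of new data before each of 50 OPTIMIZE runs.
per_pass = rewritten_per_pass(1.0, 50)
total_rewritten = sum(per_pass)
print(per_pass[-1], total_rewritten)  # 50.0 1275.0
```

Under these assumptions the fiftieth pass rewrites 50 GB to cluster 1 GB of new data, and the cumulative rewrite volume is 1,275 GB for a 50 GB table. With static partitioning, compaction can stay confined to the partitions that received new files.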

Use Z-Order when your table is already partitioned (Liquid Clustering does not work with partitioned tables) or when queries filter on two or more columns together.

One critical gotcha regardless of approach: data is only clustered when OPTIMIZE runs. Regular write operations do not apply clustering. Without a compaction strategy, you get zero benefit from Liquid Clustering — the layout never materializes.

Diagnosing Table Health

Before optimizing anything, assess where you stand:

# DESCRIBE DETAIL returns a single row of table-level metadata
details = spark.sql("DESCRIBE DETAIL schema_name.table_name").collect()[0]

print(f"Table size: {details['sizeInBytes'] / (1024**3):.2f} GB")
print(f"Number of files: {details['numFiles']}")

# Guard against empty tables to avoid dividing by zero
if details['numFiles']:
    avg_file_mb = (details['sizeInBytes'] / details['numFiles']) / (1024**2)
    print(f"Average file size: {avg_file_mb:.2f} MB")

Healthy tables have evenly distributed file sizes within 2× of each other. Files under 25 MB signal fragmentation. Files over 2 GB reduce parallelism. Use DESCRIBE HISTORY to review write patterns and check whether auto-compaction has been running.
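Those thresholds are easy to turn into a quick check. The sketch below assumes you already have a list of per-file sizes in bytes (for example from a listing of the table folder; DESCRIBE DETAIL only gives totals). The function name and the returned keys are hypothetical; the 25 MB, 2 GB, and 2× limits are the ones from the paragraph above.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def summarize_file_sizes(sizes_bytes, small_mb=25, large_gb=2, spread_limit=2.0):
    """Flag fragmentation, oversized files, and uneven file sizes."""
    if not sizes_bytes:
        return {"files": 0, "small": 0, "large": 0, "spread": None, "healthy": True}

    small = sum(1 for s in sizes_bytes if s < small_mb * MB)   # fragmentation
    large = sum(1 for s in sizes_bytes if s > large_gb * GB)   # hurts parallelism
    smallest = min(sizes_bytes)
    # Ratio of largest to smallest file; infinite if a zero-byte file exists
    spread = max(sizes_bytes) / smallest if smallest > 0 else float("inf")

    return {
        "files": len(sizes_bytes),
        "small": small,
        "large": large,
        "spread": spread,
        "healthy": small == 0 and large == 0 and spread <= spread_limit,
    }


# Hypothetical file listings
print(summarize_file_sizes([200 * MB, 256 * MB, 300 * MB]))
print(summarize_file_sizes([5 * MB, 256 * MB, 3 * GB]))
```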

Set It at the Table Level

A final, critical best practice: prefer table properties over session configurations. Session settings only apply to the current Spark session and disappear when the session ends. Table properties persist across sessions and ensure consistent behavior regardless of which job or notebook writes to the table:

CREATE TABLE schema_name.optimized_table (
  id INT,
  data STRING
) TBLPROPERTIES (
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.parquet.vorder.enabled' = 'true'
)

For tables with write patterns that produce many small files (streaming, high-frequency appends), also add optimize write:

ALTER TABLE schema_name.streaming_table
SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true'
)

This separation ensures optimize write is only applied where it provides value, rather than adding unnecessary write overhead across all tables.
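If you manage more than a handful of tables, it can help to generate these statements instead of hand-editing them. The following is a minimal sketch, assuming you pass the result to spark.sql() in a notebook. The helper names are hypothetical; the property names are the ones used above.

```python
# Properties from the examples above
BASELINE = {
    "delta.autoOptimize.autoCompact": "true",
    "delta.parquet.vorder.enabled": "true",
}
SMALL_FILE_WRITERS = {
    "delta.autoOptimize.optimizeWrite": "true",
}


def properties_for(produces_small_files: bool) -> dict:
    """Baseline properties, plus optimize write only where it adds value."""
    props = dict(BASELINE)
    if produces_small_files:
        props.update(SMALL_FILE_WRITERS)
    return props


def alter_table_properties_sql(table: str, properties: dict) -> str:
    """Build an ALTER TABLE ... SET TBLPROPERTIES statement."""
    if not properties:
        raise ValueError("at least one table property is required")
    assignments = ",\n  ".join(f"'{k}' = '{v}'" for k, v in properties.items())
    return f"ALTER TABLE {table}\nSET TBLPROPERTIES (\n  {assignments}\n)"


statement = alter_table_properties_sql(
    "schema_name.streaming_table", properties_for(produces_small_files=True)
)
print(statement)
# In a notebook you would then run: spark.sql(statement)
```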

The Bottom Line

Table maintenance in Fabric is not a set-it-and-forget-it operation. It is a deliberate strategy tied to your data’s lifecycle: fast ingestion at Bronze, balanced reads at Silver, and tuned-to-the-consumer performance at Gold. The tools — OPTIMIZE, VACUUM, auto-compaction, V-Order, Liquid Clustering — are all available. The question is whether you deploy them with intention.

Start by auditing your Gold tables. Check file sizes and distributions. Enable auto-compaction at the table level and use Adaptive Target File Size (ATFS) to let the engine dynamically determine the right file target — this eliminates most manual tuning and makes separate scheduled OPTIMIZE runs unnecessary for tables with auto-compaction enabled. Enable optimize write selectively — only on tables with write patterns that produce small files (streaming, frequent small appends). Apply V-Order where Power BI or SQL consumes the data. And run VACUUM weekly to reclaim storage.

Your tables will thank you. Your dashboards will thank you faster.

This post was written with help from Claude Opus 4.6

Optimizing Spark Performance with the Native Execution Engine (NEE) in Microsoft Fabric

Spark tuning often starts with the usual suspects (shuffle volume, skew, join strategy, caching)… but sometimes the biggest win is simply executing the same logical plan on a faster engine.

Microsoft Fabric’s Native Execution Engine (NEE) does exactly that: it keeps Spark’s APIs and control plane, but runs a large portion of Spark SQL / DataFrame execution on a vectorized C++ engine.

What NEE is (and why it’s fast)

NEE is a vectorized native engine that integrates into Fabric Spark and can accelerate many SQL/DataFrame operators without you rewriting your code.

You still write Spark SQL / DataFrames.
Spark still handles distributed execution and scheduling.
For supported operators, compute is offloaded to a native engine (reducing JVM overhead and using columnar/vectorized execution).

Fabric documentation calls out NEE as being based on Apache Gluten (the Spark-to-native glue layer) and Velox (the native execution library).

When NEE tends to help the most

NEE shines when your workload is:

SQL-heavy (joins, aggregates, projections, filters)
CPU-bound (compute dominates I/O)
Primarily on Parquet / Delta

You’ll see less benefit (or fallback) when you rely on features NEE doesn’t support yet.

How to enable NEE (3 practical options)

1) Environment-level toggle (recommended for teams)

In your Fabric Environment settings, go to Acceleration and enable the native execution engine, then Save + Publish.

Benefit: notebooks and Spark Job Definitions that use that environment inherit the setting automatically.

2) Enable for a single notebook / job via Spark config

In a notebook cell:

%%configure
{
  "conf": {
    "spark.native.enabled": "true"
  }
}

For Spark Job Definitions, add the same Spark property.

3) Disable/enable per-query when you hit unsupported features

If a specific query uses an unsupported operator/expression and you want to force JVM Spark for that query:

SET spark.native.enabled=FALSE;
-- run the query
SET spark.native.enabled=TRUE;
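In PySpark, the same toggle can be wrapped so the setting is always restored, even if the query throws. This is a sketch, not an official pattern: it assumes spark.native.enabled can be changed through spark.conf at runtime the same way the SET statement does, and the helper name is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def native_engine_disabled(spark, key="spark.native.enabled"):
    """Temporarily force JVM Spark, then restore the previous setting."""
    previous = spark.conf.get(key, None)  # None means the key was not set
    spark.conf.set(key, "false")
    try:
        yield
    finally:
        if previous is None:
            spark.conf.unset(key)
        else:
            spark.conf.set(key, previous)


# Usage in a notebook (needs a live SparkSession named `spark`):
# with native_engine_disabled(spark):
#     spark.sql("SELECT ...").show()
```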

How to confirm NEE is actually being used

Two low-friction checks:

Spark UI / History Server: look for plan nodes ending with Transformer or nodes like *NativeFileScan / VeloxColumnarToRowExec.
df.explain(): the same Transformer / NativeFileScan / Velox… hints should appear in the plan.

Fabric also exposes a dedicated view (“Gluten SQL / DataFrame”) to help spot which queries ran on the native engine vs. fell back.
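If you want to automate the df.explain() check, something like the sketch below can work. It assumes df.explain() prints the plan to standard output (which is how PySpark behaves) and simply searches the text for the node names mentioned above. The helper names and the regular expression are illustrative, and the exact node names in your plans may differ.

```python
import io
import re
from contextlib import redirect_stdout

# Node-name hints from the list above: *Transformer, NativeFileScan, Velox*
NATIVE_MARKERS = re.compile(r"\w*Transformer\b|NativeFileScan|Velox\w*")


def explain_text(df) -> str:
    """Capture what df.explain() prints as a string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def native_nodes(plan: str) -> list:
    """Return the distinct native-engine node names found in a plan string."""
    return sorted(set(NATIVE_MARKERS.findall(plan)))


# Usage in a notebook (needs a real DataFrame named `df`):
# nodes = native_nodes(explain_text(df))
# print(nodes or "No native nodes found; the query likely fell back to the JVM")
```

An empty list does not prove the whole query fell back, and a non-empty one does not prove everything ran natively; treat it as a quick signal before opening the Spark UI.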

Fallback is a feature (but you should know the common triggers)

NEE includes an automatic fallback mechanism: if the plan contains unsupported features, Spark will run that portion on the JVM engine.

A few notable limitations called out in Fabric documentation:

UDFs aren’t supported (fallback)
Structured streaming isn’t supported (fallback)
File formats like CSV/JSON/XML aren’t accelerated
ANSI mode isn’t supported

There are also some behavioral differences worth remembering (rounding/casting edge cases) if you have strict numeric expectations.

A pragmatic “NEE-first” optimization workflow

Turn NEE on for the environment (or your job) and rerun the workload.
If it’s still slow, open the plan and answer: is the slow part running on the native engine, or did it fall back?
If it fell back, make the smallest possible change to keep the query on the native path (e.g., avoid UDFs; prefer built-in expressions; standardize on Parquet/Delta).
Once the plan stays mostly native, go back to classic Spark tuning: reduce shuffle volume, fix skew, sane partitioning, and confirm broadcast joins.


This post was written with help from ChatGPT 5.2

The Best Thing That Ever Happened to Your Spark Pipeline Is a SQL Database

Here’s a counterintuitive claim: the most important announcement for Fabric Spark teams in early 2026 has nothing to do with Spark.

It’s a SQL database.

Specifically, it’s the rapid adoption of SQL database in Microsoft Fabric—a fully managed, SaaS-native transactional database that went GA in November 2025 and has been quietly reshaping how production data flows into lakehouse architectures ever since. If you’re a data engineer running Spark workloads in Fabric, this changes more than you think.

The ETL Pipeline You Can Delete

Most Spark data engineers have a familiar pain point: getting operational data from transactional systems into the lakehouse. You build ingestion pipelines. You schedule nightly batch loads. You wrestle with CDC (change data capture) configurations, watermark columns, and retry logic. You maintain all of it, forever.

SQL database in Fabric eliminates that entire layer.

When data lands in a Fabric SQL database, it’s automatically replicated to OneLake as Delta tables in near real-time. No pipelines. No Spark ingestion jobs. No orchestration. The data just appears, already in the open Delta format your notebooks and Spark jobs expect.

This isn’t a minor convenience—it’s an architectural shift. Every ingestion pipeline you don’t write is a pipeline you don’t debug at 2 AM.

What This Actually Looks Like in Practice

Let’s say you’re building an analytics layer on top of an operational SaaS application. Today, your architecture probably looks something like this:

Application writes to Azure SQL or Cosmos DB
ADF or Spark job pulls data on a schedule
Data lands in a lakehouse as Delta tables
Downstream Spark jobs transform and aggregate

With SQL database in Fabric, steps 2 and 3 vanish. Your application writes directly to the Fabric SQL database, and the mirrored Delta tables are immediately available for Spark processing. Here’s what your downstream notebook looks like now:

# Read operational data directly; no ingestion pipeline needed
# The SQL database auto-mirrors to OneLake as Delta tables
orders_df = spark.read.format("delta").load(
    "abfss://your-workspace@onelake.dfs.fabric.microsoft.com/your-sqldb.SQLDatabase/dbo.Orders"
)

# Your transformation logic stays the same
from pyspark.sql import functions as F

daily_revenue = (
    orders_df
    .filter(F.col("order_date") >= F.date_sub(F.current_date(), 7))
    .groupBy("product_category")
    .agg(
        F.sum("total_amount").alias("revenue"),
        F.countDistinct("customer_id").alias("unique_customers")
    )
    .orderBy(F.desc("revenue"))
)

daily_revenue.write.format("delta").mode("overwrite").saveAsTable("gold.weekly_revenue_by_category")

The Spark code doesn’t change. What changes is everything upstream of it.

The Migration Risk Nobody’s Talking About

Here’s where it gets interesting—and where Malcolm Gladwell would lean forward in his chair. The biggest risk of SQL database in Fabric isn’t technical. It’s organizational.

Teams that have invested heavily in ingestion infrastructure will face a classic innovator’s dilemma: the new path is simpler, but the old path already works. The temptation is to keep running your existing ADF pipelines alongside the new auto-mirroring capability, creating a hybrid architecture that’s worse than either approach alone.

My recommendation: don’t hybrid. Pick a workload, migrate it end-to-end, and measure. Here’s a concrete rollout checklist:

Identify a candidate workload — Look for Spark jobs whose primary purpose is pulling data from a SQL source into Delta tables. These are your highest-value migration targets.
Provision a Fabric SQL database — It takes seconds. You provide a name; Fabric handles the rest. Autoscaling and auto-pause are built in.
Redirect your application writes — Point your operational application to the new Fabric SQL database. The engine is the same SQL Database Engine as Azure SQL, so T-SQL compatibility is high.
Validate the Delta mirror — Confirm that your data is appearing in OneLake. Check schema fidelity, latency, and row counts:

# In your Spark notebook, validate the mirrored data
spark.sql("""
    SELECT COUNT(*) as row_count,
           MAX(modified_date) as latest_record,
           MIN(modified_date) as earliest_record
    FROM your_sqldb.dbo.Orders
""").show()

Decommission the ingestion pipeline — Once validated, turn off the ADF or Spark ingestion job. Don’t just disable it—delete it. Zombie pipelines are how technical debt accumulates.
Update your monitoring — Your existing data quality checks should still work since the Delta tables have the same schema. But update your alerting to watch for mirror latency instead of pipeline run failures.
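The validation and monitoring steps in this checklist boil down to two numbers: how many rows the mirror is missing, and how stale its newest record is. Here is a minimal sketch of that comparison. It assumes you have already fetched the row counts and the latest modified_date yourself (for example with the query above); the function name, the five-minute lag limit, and the tolerance parameter are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def check_mirror(source_rows: int, mirror_rows: int,
                 latest_mirrored: datetime, now: datetime,
                 max_lag: timedelta = timedelta(minutes=5),
                 row_tolerance: int = 0) -> dict:
    """Compare source and mirror row counts and measure mirror lag."""
    missing = source_rows - mirror_rows
    lag = now - latest_mirrored
    return {
        "missing_rows": missing,
        "lag_seconds": lag.total_seconds(),
        "ok": abs(missing) <= row_tolerance and lag <= max_lag,
    }


# Hypothetical numbers for illustration
now = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
result = check_mirror(
    source_rows=1_000_000,
    mirror_rows=1_000_000,
    latest_mirrored=now - timedelta(seconds=90),
    now=now,
)
print(result)
```

On a live system the source keeps taking writes while you count, so a small non-zero row_tolerance is usually more realistic than an exact match.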

The AI Angle Matters for Spark Teams Too

There’s a second dimension to this announcement that Spark engineers should pay attention to: the native vector data type in SQL database supports semantic search and RAG patterns directly in the transactional layer.

Why does that matter for Spark teams? Because it means your embedding pipelines can write vectors back to the same database your application reads from—closing the loop between batch ML processing in Spark and real-time serving in SQL. Instead of maintaining a separate vector store (Pinecone, Qdrant, etc.), you use the same SQL database that’s already mirrored into your lakehouse.

This is the kind of architectural simplification that compounds over time. Fewer systems means fewer failure modes, fewer credentials to manage, and fewer things to explain to your successor.

The Rollout Checklist

This week: Inventory your existing ingestion pipelines. How many just move data from SQL sources to Delta?
This sprint: Provision a Fabric SQL database and test the auto-mirror with a non-critical workload.
This quarter: Migrate your highest-volume ingestion pipeline and measure CU savings.
Track: Mirror latency, CU consumption before/after, and pipeline maintenance hours eliminated.

SQL database in Fabric went GA in November 2025 with enterprise features including row-level security, customer-managed keys, and private endpoints. For the full list of GA capabilities, see the official announcement. To understand how this fits into the broader Microsoft database + Fabric integration strategy, read Microsoft Databases and Microsoft Fabric: Your unified and AI-powered data estate. For Spark-specific Delta Lake concepts, the Delta Lake documentation remains the authoritative reference.

The best thing about this announcement isn’t any single feature. It’s that it makes your Spark architecture simpler by removing the parts that shouldn’t have been there in the first place.

This post was written with help from Claude Opus 4.6

Monitoring Spark Jobs in Real Time in Microsoft Fabric

If Spark performance work is surgery, monitoring is your live telemetry.

Microsoft Fabric gives you multiple monitoring entry points for Spark workloads: Monitor hub for cross-item visibility, item Recent runs for focused context, and application detail pages for deep investigation. This post is a practical playbook for using those together.

Why this matters

When a notebook or Spark job definition slows down, “run it again” is the most expensive way to debug. Real-time monitoring helps you:

spot bottlenecks while jobs are still running
isolate failures quickly
compare behavior across submitters and workspaces

1) Start at the Monitor hub for cross-workspace triage

Use Monitor in the Fabric navigation pane as your control tower.

Filter by item type (Notebook, Spark job definition, Pipeline)
Narrow by start time and workspace
Sort by duration or status to surface outliers

For broad triage, this is faster than jumping directly into individual notebooks.

2) Pivot to Spark application details for root-cause analysis

Once you identify a problematic run, open the Spark application detail page and work through tabs in order:

Jobs: status, stages, tasks, duration, and processed/read/written data
Resources: executor allocation and utilization in near real time
Logs: inspect Livy, Prelaunch, and Driver logs; download when needed
Item snapshots: confirm exactly what code/parameters/settings were used at execution time

This sequence prevents false fixes where you tune the wrong layer.

3) Use notebook contextual monitoring while developing

For iterative tuning, notebook contextual monitoring keeps authoring, execution, and debugging in one place.

Run a target cell/workload
Watch job/stage/task progress and executor behavior
Jump to Spark UI or detail monitoring for deeper traces
Adjust code or config and rerun

4) A lightweight real-time runbook

Confirm scope in the Monitor hub (single run or systemic pattern)
Open application details for the failing/slower run
Check Jobs for stage/task imbalance and long-running segments
Check Resources for executor pressure
Check Logs for explicit failure signals
Verify snapshots so you debug the exact submitted artifact
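For the "stage/task imbalance" step, a rough number helps more than eyeballing a task list. The sketch below is one way to quantify it, assuming you have copied task durations (in seconds) for a stage out of the Jobs tab or Spark UI. The function name and the 3× ratio limit are hypothetical starting points, not a Fabric feature.

```python
from statistics import median


def task_skew(durations_s, ratio_limit=3.0):
    """Compare the slowest task in a stage to the median task."""
    if not durations_s:
        raise ValueError("need at least one task duration")
    med = median(durations_s)
    longest = max(durations_s)
    # A zero median with non-zero tasks is treated as maximally skewed
    ratio = longest / med if med > 0 else float("inf")
    return {
        "median_s": med,
        "max_s": longest,
        "ratio": ratio,
        "skewed": ratio > ratio_limit,
    }


# Hypothetical stage: four quick tasks and one straggler
print(task_skew([10, 11, 12, 10, 95]))
```

A high ratio points at data skew or a slow executor; a low ratio with a long stage points at overall volume instead.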

Common mistakes to avoid

Debugging from memory instead of snapshots
Looking only at notebook cell output and skipping Logs/Resources
Treating one anomalous run as a global trend without Monitor hub filtering


This post was written with help from ChatGPT 5.3

Lakehouse Table Optimization: VACUUM, OPTIMIZE, and Z-ORDER

If your Lakehouse tables are getting slower (or more expensive) over time, it’s often not “Spark is slow.” It’s usually table layout drift: too many small files, suboptimal clustering, and old files piling up.

In Fabric Lakehouse, the three table-maintenance levers you’ll reach for most are:

OPTIMIZE: compacts many small files into fewer, larger files (and can apply clustering)
Z-ORDER: co-locates related values to improve data skipping for common filters
VACUUM: deletes old files that are no longer referenced by the Delta transaction log (after a retention window)

Practical note: in Fabric, run these as Spark SQL in a notebook or Spark job definition (or use the Lakehouse maintenance UI). Don’t try to run them in the SQL Analytics Endpoint.

1) Start with the symptom: “small files” vs “bad clustering”

Before you reach for maintenance, quickly sanity-check what you’re fighting:

Many small files → queries spend time opening/reading lots of tiny Parquet files.
Poor clustering for your most common predicates (date, tenantId, customerId, region, etc.) → queries scan more data than they need.
Heavy UPDATE/DELETE/MERGE patterns → lots of new files + tombstones + time travel files.

If you only have small files, OPTIMIZE is usually your first win.

2) OPTIMIZE: bin-packing for fewer, bigger files

Basic compaction

OPTIMIZE my_table;

Target a subset (example: recent partitions)

OPTIMIZE my_table WHERE date >= date_sub(current_date(), 7);

A useful mental model: OPTIMIZE is rewriting file layout (not changing table results). It’s maintenance, not transformation.

3) Z-ORDER: make your filters cheaper

Z-Ordering is for the case where you frequently query:

WHERE tenantId = ...
WHERE customerId = ...
WHERE deviceId = ... AND eventTime BETWEEN ...

Example:

OPTIMIZE my_table ZORDER BY (tenantId, eventDate);

Pick 1–3 columns that dominate your interactive workloads. If you try to z-order on everything, you’ll mostly burn compute for little benefit.
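If you schedule Z-ORDER from a notebook, a small guard keeps the column list honest. This is a sketch under the assumption that you run the resulting statement with spark.sql(); the helper name is made up, and the optional WHERE clause follows the same OPTIMIZE ... WHERE form shown earlier.

```python
def zorder_sql(table, columns, where=None):
    """Build an OPTIMIZE ... ZORDER BY statement, limited to 1-3 columns."""
    if not 1 <= len(columns) <= 3:
        raise ValueError("pick between 1 and 3 Z-ORDER columns")
    statement = f"OPTIMIZE {table}"
    if where:
        statement += f" WHERE {where}"  # e.g. restrict to recent partitions
    statement += f" ZORDER BY ({', '.join(columns)})"
    return statement


print(zorder_sql("my_table", ["tenantId", "eventDate"]))
print(zorder_sql("my_table", ["tenantId"],
                 where="date >= date_sub(current_date(), 7)"))
# In a notebook you would then run: spark.sql(zorder_sql(...))
```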

4) VACUUM: clean up old, unreferenced files (carefully)

VACUUM is about storage hygiene. Delta keeps old files around to support time travel and concurrent readers. VACUUM deletes files that are no longer referenced and older than the configured retention threshold.

VACUUM my_table;

Two practical rules:

Don’t VACUUM aggressively unless you understand the impact on time travel / rollback.
Treat the retention window as a governance decision (what rollback window do you want?) not just a cost optimization.
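One way to make the retention window an explicit decision is to state it in the statement rather than relying on defaults. The sketch below builds a VACUUM statement with an explicit RETAIN clause and refuses anything shorter than Delta's default of 7 days. The helper name is hypothetical, and the refusal is a policy choice baked into this example, not something the rules above require.

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # Delta's default retention threshold


def vacuum_sql(table, retain_hours=None, dry_run=False):
    """Build a VACUUM statement with an explicit, sanity-checked retention."""
    statement = f"VACUUM {table}"
    if retain_hours is not None:
        if retain_hours < DEFAULT_RETENTION_HOURS:
            raise ValueError(
                f"retention below {DEFAULT_RETENTION_HOURS} hours shortens "
                "time travel and can break concurrent readers"
            )
        statement += f" RETAIN {retain_hours} HOURS"
    if dry_run:
        statement += " DRY RUN"  # list files that would be deleted
    return statement


print(vacuum_sql("my_table", retain_hours=30 * 24, dry_run=True))
# In a notebook you would then run: spark.sql(vacuum_sql(...))
```

Running the DRY RUN form first is a cheap way to see how much a given retention window would actually delete.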

5) Fabric-specific gotchas (the ones that actually bite)

Where you can run these commands

These are Spark SQL maintenance commands. In Fabric, that means notebooks / Spark job definitions (or the Lakehouse maintenance UI), not the SQL Analytics Endpoint.

V-Order and OPTIMIZE

Fabric also has V-Order, which is a Parquet layout optimization aimed at faster reads across Fabric engines. If you’re primarily optimizing for downstream read performance (Power BI/SQL/Spark), it’s worth understanding whether V-Order is enabled for your workspace and table writes.

A lightweight maintenance pattern that scales

Nightly/weekly: OPTIMIZE high-value tables (or recent partitions)
Weekly/monthly: Z-ORDER tables with stable query patterns
Monthly: VACUUM tables where your org’s time travel policy is clear

Treat it like index maintenance: regular, boring, measurable.


This post was written with help from ChatGPT 5.2
