The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

There’s a connector that ships with every Fabric Spark runtime. It’s pre-installed. It requires no setup. And it lets your Spark notebooks read from—and write to—Fabric Data Warehouse tables as naturally as they read Delta tables from a Lakehouse.

Most Fabric Spark users don’t know it exists. The ones who do often run into the same three or four surprises. Let’s fix both problems.

What the connector actually is

The Spark connector for Fabric Data Warehouse (synapsesql) is a built-in extension to the Spark DataFrame API. It uses the TDS protocol to talk directly to the SQL engine behind your Warehouse (or the SQL analytics endpoint of a Lakehouse). You get read and write access to Warehouse tables from PySpark, Scala Spark, or Spark SQL.

One line of code to read:

import com.microsoft.spark.fabric
from com.microsoft.spark.fabric.Constants import Constants

df = spark.read.synapsesql("my_warehouse.dbo.sales_fact")


One line to write:

df.write.mode("append").synapsesql("my_warehouse.dbo.sales_fact")


No connection strings. No passwords. No JDBC driver management. Authentication flows through Microsoft Entra—same identity you’re logged into your Fabric workspace with. The connector resolves the SQL endpoint automatically based on workspace context.

That’s the happy path. Now let’s talk about what actually happens when you use it.

Reading: the part that mostly just works

Reading from a Warehouse table into a Spark DataFrame is the connector’s strength. The synapsesql() call supports the full three-part naming convention: warehouse_name.schema_name.table_or_view_name. It works for tables and views, including views with joins across schemas.

A few things that are genuinely useful:

Predicate pushdown works. When you chain .filter() or .limit() onto your DataFrame, the connector pushes those constraints to the SQL engine. You’re not pulling the full table into Spark memory and then filtering—the SQL engine handles the filter and sends back the subset. This matters when your Warehouse tables have hundreds of millions of rows and you only need a time-sliced sample.

df = spark.read.synapsesql("my_warehouse.dbo.sales_fact") \
    .filter("order_date >= '2026-01-01'") \
    .select("order_id", "customer_id", "amount")


Cross-workspace reads work. If your Warehouse lives in a different workspace than your notebook’s attached Lakehouse, you pass the workspace ID:

df = spark.read \
    .option(Constants.WorkspaceId, "<target-workspace-id>") \
    .option(Constants.DatawarehouseId, "<warehouse-item-id>") \
    .synapsesql("my_warehouse.dbo.sales_fact")


This is genuinely powerful for hub-and-spoke architectures where your curated Warehouse sits in a production workspace and your data science notebooks live in a sandbox workspace.

Parallel reads are available. For large tables, you can partition the read across multiple Spark tasks, similar to spark.read.jdbc:

df = spark.read \
    .option("partitionColumn", "order_id") \
    .option("lowerBound", 1) \
    .option("upperBound", 10000000) \
    .option("numPartitions", 8) \
    .synapsesql("my_warehouse.dbo.sales_fact")


This splits the query into eight parallel reads, each fetching a range of order_id. Without this, you get a single-threaded read that will bottleneck on large tables.

Security models pass through. If your Warehouse has object-level security (OLS), row-level security (RLS), or column-level security (CLS), those policies are enforced when Spark reads the data. Your notebook sees exactly what your identity is authorized to see. This is a meaningful difference from reading Delta files directly via OneLake, where security operates at the workspace or folder level.

Custom T-SQL queries work too. You’re not limited to reading tables—you can pass arbitrary T-SQL:

df = spark.read \
    .option(Constants.DatabaseName, "my_warehouse") \
    .synapsesql("SELECT TOP 1000 * FROM dbo.sales_fact WHERE region = 'WEST'")


This is handy for complex aggregations or when you want the SQL engine to do the heavy lifting before data enters Spark.

Writing: the part with surprises

Write support for the Spark-to-Warehouse connector became generally available with Runtime 1.3. It works, and it solves a real architectural problem—but it has mechanics you need to understand.

How writes actually work under the hood

The connector uses a two-phase process:

  1. Stage: Spark writes your DataFrame to intermediate Parquet files in a staging location.
  2. Load: The connector issues a COPY INTO command, telling the Warehouse SQL engine to ingest from the staged files.

This is the same COPY INTO pattern that powers bulk ingestion into Fabric Data Warehouse generally. It’s optimized for throughput. It is not optimized for latency on small writes.
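Conceptually, the load phase amounts to a COPY INTO statement like the one below. This is an illustrative sketch only: the staging path and options are generated by the connector, never written by hand.

```sql
-- Illustrative only: the connector generates the staging location itself.
COPY INTO dbo.sales_fact
FROM 'https://onelake.dfs.fabric.microsoft.com/<workspace>/<staging-area>/*.parquet'
WITH (
  FILE_TYPE = 'PARQUET'
);
```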

If you’re writing a DataFrame with 50 rows, the overhead of staging files and issuing COPY INTO means the write takes materially longer than you’d expect. For small, frequent writes, this connector is not the right tool. Use T-SQL INSERT statements through a SQL connection instead.

For batch writes of thousands to millions of rows, the connector performs well. The COPY INTO path is what the Warehouse was designed for.
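The row-count rule of thumb above can be captured in a tiny routing helper. This is a sketch: the function name and the 10,000-row threshold are assumptions you should tune against your own workload measurements.

```python
def choose_write_path(row_count: int, bulk_threshold: int = 10_000) -> str:
    """Pick a write strategy for a Fabric Warehouse target.

    The threshold is illustrative: below it, per-write staging overhead
    dominates, so plain T-SQL INSERTs are usually faster; at or above it,
    the connector's COPY INTO bulk path wins on throughput.
    """
    return "synapsesql_bulk" if row_count >= bulk_threshold else "tsql_insert"

print(choose_write_path(50))         # small batch: prefer T-SQL inserts
print(choose_write_path(5_000_000))  # large batch: prefer the connector
```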

Save modes

The connector supports four save modes:

  • errorifexists (default): Fails if the table already exists.
  • ignore: Silently skips the write if the table exists.
  • overwrite: Drops and recreates the table with new data.
  • append: Adds rows to the existing table.

df.write.mode("overwrite").synapsesql("my_warehouse.dbo.daily_aggregates")


A common pattern: Spark computes daily aggregations from Lakehouse Delta tables, then writes the results to a Warehouse table that Power BI reports connect to. The Warehouse’s result set caching (now generally available as of January 2026) means subsequent Power BI refreshes hit cache instead of recomputing.

The timestamp_ntz gotcha

This is the single most common error people hit when writing to a Warehouse from Spark.

If your DataFrame contains timestamp_ntz (timestamp without time zone) columns, the write will fail. Fabric Data Warehouse expects time-zone-aware timestamps. The fix is a cast before you write:

from pyspark.sql.functions import col

# df.dtypes is a list of (column, dtype) pairs; cast any
# timestamp_ntz columns to time-zone-aware timestamps before writing
for c, dtype in df.dtypes:
    if dtype == "timestamp_ntz":
        df = df.withColumn(c, col(c).cast("timestamp"))

df.write.mode("append").synapsesql("my_warehouse.dbo.target_table")


This is not documented prominently enough. If you see a Py4JJavaError during write that mentions type conversion, timestamps are the first thing to check.

What you can’t write to

The connector writes to Warehouse tables only. You cannot write to the SQL analytics endpoint of a Lakehouse—it’s read-only. If you try, you’ll get an error. This seems obvious but trips people up because the same synapsesql() method handles both reads from Warehouses and Lakehouse SQL endpoints.

Private Link limitations

If Private Link is enabled at the workspace level, both read and write operations through the connector are unsupported. If Private Link is enabled at the tenant level only, writes are unsupported but reads still work. This is a significant limitation for security-conscious deployments. Check your network configuration before building pipelines that depend on this connector.

Time Travel is not supported

Fabric Data Warehouse now supports Time Travel queries. However, the Spark connector does not pass through Time Travel syntax. If you need to query a table as of a specific point in time, you’ll need to use a T-SQL connection directly rather than the synapsesql() method.
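For reference, the T-SQL side of a Time Travel query looks like the sketch below (the timestamp is UTC and must fall within the Warehouse's retention window; table and column names are illustrative):

```sql
-- Run via a T-SQL connection; the Spark connector does not accept this syntax.
SELECT order_id, amount
FROM dbo.sales_fact
OPTION (FOR TIMESTAMP AS OF '2026-01-15T00:00:00');
```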

When to use Warehouse vs. Lakehouse as your serving layer

This is the architectural question that the connector’s existence forces you to answer. You’ve got data in your Lakehouse. Spark has transformed it. Now where does it go?

Use Lakehouse Delta tables when:

  • Your consumers are other Spark notebooks or Spark-based ML pipelines.
  • You need schema evolution flexibility (Delta’s schema merge).
  • You want to use OPTIMIZE, VACUUM, and Z-ORDER for table maintenance.
  • Your data scientists need direct file access through OneLake APIs.

Use Warehouse tables when:

  • Your primary consumers are Power BI reports or T-SQL analysts.
  • You need the Warehouse’s result set caching for repeated query patterns.
  • You need fine-grained security (RLS, CLS, OLS) that passes through to all consumers.
  • You want to use T-SQL stored procedures, views, and MERGE statements for downstream transformations.
  • You need cross-database queries that join Warehouse tables with Lakehouse tables or other Warehouse tables.

Use both when:

  • Spark processes and stores data in the Lakehouse (bronze → silver → gold medallion layers), then the connector writes final aggregations or serving tables to the Warehouse.
  • The Warehouse serves as the “last mile” between your data engineering work and your business intelligence layer.

The January 2026 GA of MERGE in Fabric Data Warehouse makes the “write to Warehouse” pattern significantly more useful. You can now do incremental upserts: Spark writes a staging table, then a T-SQL MERGE reconciles it with the target. This is a common pattern in data warehousing that was previously awkward in Fabric.
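A sketch of that upsert flow, assuming Spark has already written dbo.sales_staging through the connector (the table and column names here are illustrative, not from any particular schema):

```sql
-- Reconcile the Spark-written staging table into the serving table.
MERGE dbo.sales_fact AS t
USING dbo.sales_staging AS s
    ON t.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET t.amount = s.amount, t.order_date = s.order_date
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, order_date, amount)
    VALUES (s.order_id, s.customer_id, s.order_date, s.amount);
```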

A concrete pattern: Spark ETL → Warehouse serving layer

Here’s the pattern I see working well in production:

from pyspark.sql.functions import col, count, sum

# 1. Read from Lakehouse Delta tables (Spark native)
bronze = spark.read.format("delta").load("Tables/raw_orders")

# 2. Transform in Spark
silver = bronze.filter(col("status") != "cancelled") \
    .withColumn("order_date", col("order_ts").cast("date")) \
    .withColumn("amount_usd", col("amount") * col("fx_rate"))

gold = silver.groupBy("region", "order_date") \
    .agg(
        count("order_id").alias("order_count"),
        sum("amount_usd").alias("total_revenue")
    )

# 3. Write to Warehouse for Power BI consumption
gold.write.mode("overwrite").synapsesql("analytics_warehouse.dbo.daily_revenue")


The Lakehouse owns the raw and transformed data. Spark does the heavy compute. The Warehouse serves the final tables to downstream consumers with T-SQL access, caching, and fine-grained security.

The alternative—writing gold tables to the Lakehouse and having Power BI connect via the SQL analytics endpoint—also works. But the SQL analytics endpoint has a metadata sync delay after Spark writes new data. The Warehouse table is immediately consistent after the COPY INTO completes. If your reporting needs to reflect the latest pipeline run without a sync lag, the Warehouse path is more reliable.

Cross-database queries: the glue between them

Once you have data in both a Lakehouse and a Warehouse in the same workspace, you can query across them using T-SQL cross-database queries from the Warehouse:

SELECT w.customer_id, w.total_revenue, l.customer_segment
FROM analytics_warehouse.dbo.daily_revenue AS w
JOIN my_lakehouse.dbo.customer_dim AS l
    ON w.customer_id = l.customer_id;


This means your Warehouse doesn’t need to contain all the data. It can hold the curated aggregations while joining against dimension tables that live in the Lakehouse. No data movement. No duplication. The SQL engine resolves both sources through OneLake.

Performance notes from the field

A few observations from real workloads:

Reads are faster than you expect. The TDS protocol connection to the Warehouse SQL engine is efficient. For typical analytical queries returning thousands to low millions of rows, the synapsesql() read is competitive with reading Delta files directly, especially when the Warehouse has statistics and result set caching enabled.

Writes are slower than Lakehouse writes. The two-phase staging + COPY INTO process adds overhead versus a direct df.write.format("delta").save() to Lakehouse tables. For a DataFrame with 10 million rows, expect the Warehouse write to take 2-5x longer than an equivalent Lakehouse Delta write. This is the tradeoff for getting immediate T-SQL access with full Warehouse capabilities.

Use parallel reads for large tables. The default single-partition read will bottleneck. Set numPartitions to match your Spark cluster’s available cores for large reads. The performance improvement is often 4-8x.

Proactive and incremental statistics refresh. As of January 2026, Fabric Data Warehouse supports proactive statistics refresh and incremental statistics. This means the query optimizer keeps statistics up to date automatically. Your synapsesql() reads benefit from better query plans without manual UPDATE STATISTICS calls.

The honest summary

The Spark connector for Fabric Data Warehouse is a well-designed bridge between two systems that many teams use side by side. It makes the read path simple and the write path possible without leaving your Spark notebook.

It is not a replacement for writing to Lakehouse Delta tables. It is an additional output path for when your downstream consumers need T-SQL, fine-grained security, result set caching, or immediate consistency. Use it when the Warehouse is the right serving layer. Don’t use it when Lakehouse is sufficient.

The biggest wins come from combining both: Spark for compute, Lakehouse for storage, Warehouse for serving. The connector is the plumbing that makes that architecture work without data pipelines in between.

If you’re heading to FabCon Atlanta (March 16-20, 2026), both the Data Warehouse and Data Engineering teams will be there. It’s a good place to pressure-test your architecture and see what’s coming next.


This post was written with help from anthropic/claude-opus-4-6

Microsoft Fabric Warehouse + Spark: Interoperability Patterns That Actually Work

If you’ve spent any time in a Fabric workspace with both Data Engineering (Spark) and Data Warehouse, you’ve probably had this moment:

  • Spark is great for big transformations, complex parsing, and “just let me code it.”
  • The Warehouse is great for a curated SQL model, concurrency, and giving the BI world a stable contract.
  • And yet… teams still end up copying data around like they’re paid by the duplicate.

The good news: Fabric’s architectural bet is that OneLake + Delta is the contract surface across engines. That means you can design a pipeline where Spark and Warehouse cooperate instead of competing.

This post is a practical field guide to the integration patterns that work well in real projects:

  1. 3-part naming over the SQL endpoint (zero-copy default) – query Lakehouse Delta tables directly from Warehouse SQL without moving data.
  2. Spark → Warehouse (file-based ingest) using COPY INTO and OPENROWSET over OneLake paths – when workload evidence calls for materialization.
  3. Spark → Warehouse (table-to-table ingest) using cross-database queries / CTAS / INSERT…SELECT – same trigger.
  4. Warehouse → Spark (read-only consumption) by reading the Warehouse table’s published Delta logs from Spark.

Along the way, I’ll call out the trade-offs, the gotchas, and the operational guardrails that keep teams out of trouble.


Mental model: OneLake is the handshake

In Fabric, multiple experiences can produce and consume Delta Lake tables. Microsoft Learn describes Delta Lake as the standard analytics table format in Fabric, and notes that Delta tables produced by one engine (including Fabric Data Warehouse and Spark) can be consumed by other engines.

So instead of thinking “Spark output” and “Warehouse tables” as two unrelated worlds, treat them as:

  • A shared storage plane (OneLake)
  • An open table format (Delta + Parquet)
  • Two compute engines with different strengths

The rest is just choosing where to materialize — or whether to materialize at all.


Start here: 3-Part Naming over the SQL Endpoint

Before you copy anything, ask: do I actually need a separate materialized table?

Fabric’s SQL analytics endpoint automatically exposes every Lakehouse Delta table as a queryable SQL object. From the Warehouse, you can reference those tables directly using 3-part naming:

SELECT *
FROM MyLakehouse.dbo.clean_sales
WHERE OrderDate >= '2026-01-01';

No COPY INTO. No CTAS. No duplicate storage. The query runs against the Lakehouse’s Delta files through the SQL endpoint — zero-copy interoperability out of the box.

When this is enough (and it often is)

  • Ad-hoc analytics and exploration across Spark-produced datasets.
  • Lightweight joins between Warehouse dimensions and Lakehouse facts.
  • BI semantic models that don’t need sub-second concurrency at scale.
  • Early-stage projects where the workload profile isn’t settled yet.

When to materialize instead

Materialize into dedicated Warehouse tables (COPY INTO, CTAS, INSERT…SELECT) when workload evidence justifies it:

  • High concurrency: many concurrent queries hitting the same dataset consistently.
  • Recurring heavy joins/aggregations: repeated complex queries where pre-materialized tables measurably reduce compute.
  • Stricter SLA / CU predictability: when you need tighter control over query performance and capacity consumption.
  • Governance boundaries: when the Warehouse should own and version the serving-layer schema independently from the Lakehouse.

If none of those conditions apply, 3-part naming is the right default. You can always materialize later when the numbers say you should.

The CU tradeoff

Virtualization (3-part naming) shifts cost to query-time: every read traverses the SQL endpoint and pays CU at execution. Materialization (COPY INTO / CTAS) pays an ingestion and storage cost once, so repeated reads are faster and more predictable in CU terms. Neither is universally better — the right call depends on query frequency, data volume, and your capacity budget.
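One way to reason about that break-even point is a back-of-envelope model. Everything below is hypothetical (function name, CU figures, the 4x read-cost ratio); real CU accounting is more nuanced than a two-term comparison.

```python
def prefer_materialization(read_cu: float, materialized_read_cu: float,
                           ingest_cu: float, reads_over_horizon: int) -> bool:
    """Rough break-even test: materialize when the CU saved across all
    expected reads outweighs the one-time ingestion cost."""
    virtual_total = read_cu * reads_over_horizon
    materialized_total = ingest_cu + materialized_read_cu * reads_over_horizon
    return materialized_total < virtual_total

# A dataset read 5 times a day for 30 days, where a materialized read
# is assumed to be 4x cheaper than a virtualized one:
print(prefer_materialization(read_cu=8.0, materialized_read_cu=2.0,
                             ingest_cu=500.0, reads_over_horizon=150))  # True
```

With only 10 reads over the horizon the same inputs return False: the ingestion cost never pays itself back, which is exactly the "let the numbers decide" argument.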


Pattern 1 – Spark → Warehouse via OneLake files (COPY INTO + OPENROWSET)

When to use it

Start with 3-part naming. Reach for COPY INTO / OPENROWSET file-based ingest only when workload evidence (sustained concurrency pressure, SLA requirements, or CU unpredictability) tells you virtualization isn’t enough. This pattern fits when:

  • Your Spark pipeline already produces files (Parquet/CSV/JSONL) under a Lakehouse Files path.
  • You need faster or more predictable query performance than the SQL endpoint provides for this dataset.
  • You want a clean separation: Spark writes files; Warehouse owns the serving tables.

Step 1: Write a “handoff” dataset from Spark

In Spark, write a handoff dataset into the Lakehouse Files area (not Tables). Conceptually:

(
  df
  .write
  .mode("overwrite")
  .format("parquet")
  .save("Files/handoff/sales_daily/")
)


Why Files? Because the Warehouse can point COPY INTO / OPENROWSET at file paths, and the Files area is designed to hold arbitrary file layouts.

Step 2: Inspect the file shape from the Warehouse (OPENROWSET)

Before you ingest, use OPENROWSET to browse a file (or a set of files) and confirm the schema is what you think it is.

Microsoft Learn documents that Fabric Warehouse OPENROWSET can read Parquet/CSV files, and that the files can be stored in Azure Blob Storage, ADLS, or Fabric OneLake (with OneLake reads called out as preview).

SELECT TOP 10 *
FROM OPENROWSET(
  BULK 'https://onelake.dfs.fabric.microsoft.com/<workspaceId>/<lakehouseId>/Files/handoff/sales_daily/*.parquet'
) AS rows;


Step 3: Ingest into a Warehouse table (COPY INTO)

The Fabric blog announcement for OneLake as a source for COPY INTO and OPENROWSET highlights the point of this feature: load and query Lakehouse file folders without external staging storage or SAS tokens.

COPY INTO dbo.SalesDaily
FROM 'https://onelake.dfs.fabric.microsoft.com/<workspaceId>/<lakehouseId>/Files/handoff/sales_daily/'
WITH (
  FILE_TYPE = 'PARQUET'
);


Operational guardrails

  • Treat the Files path as a handoff contract: version it, keep it predictable, and don’t “just drop random stuff in there.”
  • If you’ll query the same external data repeatedly, ingest it into a dedicated Warehouse table (Microsoft Learn notes repeated OPENROWSET access can be slower than querying a table).

Pattern 2 – Spark → Warehouse via in-workspace tables (CTAS / INSERT…SELECT)

When to use it

As with Pattern 1, start with 3-part naming and materialize via CTAS / INSERT…SELECT only when workload metrics confirm you need it. This pattern fits when:

  • Your Spark output is naturally a Delta table (Lakehouse Tables area) and 3-part naming queries against it hit concurrency or performance limits.
  • You want the Warehouse to own a curated serving-layer model (joins, dimensional modeling, computed columns) with predictable CU spend.
  • You prefer SQL-native table-to-table pipelines over file-level ingestion.

Step 1: Produce a curated Delta table with Spark

(
  df_clean
  .write
  .mode("overwrite")
  .format("delta")
  .save("Tables/clean_sales")
)


Step 2: Materialize a Warehouse table from the Lakehouse table

Microsoft Learn notes that for T-SQL ingestion, you can use patterns like INSERT…SELECT, SELECT INTO, or CREATE TABLE AS SELECT (CTAS) to create or update tables from other items in the same workspace (including lakehouses).

CREATE TABLE dbo.FactSales
AS
SELECT
  OrderDate,
  StoreId,
  ProductId,
  Quantity,
  NetAmount
FROM MyLakehouse.dbo.clean_sales;


For incremental loads you’ll often end up with a staging + merge strategy, but the key idea stays the same: Spark produces the curated dataset; the Warehouse owns the serving tables.
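One lighter-weight alternative to a full staging + merge pipeline is a watermark-based INSERT…SELECT, sketched below. Column and table names are illustrative, and production pipelines usually track the watermark in a control table rather than deriving it with MAX():

```sql
-- Load only rows newer than what the Warehouse already holds.
INSERT INTO dbo.FactSales (OrderDate, StoreId, ProductId, Quantity, NetAmount)
SELECT s.OrderDate, s.StoreId, s.ProductId, s.Quantity, s.NetAmount
FROM MyLakehouse.dbo.clean_sales AS s
WHERE s.OrderDate > (SELECT COALESCE(MAX(OrderDate), '1900-01-01')
                     FROM dbo.FactSales);
```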


Pattern 3 – Warehouse → Spark via published Delta logs (read-only)

This is the pattern that surprises people (in a good way): the Warehouse isn’t a closed box.

Microsoft Learn documents that Warehouse user tables are stored in Parquet, and that Delta Lake logs are published for all user tables. The key consequence is that any engine that can read Delta tables can get direct access to Warehouse tables – read-only.

Step 1: Get the OneLake path for a Warehouse table

In the Warehouse UI, table Properties exposes the table’s URL / ABFS URI (Learn walks through the steps).

Step 2: Read the Warehouse table from Spark (read-only)

warehouse_table_path = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<warehouseId>/Tables/dbo/FactSales"

fact_sales_df = spark.read.format("delta").load(warehouse_table_path)

  • This access is read-only from Spark. Writes must go through the Warehouse to maintain ACID compliance.
  • Delta log publishing is a background process after commits, so treat cross-engine visibility as “near real-time,” not “every millisecond.”

Bonus control: pause Delta log publishing

The same Learn doc describes an operational lever you can use when you need stability during a large set of changes:

ALTER DATABASE CURRENT SET DATA_LAKE_LOG_PUBLISHING = PAUSED;
-- ... bulk updates ...
ALTER DATABASE CURRENT SET DATA_LAKE_LOG_PUBLISHING = AUTO;


When publishing is paused, other engines see the pre-pause snapshot; Warehouse queries still see the latest.


Choosing an ownership model (so you don’t end up with two sources of truth)

The integration is easy. The contract is the hard part.

A simple rule that prevents a lot of pain:

  • If Spark is writing it: Warehouse can ingest it, but Spark owns the dataset.
  • If Warehouse is writing it: Spark can read it, but Warehouse owns the dataset.

In other words: pick one writer.

For most analytics teams, a good default is:

  • Spark owns bronze/silver (raw + cleaned Delta in the Lakehouse)
  • Warehouse owns gold (facts/dimensions, KPI-ready serving tables) — but “owns” doesn’t always mean “physically copies.” A cross-database query via 3-part naming can serve gold-layer reads without materialization.

Start with 3-part naming for cross-engine reads. Materialize across the boundary only when workload metrics — not assumptions — tell you to. Remember: virtualization shifts CU cost to query-time; materialization front-loads ingestion and storage so repeated reads are cheaper and more predictable. Let your actual usage patterns decide.


Quick checklist: production-hardening the Spark ↔ Warehouse boundary

  • Make the handoff explicit (a specific Files path or a specific Lakehouse table).
  • Version your schema (breaking changes should be intentional and tested).
  • Avoid singleton inserts into Warehouse; prefer bulk patterns (CTAS, INSERT…SELECT).
  • Validate row counts and freshness after each load (and alert on drift).
  • Treat Delta log publishing as eventual across engines; design your BI/ML expectations accordingly.
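The row-count check from the list above can be as simple as the sketch below. The function name, the drift tolerance, and how you obtain the two counts are all up to your pipeline; this only shows the shape of the guardrail.

```python
def validate_load(source_count: int, target_count: int,
                  max_drift_ratio: float = 0.0) -> None:
    """Fail loudly when the Warehouse table drifts from the Spark source.

    max_drift_ratio=0.0 demands exact equality; a small positive value
    tolerates late arrivals landing between the two count queries.
    """
    if source_count == 0:
        raise ValueError("source produced zero rows; refusing to validate")
    drift = abs(source_count - target_count) / source_count
    if drift > max_drift_ratio:
        raise ValueError(
            f"row-count drift {drift:.2%} exceeds {max_drift_ratio:.2%} "
            f"(source={source_count}, target={target_count})"
        )

validate_load(1_000_000, 1_000_000)       # exact match: passes
validate_load(1_000_000, 999_500, 0.001)  # 0.05% drift within tolerance
```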

Summary

Fabric is at its best when you let each engine do what it’s good at:

  • Spark for transformation, enrichment, and complex data engineering logic.
  • Warehouse for the curated serving model and SQL-first consumers.

OneLake + Delta is the glue. Start with 3-part naming for zero-copy interoperability across engines, and materialize only when workload evidence justifies the extra storage and ingestion cost. That way you get the simplicity of one logical data layer without paying for copies you don’t need.

This post was written with help from Opus 4.6
