
If you’ve spent any time in a Fabric workspace with both Data Engineering (Spark) and Data Warehouse, you’ve probably had this moment:
- Spark is great for big transformations, complex parsing, and “just let me code it.”
- The Warehouse is great for a curated SQL model, concurrency, and giving the BI world a stable contract.
- And yet… teams still end up copying data around like they’re paid by the duplicate.
The good news: Fabric’s architectural bet is that OneLake + Delta is the contract surface across engines. That means you can design a pipeline where Spark and Warehouse cooperate instead of competing.
This post is a practical field guide to the integration patterns that work well in real projects:
- Spark → Warehouse (file-based ingest) using
COPY INTOandOPENROWSETover OneLake paths. - Spark → Warehouse (table-to-table ingest) using cross-database queries /
CTAS/INSERT…SELECTinside the workspace. - Warehouse → Spark (read-only consumption) by reading the Warehouse table’s published Delta logs from Spark.
Along the way, I’ll call out the trade-offs, the gotchas, and the operational guardrails that keep teams out of trouble.
Mental model: OneLake is the handshake
In Fabric, multiple experiences can produce and consume Delta Lake tables. Microsoft Learn describes Delta Lake as the standard analytics table format in Fabric, and notes that Delta tables produced by one engine (including Fabric Data Warehouse and Spark) can be consumed by other engines.
So instead of thinking “Spark output” and “Warehouse tables” as two unrelated worlds, treat them as:
- A shared storage plane (OneLake)
- An open table format (Delta + Parquet)
- Two compute engines with different strengths
The rest is just choosing where to materialize and who owns writes.
Pattern 1 — Spark → Warehouse via OneLake files (COPY INTO + OPENROWSET)
When to use it
Use this pattern when:
- Your Spark pipeline already produces files (Parquet/CSV/JSONL) under a Lakehouse Files path.
- You want a fast, SQL-native ingest into Warehouse tables.
- You want a clean separation: Spark writes files; Warehouse owns the serving tables.
Step 1: Write a “handoff” dataset from Spark
In Spark, write a handoff dataset into the Lakehouse Files area (not Tables). Conceptually:
(
df
.write
.mode("overwrite")
.format("parquet")
.save("Files/handoff/sales_daily/")
)
Why Files? Because the Warehouse can point COPY INTO / OPENROWSET at file paths, and the Files area is designed to hold arbitrary file layouts.
Step 2: Inspect the file shape from the Warehouse (OPENROWSET)
Before you ingest, use OPENROWSET to browse a file (or a set of files) and confirm the schema is what you think it is.
Microsoft Learn documents that Fabric Warehouse OPENROWSET can read Parquet/CSV files, and that the files can be stored in Azure Blob Storage, ADLS, or Fabric OneLake (with OneLake reads called out as preview).
SELECT TOP 10 *
FROM OPENROWSET(
BULK 'https://onelake.dfs.fabric.microsoft.com/<workspaceId>/<lakehouseId>/Files/handoff/sales_daily/*.parquet'
) AS rows;
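If you don’t want to rely on schema inference, you can pin the columns you expect with a WITH clause. Here’s a minimal sketch, assuming the same handoff path as above; the column names and types are illustrative:
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://onelake.dfs.fabric.microsoft.com/<workspaceId>/<lakehouseId>/Files/handoff/sales_daily/*.parquet'
)
WITH (
    OrderDate date,
    StoreId int,
    ProductId int,
    Quantity int,
    NetAmount decimal(18, 2)
) AS rows;
If the declared types don’t match what’s actually in the Parquet files, this is where you find out — before anything lands in a serving table.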
Step 3: Ingest into a Warehouse table (COPY INTO)
The Fabric blog announcement for OneLake as a source for COPY INTO and OPENROWSET highlights the point of this feature: load and query Lakehouse file folders without external staging storage or SAS tokens.
COPY INTO dbo.SalesDaily
FROM 'https://onelake.dfs.fabric.microsoft.com/<workspaceId>/<lakehouseId>/Files/handoff/sales_daily/'
WITH (
FILE_TYPE = 'PARQUET'
);
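If the handoff is CSV rather than Parquet, the same statement works with a few extra options. A hedged sketch, assuming comma-delimited files with a header row (the folder name is illustrative):
COPY INTO dbo.SalesDaily
FROM 'https://onelake.dfs.fabric.microsoft.com/<workspaceId>/<lakehouseId>/Files/handoff/sales_daily_csv/'
WITH (
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2  -- skip the header row
);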
Operational guardrails
- Treat the Files path as a handoff contract: version it, keep it predictable, and don’t “just drop random stuff in there.”
- If you’ll query the same external data repeatedly, ingest it into a dedicated Warehouse table; Microsoft Learn notes that repeated OPENROWSET access can be slower than querying a table. (A CTAS sketch follows below.)
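One way to do that “ingest it once” step in a single statement is CTAS over OPENROWSET. A sketch, assuming the same Parquet handoff path; the target table name is illustrative:
CREATE TABLE dbo.SalesDailyStaged
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://onelake.dfs.fabric.microsoft.com/<workspaceId>/<lakehouseId>/Files/handoff/sales_daily/*.parquet'
) AS rows;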
Pattern 2 — Spark → Warehouse via in-workspace tables (CTAS / INSERT…SELECT)
When to use it
- Your Spark output is naturally a Delta table (Lakehouse Tables area).
- You want the Warehouse model to be built from workspace items using SQL (joins, dimensional modeling, serving-layer logic).
- You want to avoid file-level ingestion steps.
Step 1: Produce a curated Delta table with Spark
(
df_clean
.write
.mode("overwrite")
.format("delta")
.save("Tables/clean_sales")
)
Step 2: Materialize a Warehouse table from the Lakehouse table
Microsoft Learn notes that for T-SQL ingestion, you can use patterns like INSERT…SELECT, SELECT INTO, or CREATE TABLE AS SELECT (CTAS) to create or update tables from other items in the same workspace (including lakehouses).
CREATE TABLE dbo.FactSales
AS
SELECT
OrderDate,
StoreId,
ProductId,
Quantity,
NetAmount
FROM MyLakehouse.dbo.clean_sales;
For incremental loads you’ll often end up with a staging + merge strategy, but the key idea stays the same: Spark produces the curated dataset; the Warehouse owns the serving tables.
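Here’s a minimal sketch of that staging approach, sticking to DELETE plus INSERT…SELECT so it doesn’t depend on MERGE support; the table and column names follow the CTAS example above, and the three-day reload window is illustrative:
-- 1) Stage the slice being reloaded from the Lakehouse table
CREATE TABLE dbo.StageSales
AS
SELECT OrderDate, StoreId, ProductId, Quantity, NetAmount
FROM MyLakehouse.dbo.clean_sales
WHERE OrderDate >= DATEADD(day, -3, CAST(GETDATE() AS date));

-- 2) Remove the days being reloaded, then append the fresh rows
DELETE FROM dbo.FactSales
WHERE OrderDate IN (SELECT DISTINCT OrderDate FROM dbo.StageSales);

INSERT INTO dbo.FactSales (OrderDate, StoreId, ProductId, Quantity, NetAmount)
SELECT OrderDate, StoreId, ProductId, Quantity, NetAmount
FROM dbo.StageSales;

-- 3) Clean up the staging table (or keep it around for auditing)
DROP TABLE dbo.StageSales;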
Pattern 3 — Warehouse → Spark via published Delta logs (read-only)
This is the pattern that surprises people (in a good way): the Warehouse isn’t a closed box.
Microsoft Learn documents that Warehouse user tables are stored in Parquet, and that Delta Lake logs are published for all user tables. The key consequence is that any engine that can read Delta tables can get direct access to Warehouse tables — read-only.
Step 1: Get the OneLake path for a Warehouse table
In the Warehouse UI, table Properties exposes the table’s URL / ABFS URI (Learn walks through the steps).
Step 2: Read the Warehouse table from Spark (read-only)
warehouse_table_path = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<warehouseId>/Tables/dbo/FactSales"
fact_sales_df = spark.read.format("delta").load(warehouse_table_path)
- This access is read-only from Spark. Writes must go through the Warehouse to maintain ACID compliance.
- Delta log publishing is a background process after commits, so treat cross-engine visibility as “near real-time,” not “every millisecond.”
Bonus control: pause Delta log publishing
The same Learn doc describes an operational lever you can use when you need stability during a large set of changes:
ALTER DATABASE CURRENT SET DATA_LAKE_LOG_PUBLISHING = PAUSED;
-- ... bulk updates ...
ALTER DATABASE CURRENT SET DATA_LAKE_LOG_PUBLISHING = AUTO;
When publishing is paused, other engines see the pre-pause snapshot; Warehouse queries still see the latest.
Choosing an ownership model (so you don’t end up with two sources of truth)
The integration is easy. The contract is the hard part.
A simple rule that prevents a lot of pain:
- If Spark is writing it: Warehouse can ingest it, but Spark owns the dataset.
- If Warehouse is writing it: Spark can read it, but Warehouse owns the dataset.
In other words: pick one writer.
For most analytics teams, a good default is:
- Spark owns bronze/silver (raw + cleaned Delta in the Lakehouse)
- Warehouse owns gold (facts/dimensions, KPI-ready serving tables)
Quick checklist: production-hardening the Spark ↔ Warehouse boundary
- Make the handoff explicit (a specific Files path or a specific Lakehouse table).
- Version your schema (breaking changes should be intentional and tested).
- Avoid singleton inserts into the Warehouse; prefer bulk patterns (CTAS, INSERT…SELECT).
- Validate row counts and freshness after each load, and alert on drift (see the example after this checklist).
- Treat Delta log publishing as eventual across engines; design your BI/ML expectations accordingly.
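For the row-count and freshness check, a simple cross-database comparison goes a long way. A sketch using the table names from Pattern 2; swap in your own model:
SELECT
    (SELECT COUNT_BIG(*) FROM MyLakehouse.dbo.clean_sales) AS source_rows,
    (SELECT COUNT_BIG(*) FROM dbo.FactSales) AS target_rows,
    (SELECT MAX(OrderDate) FROM dbo.FactSales) AS latest_order_date;
Wire that into whatever orchestrates the load (a pipeline activity, a notebook step) and alert when the numbers drift.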
Summary
Fabric is at its best when you let each engine do what it’s good at:
- Spark for transformation, enrichment, and complex data engineering logic.
- Warehouse for the curated serving model and SQL-first consumers.
OneLake + Delta is the glue. Use it deliberately, and you can build an end-to-end pipeline that feels like one system instead of two that awkwardly share a closet.
References
- Delta Lake table format interoperability (Microsoft Learn)
- Delta Lake Logs in Warehouse (Microsoft Learn)
- Ingest data into the Warehouse (Microsoft Learn)
- Browse file content with OPENROWSET (Microsoft Learn)
- OneLake as a source for COPY INTO and OPENROWSET (Preview) (Microsoft Fabric Blog)
