Open Mirroring + OneLake: Spark patterns that keep latency from eating your weekends

Dev is clean. Prod is chaos. In dev, your mirrored table has a cute little dataset and Spark tears through it. In prod, that same notebook starts wheezing like it ran a marathon in wet jeans.

If that sounds familiar, good. You’re not cursed. You’re running into architecture debt that Open Mirroring does not solve for you.

Open Mirroring in Microsoft Fabric does exactly what it says on the tin: it replicates data from external systems into OneLake as Delta tables, and schema changes in the source can flow through. That’s huge. It cuts out a pile of ingestion plumbing.

But mirroring only lands data. It does not guarantee your Spark reads will be fast, stable, or predictable. That’s your job.

This post is the practical playbook: what breaks, why it breaks, and the patterns that keep your Spark jobs from turning into slow-motion disasters.

first principle: mirrored is a landing zone, not a serving layer

Treat mirrored tables like an airport runway. Planes touch down there. People do not set up a picnic on the tarmac.

When teams read mirrored tables directly in hot-path jobs, they inherit whatever file layout the connector produced. Sometimes that layout is fine. Sometimes it is a junk drawer.

Spark is sensitive to this. Reading many tiny files creates scheduling and metadata overhead. Reading a few huge files kills parallelism. Either way, the cluster burns time doing the wrong work.

The fix is boring and absolutely worth it: add a curated read layer.

Let Open Mirroring write into a dedicated mirror lakehouse.
Run a post-mirror notebook that reshapes data for Spark (partitioning, compaction, cleanup).
Have production notebooks read curated tables only.

One extra hop. Much better nights of sleep.

what actually causes the latency cliff

Two things usually punch you in the face at scale:

File layout drift
Schema drift

Let’s tackle them in order.

1) file layout drift (the silent killer)

Spark scheduling is roughly file-driven for Parquet/Delta scans. That means file shape becomes execution shape. If your table has wildly uneven files, your job speed is set by the stragglers.

Think of ten checkout lanes where nine customers have one item and one customer has a full garage sale cart. Everyone waits on that last lane.

Start by measuring file distribution, not just row counts.

from pyspark.sql import functions as F

# NOTE: inputFiles() returns a Python list of file paths
df = spark.read.format("delta").load("Tables/raw_mirrored_orders")
paths = df.inputFiles()

# Use Hadoop FS to get file sizes in bytes
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)

sizes = []
for p in paths:
    size = fs.getFileStatus(jvm.org.apache.hadoop.fs.Path(p)).getLen()
    sizes.append((p, size))

size_df = spark.createDataFrame(sizes, ["path", "size_bytes"])

size_df.select(
    F.count("*").alias("file_count"),
    F.round(F.avg("size_bytes")/1024/1024, 2).alias("avg_mb"),
    F.round(F.expr("percentile_approx(size_bytes, 0.5)")/1024/1024, 2).alias("p50_mb"),
    F.round(F.expr("percentile_approx(size_bytes, 0.9)")/1024/1024, 2).alias("p90_mb"),
    F.round(F.max("size_bytes")/1024/1024, 2).alias("max_mb")
).show(truncate=False)

You want a tight-ish band, not chaos. A common rule of thumb is targeting roughly 128 MB to 512 MB Parquet files for balanced throughput and parallelism. Rule of thumb, not religion. Your workload decides final tuning.

Then enforce a sane shape in curated tables:

raw = spark.read.format("delta").load("Tables/raw_mirrored_orders")

(raw.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")         # choose columns your queries actually filter on
    .option("maxRecordsPerFile", 500000)
    .save("Tables/curated_orders"))

spark.sql("OPTIMIZE delta.`Tables/curated_orders`")

If your queries filter by date and region, but you partition by customer_id because it “felt right,” you built a latency trap with your own hands.

2) schema drift (the 3 a.m. pager)

Open Mirroring can propagate source schema changes. That’s useful and dangerous.

Useful because your lake stays aligned. Dangerous because downstream logic often assumes a fixed shape.

A nullable column addition is usually fine. A type shift on a key metric column can quietly corrupt aggregations or explode at runtime.

Do not “notice this later.” Gate on it.

from pyspark.sql.types import StructType
import json

# Store baseline schema as JSON in Files/schemas/orders_baseline.json
with open("/lakehouse/default/Files/schemas/orders_baseline.json", "r") as f:
    baseline = StructType.fromJson(json.load(f))

current = spark.read.format("delta").load("Tables/raw_mirrored_orders").schema

base = {f.name: str(f.dataType) for f in baseline.fields}
curr = {f.name: str(f.dataType) for f in current.fields}

type_changes = [
    f"{name}: {base[name]} -> {curr[name]}"
    for name in curr
    if name in base and base[name] != curr[name]
]

new_cols = [name for name in curr if name not in base]

if type_changes:
    raise ValueError(f"Schema type changes detected: {type_changes}")

# Optional policy: allow new nullable columns but log them
if new_cols:
    print(f"New columns detected: {new_cols}")

Policy matters more than code here. Decide in advance what is auto-accepted versus what blocks the pipeline. Write it down. Enforce it every run.

lag is real, even when everything is healthy

Mirroring pipelines are replication systems, not teleportation devices. There is always some delay between source commit and mirrored availability. Sometimes tiny. Sometimes not.

If your job blindly processes “last hour” windows without checking mirror freshness, you’ll create holes and call them “data quality issues” three weeks later.

Add a freshness gate before transformations. The metadata source is connector-specific, but the pattern is universal:

from datetime import datetime, timedelta, timezone

# Example only: use the metadata table/view exposed by your mirroring setup
last_mirror_ts = spark.sql("""
  SELECT max(replication_commit_ts) as ts
  FROM mirror_metadata.orders_status
""").collect()[0]["ts"]

required_freshness = datetime.now(timezone.utc) - timedelta(minutes=15)

if last_mirror_ts is None or last_mirror_ts < required_freshness:
    raise RuntimeError(
        f"Mirror not fresh enough. Last commit: {last_mirror_ts}, required after: {required_freshness}"
    )

No freshness, no run. That one line saves you from publishing confident nonsense.

the production checklist (use this before go-live)

Before promoting any mirrored-data Spark pipeline, run this checklist in the same capacity and schedule window as production:

File shape check
Measure file count and distribution (p50, p90, max).
If distribution is ugly, compact and rewrite in curated.
Partition sanity check
Confirm partitions match real filter predicates.
Use df.explain(True) and verify PartitionFilters is not empty for common queries.
Schema gate check
Compare current schema to baseline.
Fail on type changes unless explicitly approved.
Freshness gate check
Validate mirrored data is fresh enough for your downstream SLA.
Fail fast if not.
Throughput reality check
Time representative full and filtered scans from curated tables.
If runtime misses SLA, fix layout first, then tune compute.

If you only do one thing from this list, do the curated layer. Direct reads from mirrored tables are the root of most performance horror stories.

architecture that holds up when volume gets ugly

Keep it simple:

Mirror layer
Open Mirroring lands source data in OneLake Delta tables. This is your raw replica.
Curation job
A scheduled Spark notebook validates schema, reshapes partitions, and compacts files.
Curated layer
Downstream Spark notebooks and SQL consumers read here, not from mirror tables.
Freshness gate
Every downstream run checks replication freshness before processing.

That’s it. No heroics. No mystery knobs. Just a clean boundary between “data landed” and “data is ready to serve.”

Open Mirroring is genuinely powerful, but it is not magic. If you treat mirrored tables as production-ready serving tables, latency will eventually kneecap you. If you treat them as a landing zone and curate aggressively, Spark behaves, stakeholders stay calm, and your weekends stay yours.

This post was written with help from anthropic/claude-opus-4-6

What the February 2026 Fabric Influencers Spotlight means for your Spark team

Microsoft published its February 2026 Fabric Influencers Spotlight last week. Twelve community posts. MVPs and Super Users. Most people skim the list. Maybe bookmark a link. Move on.

Don’t.

Three of those posts carry signals that should change how your Spark data-engineering team operates in production. Not next quarter. Now.

Signal 1: Get your production code out of notebooks

Matthias Falland’s Fabric Friday episode makes the case plainly: notebooks are great for development but risky in production. That framing resonates with a lot of production teams—and for good reason.

Here’s the nuance. Microsoft has said there’s no inherent difference in performance or monitoring capabilities between Spark Job Definitions and notebooks. Both produce Spark logs. Both run on the same compute. The gap isn’t in what the platform offers. It’s in what each artifact encourages.

Notebooks encourage improvisation. Someone edits a cell at 2 AM. Cell state carries between runs. An error gets swallowed inside an output cell and nobody notices until downstream tables go stale. That’s not a platform limitation. That’s a human-factors problem. And production environments are where human-factors problems become outages.

Spark Job Definitions push you toward cleaner habits. One file per job. No cell state. Explicit parameters. Better modularity. The execution boundary is sharper, and sharper boundaries make failures easier to diagnose.

If your team runs notebooks on a schedule through pipelines, here’s the migration:

Audit every notebook that runs on a schedule or gets triggered by a pipeline. Count them. You’ll be surprised.
Extract the transformation logic into standalone Python or Scala files. One file per job. No magic. No “run all cells.”
Create Spark Job Definitions for each. Map your existing notebook parameters to SJD parameters. They work the same way—just without the cell baggage.
Wire them into your pipeline activities. Replace the notebook activity with an SJD activity. The orchestration stays identical.
Keep the notebooks for development and ad-hoc exploration. That’s where they shine.

A team of three can typically convert a dozen notebooks in a week. The hard part isn’t the migration. It’s the decision to start.

Signal 2: Direct Lake changes how you write to your lakehouse

Pallavi Routaray’s post on Direct Lake architecture is the most consequential piece in the whole spotlight. Easy to miss because the title sounds like a Power BI topic.

It’s not. It’s a Spark topic.

Direct Lake mode reads Parquet files directly from OneLake. No import copy. No DirectQuery overhead. But it only works well if your Spark jobs write data in a way that Direct Lake can consume efficiently. Get the file layout wrong and your semantic model falls back to DirectQuery silently. Performance craters. Your BI team blames you. Nobody knows why.

Here’s the production checklist:

Enable V-Order optimization on your Delta tables. V-Order sorts and compresses Parquet files for Direct Lake’s columnar read path. Here’s the catch: V-Order is disabled by default in new Fabric workspaces, optimized for write-heavy data engineering workloads. If your workspace was created recently, you need to enable it explicitly. Check your workspace settings—or set it at the table property level. Don’t assume it’s on.
Control your file sizes. Microsoft’s guidance is clear: keep the number of Parquet files small and use large row groups. If your Spark jobs produce thousands of tiny files, Direct Lake will hit its file-count limits and fall back. Run OPTIMIZE on your Delta tables after write operations. Compact aggressively.
Partition deliberately. Over-partitioning creates too many small files. Under-partitioning creates files that are too large for efficient column pruning. Partition by the grain your BI team actually filters on. Ask them. Don’t guess.
Watch for schema drift. Direct Lake models bind to specific columns at creation time. If your Spark job adds or renames a column, the semantic model breaks. Coordinate schema changes explicitly. No silent ALTER TABLE commands on Friday afternoons.

The big risk here: most Spark teams don’t know their output feeds a Direct Lake model. The BI team built it after the fact. Start by mapping which of your Delta tables have Direct Lake semantic models sitting on top. If you don’t know, find out today.

Signal 3: CI/CD for Fabric just got real

Kevin Chant’s post covers the fabric-cicd tool reaching general availability for configuration-based deployments with Azure DevOps. This is verified and it matters more than it sounds.

Until now, deploying Fabric artifacts across environments—dev, test, prod—was either manual or held together with custom scripts that broke every time the API changed. The fabric-cicd tool gives you a supported, versioned path.

For Spark teams:

Your Spark Job Definitions, lakehouse configurations, and pipeline definitions can live in source control and deploy through a proper pipeline. No more “I’ll just update it in the portal.”
Configuration differences between environments—connection strings, capacity settings, lakehouse names—get handled through configuration files. Not by editing items in the portal after deployment.
You can roll back. You can diff. You can review before promoting to production. The basic hygiene that every other engineering discipline has had for decades.

Here’s the migration path:

Install fabric-cicd from the latest release. Follow Chant’s posts for the Azure DevOps YAML pipeline specifics.
Export your existing workspace items to a Git repository. Fabric’s Git integration handles this natively.
Build your environment-specific configuration files. One per environment. Map the items that differ: capacity, lakehouse, connections.
Set up your Azure DevOps pipeline to run fabric-cicd on merge to main. Start with dry-run mode until you trust it.
Remove portal-level edit access for production workspaces. This is the hard step. It’s also the one that prevents the next outage.

The deeper pattern

These three signals connect. Falland tells you to move your Spark code into artifacts built for production discipline. Routaray tells you how to write your output so downstream models don’t silently degrade. Chant tells you how to deploy the whole thing reliably across environments.

That’s a production pipeline. End to end. Code that runs cleanly, writes data correctly, and deploys safely.

The February spotlight also includes Open Mirroring hands-on guidance from Inturi Suparna Babu and a Fabric Data Agent walkthrough from Shubham Rai. Both are worth a read if you’re evaluating data replication strategies or AI-assisted query patterns over your lakehouse. But for Spark teams running production workloads, the three signals above are where the action is.

Your rollout checklist for March

Inventory all scheduled notebooks. Tag them by risk: frequency, data volume, downstream dependencies.
Convert the highest-risk notebook to a Spark Job Definition this week. Validate it runs identically.
Audit Delta table write patterns for any table that feeds a Direct Lake model. Check that V-Order is enabled. Run OPTIMIZE to compact files.
Install fabric-cicd. Connect your workspace to Git. Build your first environment config.
Pick one pipeline to deploy through CI/CD end-to-end. Prove it works before scaling.

Five items. All concrete. All doable in March.

The community did the research. Your job is to act on it.

This post was written with help from anthropic/claude-opus-4-6

Keeping Spark, OneLake, and Mirroring Reliable in Microsoft Fabric

The alert fired at 2:14 AM on a Tuesday. A downstream Power BI report had gone stale — the Direct Lake dataset hadn’t refreshed in six hours. The on-call engineer opened the Fabric monitoring hub and found a cascade: three Spark notebooks had completed without triggering downstream freshness checks, a mirrored database was five hours behind, and the OneLake shortcut connecting them was returning intermittent 403 errors. It went undetected until a VP’s morning dashboard showed yesterday’s numbers.

That scenario is stressful, but it’s also solvable. These issues are usually about observability gaps between services, not broken fundamentals. If you’re running Spark workloads against OneLake with mirroring in Microsoft Fabric, you’ll likely encounter some version of this under real load. The key is having an operational playbook before it happens.

What follows is that playbook — assembled from documented production incidents, community post-mortems, and repeatable operating patterns from teams running this architecture at scale.

How Spark, OneLake, and mirroring connect (and where they don’t)

The dependency chain matters because issues can cascade through it in non-obvious ways.

Your Spark notebooks write Delta tables to OneLake lakehouses. Those tables might feed Direct Lake datasets in Power BI. Separately, Mirroring can replicate data from external sources — Azure SQL Database, Cosmos DB, Snowflake, and others — into OneLake as Delta tables. Shortcuts bridge lakehouses or reference external storage.

What makes this operationally nuanced: each layer has its own retry logic, auth tokens, and completion semantics. A Spark job can succeed from its own perspective (exit code 0, no exceptions) while the data it wrote is temporarily unavailable to downstream consumers because of a metadata sync delay. Mirroring can pause during source throttling and may not raise an immediate alert unless you monitor freshness directly. Shortcuts can go stale when target workspace permissions change.

You can end up with green pipelines and incomplete data. The gap between “the job ran” and “the data arrived correctly” is where most reliability work lives.

Detection signals you actually need

The first mistake teams make is relying on Spark job status alone. A job that completes successfully but writes zero rows, hits an unmonitored schema drift, or writes to the wrong partition is still a data quality issue.

Here’s what to watch instead:

Row count deltas. After every notebook run, compare the target table’s row count against expected intake. It doesn’t need to be exact — a threshold works. If the delta table grew by less than 10% of its average daily volume, fire a warning. Three lines of Spark SQL at the end of your notebook. Five minutes to implement. It prevents empty-table surprises at 9 AM.

OneLake file freshness. The _delta_log folder in your lakehouse tables contains JSON commit files with timestamps. If the most recent commit is older than your pipeline cadence plus a reasonable buffer, investigate. A lightweight monitoring notebook that scans these timestamps across key tables takes about twenty minutes to build.

Mirroring lag via canary rows. The monitoring hub shows mirroring status, but the granularity is coarse. For external databases, set up a canary: a table in your source that gets a timestamp updated every five minutes. Check that timestamp on the OneLake side. If the gap exceeds your SLA, you know mirroring is stalled before your users do.

Shortcut health checks. Shortcuts can degrade quietly when no direct check exists. A daily job that reads a single row from each shortcut target and validates the response catches broken permissions, expired SAS tokens, and misconfigured workspace references before they cause real damage.

Failure mode 1: the Spark write that succeeds but isn’t queryable yet

You’ll see this in Fabric notebook logs as a clean run. The Spark job processed data, performed transformations, called df.write.format("delta").mode("overwrite").save(). Exit code 0. But the data isn’t queryable from the SQL analytics endpoint, and Direct Lake still shows stale numbers.

What happened: the SQL analytics endpoint runs a separate metadata sync process that detects changes committed to lakehouse Delta tables. According to Microsoft’s documentation, under normal conditions this lag is less than one minute. But it can occasionally fall behind — sometimes significantly. The Fabric community has documented sync delays stretching to hours, particularly during periods of high platform load or when tables have large numbers of partition files.

This is the gap that catches teams off guard. The Delta commit landed in storage, but the SQL endpoint hasn’t picked it up yet.

Triage sequence:

Open the lakehouse in Fabric and check the table directly via the lakehouse explorer. If the data appears there but not in the SQL endpoint, you’ve confirmed a metadata sync lag.
Check Fabric capacity metrics. If your capacity is throttled (visible in the admin portal under capacity management), metadata sync can be deprioritized. Burst workloads earlier in the day can surface as sync delays later.
Force a manual sync. In the SQL analytics endpoint, select “Sync” from the table context menu. You can also trigger this programmatically — Microsoft released a Refresh SQL Analytics Endpoint Metadata REST API (preview as of mid-2025), and it’s also available through the semantic-link-labs Python package.

Remediation: Add a post-write validation step to your notebooks. After writing the Delta table, wait 30 seconds, then query the SQL analytics endpoint for the max timestamp or row count. If it doesn’t match what you wrote, log a warning and retry the sync. If after three retries it still diverges, fail the pipeline explicitly so your alerting catches it. Don’t let a successful Spark job mask a downstream data gap.

Failure mode 2: mirroring goes quiet

Mirroring is genuinely useful for getting external data into OneLake without building custom pipelines. But one common reliability pattern is that replication can stall when the source system throttles or times out, and the monitoring hub may still show “Running” while data freshness drifts.

This pattern is often observed with Azure SQL Database sources during heavy read periods. The mirroring process opens change tracking connections that compete with production queries. When the source database gets busy, it can throttle the mirroring connection, and Fabric retry logic may back off for extended periods without immediately surfacing a hard error.

Triage sequence:

Check mirroring status in the monitoring hub, but prioritize the “Last synced” timestamp over the status icon. “Running” with a last-sync time of four hours ago still indicates a problem.
Check the source database’s connection metrics. If you’re mirroring from Azure SQL, look at DTU consumption and connection counts around the time mirroring lag increased. There’s often a correlation with a batch job or reporting burst.
Inspect table-level mirroring status. Individual tables can fall behind while others sync normally. The monitoring hub aggregates this, which can hide partial lag.

Remediation: The canary-row pattern is your early warning system. For prevention, stagger heavy source-database workloads away from mirroring windows. If your Azure SQL is Standard tier, increasing DTU capacity or moving to Hyperscale gives mirroring more room. On the Fabric side, stopping and restarting mirroring resets the connection and forces a re-sync when retry backoff has become too aggressive.

Failure mode 3: shortcut permissions drift

Shortcuts are the connective tissue of OneLake — references across lakehouses, workspaces, and external storage without copying data. They deliver huge flexibility, but they benefit from explicit permission and token hygiene.

A common failure pattern: a shortcut that worked for months suddenly returns 403 errors or empty results. Spark notebooks that read from the shortcut either fail with ADLS errors or complete with zero rows if downstream checks aren’t strict.

Root causes, ranked by observed frequency in the field:

A workspace admin changed role assignments, and the identity the shortcut was created under lost access. Usually accidental.
For ADLS Gen2 shortcuts: the SAS token expired, or storage account firewall rules changed.
Cross-tenant shortcuts relying on Entra ID B2B guest access. If guest policy changes on either tenant, shortcuts can break without a prominent Fabric notification.

Triage sequence:

Open the shortcut definition in the lakehouse — Fabric shows a warning icon on broken shortcuts, but only in the lakehouse explorer.
Test the shortcut target independently. Can you access the target lakehouse or storage account directly with the same identity? If not, it’s a permissions issue.
For ADLS shortcuts, check storage account access logs in Azure Monitor. Look for 403 responses from Fabric service IP ranges.

Remediation: Use service principals with dedicated Fabric permissions rather than user identities for shortcuts. Set up a token rotation calendar with 30-day overlap between old and new tokens so you’re never caught by a hard expiration. Then keep a daily shortcut health-check job that reads one row from each shortcut target and validates expected row counts.

Failure mode 4: capacity throttling disguised as five different problems

This one is tricky because it can look like unrelated issues at once. Spark jobs slow down. Metadata syncs lag. Mirroring falls behind. SQL endpoint queries time out. Power BI reports go stale. Troubleshoot each symptom in isolation and you’ll end up looping.

The common thread: your Fabric capacity hit its compute limits and started throttling. Fabric uses a bursting and smoothing model — you can temporarily exceed your purchased capacity units, but that overuse gets smoothed across future time windows. The system recovers by throttling subsequent operations. A heavy Spark job at 10 AM can degrade Power BI performance at 3 PM unless capacity planning accounts for that delayed impact.

Triage sequence:

Open the capacity admin portal and look at the CU consumption graph. Sustained usage above 100% followed by throttling bands is your signal.
Identify top CU consumers. Spark notebooks and materialization operations (Direct Lake refreshes, semantic model processing) tend to be the heaviest. Capacity metrics break this down by workload type.
Check the throttling policy and current throttling state. Fabric throttles interactive workloads first when background usage exceeds limits — meaning end users feel pain from batch jobs they never see.

Remediation: Separate workloads by time window. Push heavy Spark processing to off-peak hours. If you can’t shift the schedule, split workloads across multiple capacities — batch on one, interactive analytics on another. Set CU consumption alerts at 80% of capacity so you get warning before throttling starts.

For bursty Spark demand, also evaluate Spark Autoscale Billing. In the current Fabric model, Autoscale Billing is opt-in per capacity and runs Spark on pay-as-you-go serverless compute, so Spark jobs don’t consume your fixed Fabric CU pool. That makes it a strong option for ad-hoc spikes or unpredictable processing windows where manual SKU up/down management is too slow.

If your workload is predictable, pre-scaling SKU windows (for example, F32 to F64 before a known processing block) can still be effective — just manage cost guardrails and rollback timing tightly.

Assembling the runbook

A playbook works only if it’s accessible and actionable when the alert fires at 2 AM. Here’s how to structure it:

Tier 1 — automated checks (every pipeline cycle):
– Post-write row count validation in every Spark notebook
– Canary row freshness for every mirrored source
– _delta_log timestamp scan across key tables

Tier 2 — daily health checks (scheduled monitoring job):
– Shortcut validation: read one row from every shortcut target
– Capacity CU trending: alert if 7-day rolling average exceeds 70%
– Mirroring table-level lag report (not just aggregate status)

Tier 3 — incident response (when alerts fire):
– Start with capacity metrics. If throttling is active, it’s often the shared root cause behind multi-symptom incidents.
– Check mirroring “Last synced” timestamps. Don’t rely on status icons alone.
– For Spark write issues, verify SQL endpoint sync state independently from the Delta table itself.
– For shortcut errors, test target identity access directly outside of Fabric.

Fabric gives you powerful primitives: Spark at scale, OneLake as a unified data layer, and mirroring that removes a lot of custom ingestion plumbing. With cross-service monitoring and a practical runbook, these patterns become manageable operational events instead of recurring surprises.

This post was written with help from anthropic/claude-opus-4-6

The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

There’s a connector that ships with every Fabric Spark runtime. It’s pre-installed. It requires no setup. And it lets your Spark notebooks read from—and write to—Fabric Data Warehouse tables as naturally as they read Delta tables from a Lakehouse.

Most Fabric Spark users don’t know it exists. The ones who do often run into the same three or four surprises. Let’s fix both problems.

What the connector actually is

The Spark connector for Fabric Data Warehouse (synapsesql) is a built-in extension to the Spark DataFrame API. It uses the TDS protocol to talk directly to the SQL engine behind your Warehouse (or the SQL analytics endpoint of a Lakehouse). You get read and write access to Warehouse tables from PySpark, Scala Spark, or Spark SQL.

One line of code to read:

from com.microsoft.spark.fabric.Constants import Constants  df = spark.read.synapsesql("my_warehouse.dbo.sales_fact")

One line to write:

df.write.mode("append").synapsesql("my_warehouse.dbo.sales_fact")

No connection strings. No passwords. No JDBC driver management. Authentication flows through Microsoft Entra—same identity you’re logged into your Fabric workspace with. The connector resolves the SQL endpoint automatically based on workspace context.

That’s the happy path. Now let’s talk about what actually happens when you use it.

Reading: the part that mostly just works

Reading from a Warehouse table into a Spark DataFrame is the connector’s strength. The synapsesql() call supports the full three-part naming convention: warehouse_name.schema_name.table_or_view_name. It works for tables and views, including views with joins across schemas.

A few things that are genuinely useful:

Predicate pushdown works. When you chain .filter() or .limit() onto your DataFrame, the connector pushes those constraints to the SQL engine. You’re not pulling the full table into Spark memory and then filtering—the SQL engine handles the filter and sends back the subset. This matters when your Warehouse tables have hundreds of millions of rows and you only need a time-sliced sample.

df = spark.read.synapsesql("my_warehouse.dbo.sales_fact") \     .filter("order_date >= '2026-01-01'") \     .select("order_id", "customer_id", "amount")

Cross-workspace reads work. If your Warehouse lives in a different workspace than your notebook’s attached Lakehouse, you pass the workspace ID:

df = spark.read \     .option(Constants.WorkspaceId, "<target-workspace-id>") \     .option(Constants.DatawarehouseId, "<warehouse-item-id>") \     .synapsesql("my_warehouse.dbo.sales_fact")

This is genuinely powerful for hub-and-spoke architectures where your curated Warehouse sits in a production workspace and your data science notebooks live in a sandbox workspace.

Parallel reads are available. For large tables, you can partition the read across multiple Spark tasks, similar to spark.read.jdbc:

df = spark.read \     .option("partitionColumn", "order_id") \     .option("lowerBound", 1) \     .option("upperBound", 10000000) \     .option("numPartitions", 8) \     .synapsesql("my_warehouse.dbo.sales_fact")

This splits the query into eight parallel reads, each fetching a range of order_id. Without this, you get a single-threaded read that will bottleneck on large tables.

Security models pass through. If your Warehouse has object-level security (OLS), row-level security (RLS), or column-level security (CLS), those policies are enforced when Spark reads the data. Your notebook sees exactly what your identity is authorized to see. This is a meaningful difference from reading Delta files directly via OneLake, where security operates at the workspace or folder level.

Custom T-SQL queries work too. You’re not limited to reading tables—you can pass arbitrary T-SQL:

df = spark.read \     .option(Constants.DatabaseName, "my_warehouse") \     .synapsesql("SELECT TOP 1000 * FROM dbo.sales_fact WHERE region = 'WEST'")

This is handy for complex aggregations or when you want the SQL engine to do the heavy lifting before data enters Spark.

Writing: the part with surprises

Write support for the Spark-to-Warehouse connector became generally available with Runtime 1.3. It works, and it solves a real architectural problem—but it has mechanics you need to understand.

How writes actually work under the hood

The connector uses a two-phase process:

Stage: Spark writes your DataFrame to intermediate Parquet files in a staging location.
Load: The connector issues a COPY INTO command, telling the Warehouse SQL engine to ingest from the staged files.

This is the same COPY INTO pattern that powers bulk ingestion into Fabric Data Warehouse generally. It’s optimized for throughput. It is not optimized for latency on small writes.

If you’re writing a DataFrame with 50 rows, the overhead of staging files and issuing COPY INTO means the write takes materially longer than you’d expect. For small, frequent writes, this connector is not the right tool. Use T-SQL INSERT statements through a SQL connection instead.

For batch writes of thousands to millions of rows, the connector performs well. The COPY INTO path is what the Warehouse was designed for.

Save modes

The connector supports four save modes:

errorifexists (default): Fails if the table already exists.
ignore: Silently skips the write if the table exists.
overwrite: Drops and recreates the table with new data.
append: Adds rows to the existing table.

df.write.mode("overwrite").synapsesql("my_warehouse.dbo.daily_aggregates")

A common pattern: Spark computes daily aggregations from Lakehouse Delta tables, then writes the results to a Warehouse table that Power BI reports connect to. The Warehouse’s result set caching (now generally available as of January 2026) means subsequent Power BI refreshes hit cache instead of recomputing.

The timestamp_ntz gotcha

This is the single most common error people hit when writing to a Warehouse from Spark.

If your DataFrame contains timestamp_ntz (timestamp without time zone) columns, the write will fail. Fabric Data Warehouse expects time-zone-aware timestamps. The fix is a cast before you write:

from pyspark.sql.functions import col  for c in df.columns:     if dict(df.dtypes)[c] == "timestamp_ntz":         df = df.withColumn(c, col(c).cast("timestamp"))  df.write.mode("append").synapsesql("my_warehouse.dbo.target_table")

This is not documented prominently enough. If you see a Py4JJavaError during write that mentions type conversion, timestamps are the first thing to check.

What you can’t write to

The connector writes to Warehouse tables only. You cannot write to the SQL analytics endpoint of a Lakehouse—it’s read-only. If you try, you’ll get an error. This seems obvious but trips people up because the same synapsesql() method handles both reads from Warehouses and Lakehouse SQL endpoints.

Private Link limitations

If Private Link is enabled at the workspace level, both read and write operations through the connector are unsupported. If Private Link is enabled at the tenant level only, writes are unsupported but reads still work. This is a significant limitation for security-conscious deployments. Check your network configuration before building pipelines that depend on this connector.

Time Travel is not supported

Fabric Data Warehouse now supports Time Travel queries. However, the Spark connector does not pass through Time Travel syntax. If you need to query a table as of a specific point in time, you’ll need to use a T-SQL connection directly rather than the synapsesql() method.

When to use Warehouse vs. Lakehouse as your serving layer

This is the architectural question that the connector’s existence forces you to answer. You’ve got data in your Lakehouse. Spark has transformed it. Now where does it go?

Use Lakehouse Delta tables when:

Your consumers are other Spark notebooks or Spark-based ML pipelines.
You need schema evolution flexibility (Delta’s schema merge).
You want to use OPTIMIZE, VACUUM, and Z-ORDER for table maintenance.
Your data scientists need direct file access through OneLake APIs.

Use Warehouse tables when:

Your primary consumers are Power BI reports or T-SQL analysts.
You need the Warehouse’s result set caching for repeated query patterns.
You need fine-grained security (RLS, CLS, OLS) that passes through to all consumers.
You want to use T-SQL stored procedures, views, and MERGE statements for downstream transformations.
You need cross-database queries that join Warehouse tables with Lakehouse tables or other Warehouse tables.

Use both when:

Spark processes and stores data in the Lakehouse (bronze → silver → gold medallion layers), then the connector writes final aggregations or serving tables to the Warehouse.
The Warehouse serves as the “last mile” between your data engineering work and your business intelligence layer.

The January 2026 GA of MERGE in Fabric Data Warehouse makes the “write to Warehouse” pattern significantly more useful. You can now do incremental upserts: Spark writes a staging table, then a T-SQL MERGE reconciles it with the target. This is a common pattern in data warehousing that was previously awkward in Fabric.

A concrete pattern: Spark ETL → Warehouse serving layer

Here’s the pattern I see working well in production:

# 1. Read from Lakehouse Delta tables (Spark native) bronze = spark.read.format("delta").load("Tables/raw_orders")  # 2. Transform in Spark silver = bronze.filter(col("status") != "cancelled") \     .withColumn("order_date", col("order_ts").cast("date")) \     .withColumn("amount_usd", col("amount") * col("fx_rate"))  gold = silver.groupBy("region", "order_date") \     .agg(         count("order_id").alias("order_count"),         sum("amount_usd").alias("total_revenue")     )  # 3. Write to Warehouse for Power BI consumption gold.write.mode("overwrite").synapsesql("analytics_warehouse.dbo.daily_revenue")

The Lakehouse owns the raw and transformed data. Spark does the heavy compute. The Warehouse serves the final tables to downstream consumers with T-SQL access, caching, and fine-grained security.

The alternative—writing gold tables to the Lakehouse and having Power BI connect via the SQL analytics endpoint—also works. But the SQL analytics endpoint has a metadata sync delay after Spark writes new data. The Warehouse table is immediately consistent after the COPY INTO completes. If your reporting needs to reflect the latest pipeline run without a sync lag, the Warehouse path is more reliable.

Cross-database queries: the glue between them

Once you have data in both a Lakehouse and a Warehouse in the same workspace, you can query across them using T-SQL cross-database queries from the Warehouse:

SELECT w.customer_id, w.total_revenue, l.customer_segment FROM analytics_warehouse.dbo.daily_revenue AS w JOIN my_lakehouse.dbo.customer_dim AS l     ON w.customer_id = l.customer_id

This means your Warehouse doesn’t need to contain all the data. It can hold the curated aggregations while joining against dimension tables that live in the Lakehouse. No data movement. No duplication. The SQL engine resolves both sources through OneLake.

Performance notes from the field

A few observations from real workloads:

Reads are faster than you expect. The TDS protocol connection to the Warehouse SQL engine is efficient. For typical analytical queries returning thousands to low millions of rows, the synapsesql() read is competitive with reading Delta files directly, especially when the Warehouse has statistics and result set caching enabled.

Writes are slower than Lakehouse writes. The two-phase staging + COPY INTO process adds overhead versus a direct df.write.format("delta").save() to Lakehouse tables. For a DataFrame with 10 million rows, expect the Warehouse write to take 2-5x longer than an equivalent Lakehouse Delta write. This is the tradeoff for getting immediate T-SQL access with full Warehouse capabilities.

Use parallel reads for large tables. The default single-partition read will bottleneck. Set numPartitions to match your Spark cluster’s available cores for large reads. The performance improvement is often 4-8x.

Proactive and incremental statistics refresh. As of January 2026, Fabric Data Warehouse supports proactive statistics refresh and incremental statistics. This means the query optimizer keeps statistics up to date automatically. Your synapsesql() reads benefit from better query plans without manual UPDATE STATISTICS calls.

The honest summary

The Spark connector for Fabric Data Warehouse is a well-designed bridge between two systems that many teams use side by side. It makes the read path simple and the write path possible without leaving your Spark notebook.

It is not a replacement for writing to Lakehouse Delta tables. It is an additional output path for when your downstream consumers need T-SQL, fine-grained security, result set caching, or immediate consistency. Use it when the Warehouse is the right serving layer. Don’t use it when Lakehouse is sufficient.

The biggest wins come from combining both: Spark for compute, Lakehouse for storage, Warehouse for serving. The connector is the plumbing that makes that architecture work without data pipelines in between.

If you’re heading to FabCon Atlanta (March 16-20, 2026), both the Data Warehouse and Data Engineering teams will be there. It’s a good place to pressure-test your architecture and see what’s coming next.

This post was written with help from anthropic/claude-opus-4-6

Microsoft Fabric Warehouse + Spark: Interoperability Patterns That Actually Work

If you’ve spent any time in a Fabric workspace with both Data Engineering (Spark) and Data Warehouse, you’ve probably had this moment:

Spark is great for big transformations, complex parsing, and “just let me code it.”
The Warehouse is great for a curated SQL model, concurrency, and giving the BI world a stable contract.
And yet… teams still end up copying data around like they’re paid by the duplicate.

The good news: Fabric’s architectural bet is that OneLake + Delta is the contract surface across engines. That means you can design a pipeline where Spark and Warehouse cooperate instead of competing.

This post is a practical field guide to the integration patterns that work well in real projects:

3-part naming over the SQL endpoint (zero-copy default) – query Lakehouse Delta tables directly from Warehouse SQL without moving data.
Spark → Warehouse (file-based ingest) using COPY INTO and OPENROWSET over OneLake paths – when workload evidence calls for materialization.
Spark → Warehouse (table-to-table ingest) using cross-database queries / CTAS / INSERT…SELECT – same trigger.
Warehouse → Spark (read-only consumption) by reading the Warehouse table’s published Delta logs from Spark.

Along the way, I’ll call out the trade-offs, the gotchas, and the operational guardrails that keep teams out of trouble.

Mental model: OneLake is the handshake

In Fabric, multiple experiences can produce and consume Delta Lake tables. Microsoft Learn describes Delta Lake as the standard analytics table format in Fabric, and notes that Delta tables produced by one engine (including Fabric Data Warehouse and Spark) can be consumed by other engines.

So instead of thinking “Spark output” and “Warehouse tables” as two unrelated worlds, treat them as:

A shared storage plane (OneLake)
An open table format (Delta + Parquet)
Two compute engines with different strengths

The rest is just choosing where to materialize — or whether to materialize at all.

Start here: 3-Part Naming over the SQL Endpoint

Before you copy anything, ask: do I actually need a separate materialized table?

Fabric’s SQL analytics endpoint automatically exposes every Lakehouse Delta table as a queryable SQL object. From the Warehouse, you can reference those tables directly using 3-part naming:

SELECT * FROM MyLakehouse.dbo.clean_sales WHERE OrderDate >= '2026-01-01';

No COPY INTO. No CTAS. No duplicate storage. The query runs against the Lakehouse’s Delta files through the SQL endpoint — zero-copy interoperability out of the box.

When this is enough (and it often is)

Ad-hoc analytics and exploration across Spark-produced datasets.
Lightweight joins between Warehouse dimensions and Lakehouse facts.
BI semantic models that don’t need sub-second concurrency at scale.
Early-stage projects where the workload profile isn’t settled yet.

When to materialize instead

Materialize into dedicated Warehouse tables (COPY INTO, CTAS, INSERT…SELECT) when workload evidence justifies it:

High concurrency: many concurrent queries hitting the same dataset consistently.
Recurring heavy joins/aggregations: repeated complex queries where pre-materialized tables measurably reduce compute.
Stricter SLA / CU predictability: when you need tighter control over query performance and capacity consumption.
Governance boundaries: when the Warehouse should own and version the serving-layer schema independently from the Lakehouse.

If none of those conditions apply, 3-part naming is the right default. You can always materialize later when the numbers say you should.

The CU tradeoff

Virtualization (3-part naming) shifts cost to query-time: every read traverses the SQL endpoint and pays CU at execution. Materialization (COPY INTO / CTAS) pays an ingestion and storage cost once, so repeated reads are faster and more predictable in CU terms. Neither is universally better — the right call depends on query frequency, data volume, and your capacity budget.

Pattern 1 — Spark → Warehouse via OneLake files (COPY INTO + OPENROWSET)

When to use it

Start with 3-part naming. Reach for COPY INTO / OPENROWSET file-based ingest only when workload evidence (sustained concurrency pressure, SLA requirements, or CU unpredictability) tells you virtualization isn’t enough. This pattern fits when:

Your Spark pipeline already produces files (Parquet/CSV/JSONL) under a Lakehouse Files path.
You need faster or more predictable query performance than the SQL endpoint provides for this dataset.
You want a clean separation: Spark writes files; Warehouse owns the serving tables.

Step 1: Write a “handoff” dataset from Spark

In Spark, write a handoff dataset into the Lakehouse Files area (not Tables). Conceptually:

(   df   .write   .mode("overwrite")   .format("parquet")   .save("Files/handoff/sales_daily/") )

Why Files? Because the Warehouse can point COPY INTO / OPENROWSET at file paths, and the Files area is designed to hold arbitrary file layouts.

Step 2: Inspect the file shape from the Warehouse (OPENROWSET)

Before you ingest, use OPENROWSET to browse a file (or a set of files) and confirm the schema is what you think it is.

Microsoft Learn documents that Fabric Warehouse OPENROWSET can read Parquet/CSV files, and that the files can be stored in Azure Blob Storage, ADLS, or Fabric OneLake (with OneLake reads called out as preview).

SELECT TOP 10 * FROM OPENROWSET(   BULK 'https://onelake.dfs.fabric.microsoft.com/<workspaceId>/<lakehouseId>/Files/handoff/sales_daily/*.parquet' ) AS rows;

Step 3: Ingest into a Warehouse table (COPY INTO)

The Fabric blog announcement for OneLake as a source for COPY INTO and OPENROWSET highlights the point of this feature: load and query Lakehouse file folders without external staging storage or SAS tokens.

COPY INTO dbo.SalesDaily FROM 'https://onelake.dfs.fabric.microsoft.com/<workspaceId>/<lakehouseId>/Files/handoff/sales_daily/' WITH (   FILE_TYPE = 'PARQUET' );

Operational guardrails

Treat the Files path as a handoff contract: version it, keep it predictable, and don’t “just drop random stuff in there.”
If you’ll query the same external data repeatedly, ingest it into a dedicated Warehouse table (Microsoft Learn notes repeated OPENROWSET access can be slower than querying a table).

Pattern 2 – Spark → Warehouse via in-workspace tables (CTAS / INSERT…SELECT)

When to use it

As with Pattern 1, start with 3-part naming and materialize via CTAS / INSERT…SELECT only when workload metrics confirm you need it. This pattern fits when:

Your Spark output is naturally a Delta table (Lakehouse Tables area) and 3-part naming queries against it hit concurrency or performance limits.
You want the Warehouse to own a curated serving-layer model (joins, dimensional modeling, computed columns) with predictable CU spend.
You prefer SQL-native table-to-table pipelines over file-level ingestion.

Step 1: Produce a curated Delta table with Spark

(   df_clean   .write   .mode("overwrite")   .format("delta")   .save("Tables/clean_sales") )

Step 2: Materialize a Warehouse table from the Lakehouse table

Microsoft Learn notes that for T-SQL ingestion, you can use patterns like INSERT…SELECT, SELECT INTO, or CREATE TABLE AS SELECT (CTAS) to create or update tables from other items in the same workspace (including lakehouses).

CREATE TABLE dbo.FactSales AS SELECT   OrderDate,   StoreId,   ProductId,   Quantity,   NetAmount FROM MyLakehouse.dbo.clean_sales;

For incremental loads you’ll often end up with a staging + merge strategy, but the key idea stays the same: Spark produces the curated dataset; the Warehouse owns the serving tables.

Pattern 3 – Warehouse → Spark via published Delta logs (read-only)

This is the pattern that surprises people (in a good way): the Warehouse isn’t a closed box.

Microsoft Learn documents that Warehouse user tables are stored in Parquet, and that Delta Lake logs are published for all user tables. The key consequence is that any engine that can read Delta tables can get direct access to Warehouse tables – read-only.

Step 1: Get the OneLake path for a Warehouse table

In the Warehouse UI, table Properties exposes the table’s URL / ABFS URI (Learn walks through the steps).

Step 2: Read the Warehouse table from Spark (read-only)

warehouse_table_path = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<warehouseId>/Tables/dbo/FactSales"  fact_sales_df = spark.read.format("delta").load(warehouse_table_path)

This access is read-only from Spark. Writes must go through the Warehouse to maintain ACID compliance.
Delta log publishing is a background process after commits, so treat cross-engine visibility as “near real-time,” not “every millisecond.”

Bonus control: pause Delta log publishing

The same Learn doc describes an operational lever you can use when you need stability during a large set of changes:

ALTER DATABASE CURRENT SET DATA_LAKE_LOG_PUBLISHING = PAUSED; -- ... bulk updates ... ALTER DATABASE CURRENT SET DATA_LAKE_LOG_PUBLISHING = AUTO;

When publishing is paused, other engines see the pre-pause snapshot; Warehouse queries still see the latest.

Choosing an ownership model (so you don’t end up with two sources of truth)

The integration is easy. The contract is the hard part.

A simple rule that prevents a lot of pain:

If Spark is writing it: Warehouse can ingest it, but Spark owns the dataset.
If Warehouse is writing it: Spark can read it, but Warehouse owns the dataset.

In other words: pick one writer.

For most analytics teams, a good default is:

Spark owns bronze/silver (raw + cleaned Delta in the Lakehouse)
Warehouse owns gold (facts/dimensions, KPI-ready serving tables) — but “owns” doesn’t always mean “physically copies.” A cross-database query via 3-part naming can serve gold-layer reads without materialization.

Start with 3-part naming for cross-engine reads. Materialize across the boundary only when workload metrics — not assumptions — tell you to. Remember: virtualization shifts CU cost to query-time; materialization front-loads ingestion and storage so repeated reads are cheaper and more predictable. Let your actual usage patterns decide.

Quick checklist: production-hardening the Spark ↔ Warehouse boundary

Make the handoff explicit (a specific Files path or a specific Lakehouse table).
Version your schema (breaking changes should be intentional and tested).
Avoid singleton inserts into Warehouse; prefer bulk patterns (CTAS, INSERT…SELECT).
Validate row counts and freshness after each load (and alert on drift).
Treat Delta log publishing as eventual across engines; design your BI/ML expectations accordingly.

Summary

Fabric is at its best when you let each engine do what it’s good at:

Spark for transformation, enrichment, and complex data engineering logic.
Warehouse for the curated serving model and SQL-first consumers.

OneLake + Delta is the glue. Start with 3-part naming for zero-copy interoperability across engines, and materialize only when workload evidence justifies the extra storage and ingestion cost. That way you get the simplicity of one logical data layer without paying for copies you don’t need.

This post was written with help from Opus 4.6

References

Delta Lake table format interoperability (Microsoft Learn)
Delta Lake Logs in Warehouse (Microsoft Learn)
SQL analytics endpoint of the Lakehouse (Microsoft Learn)
Cross-database querying in Fabric Warehouse (Microsoft Learn)
Ingest data into the Warehouse (Microsoft Learn)
Browse file content with OPENROWSET (Microsoft Learn)
OneLake as a source for COPY INTO and OPENROWSET (Preview) (Microsoft Fabric Blog)

What SQL database in Fabric actually means for your Spark pipelines

There is a particular kind of excitement that sweeps through data engineering teams when Microsoft announces a new database option. It is the same mixture of curiosity and low-grade dread you might feel upon learning that your neighborhood is getting a new highway interchange. Useful, probably. Disruptive, definitely. Someone is going to have to figure out the on-ramps.

SQL database in Fabric went generally available in November 2025. Built on the same SQL Database Engine that powers Azure SQL Database, it is the first fully SaaS-native operational database living inside Microsoft Fabric. More than 50,000 SQL databases were created during the preview period alone. If you spend your days writing Spark notebooks, building lakehouses, and tending ETL pipelines, this thing will change how you work whether you plan for it or not.

Here is what you need to know, what you should actually do about it, and where the potholes are hiding.

Your operational data now lands in OneLake automatically

The headline feature for Spark teams is automatic replication to OneLake. When data gets written to a SQL database in Fabric, it mirrors to OneLake as Delta tables in near real-time. No pipelines. No connectors. No orchestration jobs that fail silently at 2 AM and ruin your Monday.

This sounds almost too convenient, and in some ways it is. The mirrored Delta tables arrive in an open format your Spark notebooks can read directly. You point a DataFrame at the mirrored location, run your transformations, and push results to your gold layer without ever having written an ingestion pipeline for that source.

If your team currently runs nightly batch loads from Azure SQL or SQL Server into a lakehouse, this is a real shift. That entire category of extract-and-load work can shrink or vanish. But “can” is doing heavy lifting in that sentence, and we need to talk about why.

How this changes daily Spark development

The practical impact shows up in a few specific places.

Reading operational data gets simpler. Instead of maintaining JDBC connections, managing credential rotation, and writing Spark code to pull from SQL, you read Delta tables from OneLake. The data is already there. Your Spark cluster does not need network access to the SQL database itself. One fewer firewall rule, one fewer connection string in your key vault, one fewer thing that breaks when someone rotates a password on a Friday afternoon.

Schema changes arrive faster than you can react. With batch ETL, you had a buffer. The pipeline would fail, someone would get an alert, and you had time to adapt your downstream notebooks. Near real-time mirroring removes that cushion. A column rename or type change in the operational database shows up in your Delta tables within seconds to minutes. If your Spark jobs reference columns by name (they do), you need schema evolution handling that most teams have not built yet.

Think about what happens when a developer on the application side renames customer_id to cust_id on a Wednesday afternoon. Your batch pipeline would have failed that night, you would have caught it Thursday morning, and the fix would be a one-line column alias. With mirroring, your running Spark job gets a AnalysisException: cannot resolve 'customer_id' mid-stream. The fix is the same, but the timing is worse.

SQL users can now query your lakehouse data directly. SQL database in Fabric supports OPENROWSET and External Tables for querying OneLake data in CSV, Parquet, and JSON formats. Your SQL-writing colleagues can query lakehouse data without Spark. That sounds like a collaboration win until a SQL user runs a full table scan on your carefully partitioned Parquet files and you both learn something new about capacity throttling.

Establish clear ownership of shared datasets early. Document which OneLake paths are read-safe for SQL access and which ones carry performance risk.

The SQL Analytics Endpoint changes reporting paths. Every SQL database in Fabric gets a SQL Analytics Endpoint that sits on top of the mirrored data. Power BI can hit this endpoint with Direct Lake, which means your Spark team might no longer be in the critical path for building reporting datasets. If you have spent months building and maintaining a medallion architecture primarily to serve Power BI, parts of that effort become optional. Whether that feels like relief or irrelevance depends on your org chart.

Migration risks worth planning for

Before you start ripping out pipelines, here are the things that deserve a red flag on your project board.

Capacity billing is shared, and the math is unforgiving. SQL database in Fabric consumes the same Fabric capacity as your Spark jobs, warehouses, and Power BI refreshes. If someone provisions a heavily used SQL database on the same capacity where your Spark notebooks run, you will feel it. Fabric capacity is a zero-sum game. The new player at the table did not bring extra chips.

Run a two-week trial on a dedicated capacity before mixing SQL database workloads with existing Spark production. Use the Microsoft Fabric Capacity Metrics App to understand exactly how many CUs the database consumes at rest and under load.

Near real-time is not real-time, and the gap varies. The mirroring latency depends on transaction volume and capacity pressure. Under light load, changes appear in seconds. Under heavy load on a congested capacity, you might see minutes of lag. If your Spark pipelines assume data completeness at a specific watermark, you need to measure actual replication lag under realistic conditions. A simple row-count comparison between the SQL database and the mirrored Delta table, run every five minutes for a week, will tell you more than any documentation.

Security boundaries do not mirror perfectly. SQL database in Fabric supports Microsoft Entra authentication, row-level security, customer-managed keys, and SQL auditing (in preview). Your lakehouse uses OneLake RBAC, workspace roles, and Spark-level access controls. The mirrored data inherits some but not all of these boundaries. Row-level security in the SQL database, for instance, does not automatically apply to the mirrored Delta table in OneLake. If you have sensitive columns, verify the access controls on the mirror before your entire data team has read access.

Vendor lock-in compounds quietly. Every pipeline you remove and every JDBC connector you delete makes you more dependent on Fabric-internal mechanisms. If you later need to run Spark on Databricks, on EMR, or on bare-metal clusters, your data ingestion path disappears. This is not a reason to avoid the feature, but it is a reason to document what you replaced and keep a migration playbook somewhere that is not a Confluence page nobody remembers exists.

A rollout checklist for Spark teams

If you are ready to start integrating SQL database in Fabric into your data engineering stack, here is a practical sequence.

Inventory your SQL-sourced pipelines. List every Spark job that reads from Azure SQL, SQL Server, or any SQL-based source via JDBC, linked services, or copy activities. Note the refresh frequency, data volume, and downstream dependencies. If you cannot produce this list in under an hour, that is itself a useful finding.
Provision a SQL database in Fabric on a non-production capacity. Do not test this on production. Capacity contention is real, and you want to understand billing impact before it appears on someone else’s finance report.
Mirror a single non-critical table and validate. Pick a reference table, something small and stable. Confirm the Delta table lands in OneLake, check the schema, verify column types, and read it from a Spark notebook. Compare row counts and checksums against the source.
Measure replication lag under real load. Insert, update, and delete rows in the SQL database and time how quickly those changes appear in the mirrored Delta table. Run this test during your normal capacity utilization window, not during off-hours when capacity is idle and results are misleadingly fast.
Test schema evolution deliberately. Add a column. Rename a column. Change a data type. Observe what happens to the mirrored Delta table and to any Spark jobs reading it. Build your error handling before this surprises you in production.
Audit security boundaries on the mirror. Check whether row-level security, column masking, or other access controls in the SQL database are reflected in the mirrored OneLake data. Document gaps and decide whether they are acceptable for your data classification. If they are not, add a data masking step between the mirror and your Spark consumers.
Run a cost comparison over two weeks. Compare the Fabric capacity consumption of the SQL database plus mirroring against your current pipeline compute costs. Include the engineering time saved, but be honest. “We saved two hours a month of pipeline maintenance” is a real number. “We saved countless engineering hours” is not.
Deprecate one pipeline as a pilot. Pick your simplest SQL-sourced pipeline, redirect the downstream Spark job to read from the mirrored Delta table, and run both paths in parallel for at least two sprints. When you are confident, decommission the old pipeline and update your runbooks.

Vector search: a side door into AI workloads

SQL database in Fabric supports the native vector data type and vector indexing. This opens up retrieval-augmented generation (RAG) patterns directly inside the database, without adding a separate vector store to your architecture.

For Spark teams building ML pipelines or feeding large language models, the value is in co-location. You can store embeddings alongside your operational data, run similarity searches in SQL, and then access the same data from Spark for model training or batch inference. A product catalog with embeddings stored as vectors in SQL can serve both a real-time search API and a nightly Spark training job without data duplication.

This will not replace Pinecone or Weaviate for teams running high-throughput similarity search at scale. But for teams running modest-scale RAG or semantic search against operational data, it removes one service from the architecture and one deployment from the on-call rotation. That is not nothing.

What to expect next

Microsoft has made it clear that SQL database in Fabric is part of a longer play to bring operational data fully into the Fabric ecosystem. The integration with Copilot in the Query Editor, support for Terraform and Fabric CLI automation, and the first-ever SQLCon conference co-located with FabCon Atlanta in March 2026 all point the same direction: the wall between transactional and analytical workloads is getting thinner.

For Spark data engineering teams, the right move is not to panic and rewrite everything. It is to understand the mechanics, run a controlled test, and make deliberate decisions about which pipelines to retire and which to keep. The highway interchange is open. You just need to figure out your on-ramp.

This post was written with help from Opus 4.6

The Best Thing That Ever Happened to Your Spark Pipeline Is a SQL Database

Here’s a counterintuitive claim: the most important announcement for Fabric Spark teams in early 2026 has nothing to do with Spark.

It’s a SQL database.

Specifically, it’s the rapid adoption of SQL database in Microsoft Fabric—a fully managed, SaaS-native transactional database that went GA in November 2025 and has been quietly reshaping how production data flows into lakehouse architectures ever since. If you’re a data engineer running Spark workloads in Fabric, this changes more than you think.

The ETL Pipeline You Can Delete

Most Spark data engineers have a familiar pain point: getting operational data from transactional systems into the lakehouse. You build ingestion pipelines. You schedule nightly batch loads. You wrestle with CDC (change data capture) configurations, watermark columns, and retry logic. You maintain all of it, forever.

SQL database in Fabric eliminates that entire layer.

When data lands in a Fabric SQL database, it’s automatically replicated to OneLake as Delta tables in near real-time. No pipelines. No Spark ingestion jobs. No orchestration. The data just appears, already in the open Delta format your notebooks and Spark jobs expect.

This isn’t a minor convenience—it’s an architectural shift. Every ingestion pipeline you don’t write is a pipeline you don’t debug at 2 AM.

What This Actually Looks Like in Practice

Let’s say you’re building an analytics layer on top of an operational SaaS application. Today, your architecture probably looks something like this:

Application writes to Azure SQL or Cosmos DB
ADF or Spark job pulls data on a schedule
Data lands in a lakehouse as Delta tables
Downstream Spark jobs transform and aggregate

With SQL database in Fabric, steps 2 and 3 vanish. Your application writes directly to the Fabric SQL database, and the mirrored Delta tables are immediately available for Spark processing. Here’s what your downstream notebook looks like now:

# Read operational data directly — no ingestion pipeline needed # The SQL database auto-mirrors to OneLake as Delta tables orders_df = spark.read.format("delta").load(     "abfss://your-workspace@onelake.dfs.fabric.microsoft.com/your-sqldb.SQLDatabase/dbo.Orders" )  # Your transformation logic stays the same from pyspark.sql import functions as F  daily_revenue = (     orders_df     .filter(F.col("order_date") >= F.date_sub(F.current_date(), 7))     .groupBy("product_category")     .agg(         F.sum("total_amount").alias("revenue"),         F.countDistinct("customer_id").alias("unique_customers")     )     .orderBy(F.desc("revenue")) )  daily_revenue.write.format("delta").mode("overwrite").saveAsTable("gold.weekly_revenue_by_category")

The Spark code doesn’t change. What changes is everything upstream of it.

The Migration Risk Nobody’s Talking About

Here’s where it gets interesting—and where Malcolm Gladwell would lean forward in his chair. The biggest risk of SQL database in Fabric isn’t technical. It’s organizational.

Teams that have invested heavily in ingestion infrastructure will face a classic innovator’s dilemma: the new path is simpler, but the old path already works. The temptation is to keep running your existing ADF pipelines alongside the new auto-mirroring capability, creating a hybrid architecture that’s worse than either approach alone.

My recommendation: don’t hybrid. Pick a workload, migrate it end-to-end, and measure. Here’s a concrete rollout checklist:

Identify a candidate workload — Look for Spark jobs whose primary purpose is pulling data from a SQL source into Delta tables. These are your highest-value migration targets.
Provision a Fabric SQL database — It takes seconds. You provide a name; Fabric handles the rest. Autoscaling and auto-pause are built in.
Redirect your application writes — Point your operational application to the new Fabric SQL database. The engine is the same SQL Database Engine as Azure SQL, so T-SQL compatibility is high.
Validate the Delta mirror — Confirm that your data is appearing in OneLake. Check schema fidelity, latency, and row counts:

# In your Spark notebook, validate the mirrored data spark.sql("""     SELECT COUNT(*) as row_count,            MAX(modified_date) as latest_record,            MIN(modified_date) as earliest_record     FROM your_sqldb.dbo.Orders """).show()

Decommission the ingestion pipeline — Once validated, turn off the ADF or Spark ingestion job. Don’t just disable it—delete it. Zombie pipelines are how technical debt accumulates.
Update your monitoring — Your existing data quality checks should still work since the Delta tables have the same schema. But update your alerting to watch for mirror latency instead of pipeline run failures.

The AI Angle Matters for Spark Teams Too

There’s a second dimension to this announcement that Spark engineers should pay attention to: the native vector data type in SQL database supports semantic search and RAG patterns directly in the transactional layer.

Why does that matter for Spark teams? Because it means your embedding pipelines can write vectors back to the same database your application reads from—closing the loop between batch ML processing in Spark and real-time serving in SQL. Instead of maintaining a separate vector store (Pinecone, Qdrant, etc.), you use the same SQL database that’s already mirrored into your lakehouse.

This is the kind of architectural simplification that compounds over time. Fewer systems means fewer failure modes, fewer credentials to manage, and fewer things to explain to your successor.

The Rollout Checklist

This week: Inventory your existing ingestion pipelines. How many just move data from SQL sources to Delta?
This sprint: Provision a Fabric SQL database and test the auto-mirror with a non-critical workload.
This quarter: Migrate your highest-volume ingestion pipeline and measure CU savings.
Track: Mirror latency, CU consumption before/after, and pipeline maintenance hours eliminated.

SQL database in Fabric went GA in November 2025 with enterprise features including row-level security, customer-managed keys, and private endpoints. For the full list of GA capabilities, see the official announcement. To understand how this fits into the broader Microsoft database + Fabric integration strategy, read Microsoft Databases and Microsoft Fabric: Your unified and AI-powered data estate. For Spark-specific Delta Lake concepts, the Delta Lake documentation remains the authoritative reference.

The best thing about this announcement isn’t any single feature. It’s that it makes your Spark architecture simpler by removing the parts that shouldn’t have been there in the first place.

This post was written with help from Claude Opus 4.6

OneLake catalog in Microsoft Fabric: Explore, Govern, and Secure

If your Fabric tenant has grown past “a handful of workspaces,” the problem isn’t just storage or compute—it’s finding the right items, understanding what they are, and making governance actionable.

That’s the motivation behind the OneLake catalog: a central hub to discover and manage Fabric content, with dedicated experiences for discovery (Explore), governance posture (Govern), and security administration (Secure).

This post is a practical walk-through of what’s available today, with extra focus on what Fabric admins get in the Govern experience.

What is the OneLake catalog?

Microsoft describes the OneLake catalog as a centralized place to find, explore, and use Fabric items—and to govern the data you own.

You open it from the Fabric navigation pane by selecting the OneLake icon.

Explore tab: tenant-wide discovery without losing context

The Explore tab is the “inventory + details” experience:

An items list of Fabric content you can access (and in some cases, content you can request access to).
An in-context details pane so you can inspect an item without navigating away from your filtered list.
Filters and selectors to narrow scope (for example: workspace, item-type categories, endorsement, and tags).

A key pattern here is fast triage: filter down to a domain/workspace, then click through items to answer:

Who owns this?
Where does it live?
When was it refreshed?
Is it endorsed/certified?
Does it have sensitivity labeling?

Tip for data engineers

If your tenant uses domains, scoping the catalog to a domain/subdomain is often the quickest way to keep the item list meaningful—especially when teams create similar notebooks/pipelines across many workspaces.

Govern tab: governance posture + recommended actions

The Govern tab is where the catalog becomes more than “a directory.” It combines:

Insights (high-level indicators you can drill into)
Recommended actions (with step-by-step remediation guidance)
Links to relevant tools and learning resources

Admin view vs. data owner view

The Govern tab behaves differently depending on who you are:

Fabric admins see insights based on tenant metadata (items, workspaces, capacities, domains).
Data owners see insights scoped to items they own (using the My items concept).

The Fabric blog also calls out a preview experience that extends the OneLake catalog governance view for Fabric admins, providing consolidated indicators and deeper drill-down reporting.

What admins see on the Govern tab

From the Fabric admin perspective, the Govern experience is designed to answer:

What does our data estate look like (inventory, distribution, usage)?
Where are we under-labeled or non-compliant (sensitivity coverage, policy posture)?
What content is hard to trust or reuse (freshness, endorsement/description/tag coverage, sharing patterns)?

When admins choose View more, Learn documentation describes an expanded report with three areas:

Manage your data estate (inventory, capacities/domains, feature usage)
Protect, secure & comply (sensitivity label coverage and data loss prevention policy posture)
Discover, trust, and reuse (freshness, curation signals such as endorsement/description coverage, sharing)

A detail worth knowing: refresh cadence differs for admins

Per Microsoft Learn, admin insights and actions are based on Admin Monitoring Storage data and refresh automatically every day, so there can be a lag between changes you make and what the Govern insights reflect.

Secure tab: centralized security role management

The OneLake catalog Secure tab is a security administration surface that centralizes:

Workspace roles and permissions (for auditing access)
OneLake security roles across workspaces and item types

From the Secure tab, admins can create, edit, or delete OneLake security roles from a single location.

A practical workflow to adopt (teams + admins)

Here’s a lightweight approach that scales better than “ask around on Teams”:

Explore: Use domain/workspace scoping + filters to find candidate items.
Inspect: Use the in-context details pane to sanity-check ownership, endorsement, sensitivity, and freshness.
Govern: Use the recommended actions cards to drive a small number of measurable improvements:
- increase sensitivity label coverage
- improve endorsement/certification where appropriate
- standardize descriptions/tags for key assets
Secure: Audit role sprawl and standardize how OneLake security roles are managed across items.

Considerations and limitations to keep in mind

A few constraints called out in Learn documentation (useful when you’re setting expectations):

The Govern tab doesn’t support cross-tenant scenarios or guest users.
The Govern tab isn’t available when Private Link is activated.
Govern insights for admins can be up to a day behind due to daily refresh of admin monitoring storage.

References

This post was written with help from ChatGPT 5.2

OneLake Shortcuts + Spark: Practical Patterns for a Single Virtual Lakehouse

If you’ve adopted Microsoft Fabric, there’s a good chance you’re trying to reduce the number of ‘copies’ of data that exist just so different teams and engines can access it.

OneLake shortcuts are one of the core primitives Fabric provides to unify data across domains, clouds, and accounts by making OneLake a single virtual data lake namespace.

For Spark users specifically, the big win is that shortcuts appear as folders in OneLake—so Spark can read them like any other folder—and Delta-format shortcuts in the Lakehouse Tables area can be surfaced as tables.

What a OneLake shortcut is (and isn’t)

A shortcut is an object in OneLake that points to another storage location (internal or external to OneLake).

Shortcuts appear as folders and behave like symbolic links: deleting a shortcut doesn’t delete the target, but moving/renaming/deleting the target can break the shortcut.

From an engineering standpoint, that means you should treat shortcuts as a namespace mapping layer—not as a durability mechanism.

Where you can create shortcuts: Lakehouse Tables vs Files

In a Lakehouse, you create shortcuts either under the top-level Tables folder or anywhere under the Files folder.

Tables has constraints: OneLake doesn’t support shortcuts in subdirectories of the Tables folder, and shortcuts in Tables are typically meant for targets that conform to the Delta table format.

Files is flexible: there are no restrictions on where you can create shortcuts in the Files hierarchy, and table discovery does not happen there.

If a shortcut in the Tables area points to Delta-format data, the lakehouse can synchronize metadata and recognize the folder as a table.

One documented gotcha: the Delta format doesn’t support table names with space characters, and OneLake won’t recognize any shortcut containing a space in the name as a Delta table.

How Spark reads from shortcuts

In notebooks and Spark jobs, shortcuts appear as folders in OneLake, and Spark can read them like any other folder.

For table-shaped data, Fabric automatically recognizes shortcuts in the Tables section of the lakehouse that have Delta/Parquet data as tables—so you can reference them directly from Spark.

Microsoft Learn also notes you can use relative file paths to read data directly from shortcuts, and Delta shortcuts in Tables can be read via Spark SQL syntax.

Practical patterns (what I recommend in real projects)

Pattern 1: Use Tables shortcuts for shared Delta tables you want to show up consistently across Fabric engines (Spark + SQL + Direct Lake scenarios via semantic models reading from shortcuts).

Pattern 2: Use Files shortcuts when you need arbitrary formats or hierarchical layouts (CSV/JSON/images, nested partitions, etc.) and you’re fine treating it as file access.

Pattern 3: Prefer shortcuts over copying/staging when your primary goal is to eliminate edge copies and reduce latency from data duplication workflows.

Pattern 4: When you’re operationalizing Spark notebooks, make the access path explicit and stable by using the shortcut path (the place it appears) rather than hard-coding a target path that might change.

Operational gotchas and guardrails

Because moving/renaming/deleting a target path can break a shortcut, add lightweight monitoring for “broken shortcut” failures in your pipelines (and treat them like dependency failures).

For debugging, the lakehouse UI can show the ABFS path or URL for a shortcut in its Properties pane, which you can copy for inspection or troubleshooting.

Outside of Fabric, services can access OneLake through the OneLake API, which supports a subset of ADLS Gen2 and Blob storage APIs.

Summary

Shortcuts give Spark a clean way to treat OneLake like a unified namespace: read shortcuts as folders, surface Delta/Parquet data in Tables as tables, and keep your project’s logical paths stable even when physical storage locations vary.

References

Unify data sources with OneLake shortcuts (Microsoft Learn)
Access OneLake shortcuts in an Apache Spark notebook (Microsoft Learn)
OneLake access and APIs (Microsoft Learn)

This post was written with help from ChatGPT 5.2

Unveiling Microsoft OneLake: A Unified Intelligent Data Foundation

Microsoft recently introduced OneLake, a part of Microsoft Fabric, designed to accelerate data potential for the era of AI. One Lake provides a unified intelligent data foundation for all analytic workloads, integrating Power BI, Data Factory, and the next generation of Synapse. This solution offers customers a high-performing and easy-to-manage modern analytics solution.

OneLake: The OneDrive for All Your Data

OneLake provides a single data lake for your entire organization. For every Fabric tenant, there will always be exactly one OneLake, never two, never zero. There is no infrastructure to manage or set up. The concept of a tenant is a unique benefit of a SaaS service. It allows Microsoft to automatically provide a single management and governance boundary for the entire organization, which is ultimately under the control of a tenant admin.

Breaking down Data Silos with OneLake

OneLake aims to provide a data lake as a service without you needing to build it yourself. It enables different business groups to work independently without going through a central gatekeeper. Different workspaces allow different parts of the organization to work independently while still contributing to the same data lake. Each workspace can have its own administrator, access control, region, and capacity for billing.

OneLake: Spanning the Globe

OneLake covers this by spanning the globe as well. Different workspaces can reside in different regions. This means that any data stored in those workspaces will also reside in those countries. OneLake is built on top of Azure Data Lake Storage Gen2 under the covers. It will use multiple storage accounts in different regions, however, OneLake will virtualize them into one logical lake.

OneLake: Open Data Lake

OneLake is not just a Fabric data lake or a Microsoft data lake, it is an open data lake. In addition to being built on ADLS Gen2, OneLake supports the same ADLS Gen2 APIs and SDKs, making it compatible with existing ADLs applications, including Azure Databricks and Azure HDInsights.

OneLake: One Copy

OneLake with One Copy aims to get the most value possible out of a single copy of data without data movement or duplication. It allows data to be virtualized into a single data product without data movement, data duplication, or changing the ownership of the data.

OneLake: One Security

One Security is a feature in active development that aims to let you secure the data once and use it anywhere. One Security will bring a shared universal security model which you will define in OneLake. These security definitions will live alongside the data itself. This is an important detail. Security will live with the data rather than living downstream in the serving or presentation layers.

OneLake Data Hub

The OneLake Data Hub is the central location within Fabric to discover, manage, and reuse data. It serves all users from data engineer to business user. Data can easily be discovered by its domain, for example, Finance, HR, or Sales, so users find what actually matters to them.

In conclusion, OneLake is a game-changer in the world of data management and analytics. It provides a unified, intelligent data foundation that breaks down data silos, enabling organizations to harness the full potential of their data in the era of AI.

This blogpost was created with help from ChatGPT Pro.

Open Mirroring + OneLake: Spark patterns that keep latency from eating your weekends

first principle: mirrored is a landing zone, not a serving layer

what actually causes the latency cliff

1) file layout drift (the silent killer)

2) schema drift (the 3 a.m. pager)

lag is real, even when everything is healthy

the production checklist (use this before go-live)

architecture that holds up when volume gets ugly

Share this:

What the February 2026 Fabric Influencers Spotlight means for your Spark team

Signal 1: Get your production code out of notebooks

Signal 2: Direct Lake changes how you write to your lakehouse

Signal 3: CI/CD for Fabric just got real

The deeper pattern

Your rollout checklist for March

Share this:

How Spark, OneLake, and mirroring connect (and where they don’t)

Detection signals you actually need

Failure mode 1: the Spark write that succeeds but isn’t queryable yet

Failure mode 2: mirroring goes quiet

Failure mode 3: shortcut permissions drift

Failure mode 4: capacity throttling disguised as five different problems

Assembling the runbook

Share this:

The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

What the connector actually is

Reading: the part that mostly just works

Writing: the part with surprises

How writes actually work under the hood

Save modes

The timestamp_ntz gotcha

What you can’t write to

Private Link limitations

Time Travel is not supported

When to use Warehouse vs. Lakehouse as your serving layer

A concrete pattern: Spark ETL → Warehouse serving layer

Cross-database queries: the glue between them

Performance notes from the field

The honest summary

Share this:

Mental model: OneLake is the handshake

Start here: 3-Part Naming over the SQL Endpoint

When this is enough (and it often is)

When to materialize instead

The CU tradeoff

Pattern 1 — Spark → Warehouse via OneLake files (COPY INTO + OPENROWSET)

When to use it

Step 1: Write a “handoff” dataset from Spark

Step 2: Inspect the file shape from the Warehouse (OPENROWSET)

Step 3: Ingest into a Warehouse table (COPY INTO)

Operational guardrails

Pattern 2 – Spark → Warehouse via in-workspace tables (CTAS / INSERT…SELECT)

When to use it

Step 1: Produce a curated Delta table with Spark

Step 2: Materialize a Warehouse table from the Lakehouse table

Pattern 3 – Warehouse → Spark via published Delta logs (read-only)

Step 1: Get the OneLake path for a Warehouse table

Step 2: Read the Warehouse table from Spark (read-only)

Bonus control: pause Delta log publishing

Choosing an ownership model (so you don’t end up with two sources of truth)

Quick checklist: production-hardening the Spark ↔ Warehouse boundary

Summary

References

Share this:

Your operational data now lands in OneLake automatically

How this changes daily Spark development

Migration risks worth planning for

A rollout checklist for Spark teams

Vector search: a side door into AI workloads

What to expect next

Share this:

The ETL Pipeline You Can Delete

What This Actually Looks Like in Practice

The Migration Risk Nobody’s Talking About

The AI Angle Matters for Spark Teams Too

The Rollout Checklist

Share this:

What is the OneLake catalog?

Explore tab: tenant-wide discovery without losing context

Tip for data engineers