
Open Mirroring + OneLake: Spark patterns that keep latency from eating your weekends
Dev is clean. Prod is chaos. In dev, your mirrored table has a cute little dataset and Spark tears through it. In prod, that same notebook starts wheezing like it ran a marathon in wet jeans.
If that sounds familiar, good. You’re not cursed. You’re running into architecture debt that Open Mirroring does not solve for you.
Open Mirroring in Microsoft Fabric does exactly what it says on the tin: it replicates data from external systems into OneLake as Delta tables, and schema changes in the source can flow through. That’s huge. It cuts out a pile of ingestion plumbing.
But mirroring only lands data. It does not guarantee your Spark reads will be fast, stable, or predictable. That’s your job.
This post is the practical playbook: what breaks, why it breaks, and the patterns that keep your Spark jobs from turning into slow-motion disasters.
first principle: mirrored is a landing zone, not a serving layer
Treat mirrored tables like an airport runway. Planes touch down there. People do not set up a picnic on the tarmac.
When teams read mirrored tables directly in hot-path jobs, they inherit whatever file layout the connector produced. Sometimes that layout is fine. Sometimes it is a junk drawer.
Spark is sensitive to this. Reading many tiny files creates scheduling and metadata overhead. Reading a few huge files kills parallelism. Either way, the cluster burns time doing the wrong work.
The fix is boring and absolutely worth it: add a curated read layer.
- Let Open Mirroring write into a dedicated mirror lakehouse.
- Run a post-mirror notebook that reshapes data for Spark (partitioning, compaction, cleanup).
- Have production notebooks read curated tables only.
One extra hop. Much better nights of sleep.
what actually causes the latency cliff
Two things usually punch you in the face at scale:
- File layout drift
- Schema drift
Let’s tackle them in order.
1) file layout drift (the silent killer)
Spark scheduling is roughly file-driven for Parquet/Delta scans. That means file shape becomes execution shape. If your table has wildly uneven files, your job speed is set by the stragglers.
Think of ten checkout lanes where nine customers have one item and one customer has a full garage sale cart. Everyone waits on that last lane.
Start by measuring file distribution, not just row counts.
```python
from pyspark.sql import functions as F

# NOTE: inputFiles() returns a Python list of file paths
df = spark.read.format("delta").load("Tables/raw_mirrored_orders")
paths = df.inputFiles()

# Use the Hadoop FileSystem API to get file sizes in bytes.
# Resolve the filesystem from an actual file path so this works even
# when the table does not live on the default filesystem.
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.Path(paths[0]).getFileSystem(conf)

sizes = []
for p in paths:
    size = fs.getFileStatus(jvm.org.apache.hadoop.fs.Path(p)).getLen()
    sizes.append((p, size))

size_df = spark.createDataFrame(sizes, ["path", "size_bytes"])
size_df.select(
    F.count("*").alias("file_count"),
    F.round(F.avg("size_bytes") / 1024 / 1024, 2).alias("avg_mb"),
    F.round(F.expr("percentile_approx(size_bytes, 0.5)") / 1024 / 1024, 2).alias("p50_mb"),
    F.round(F.expr("percentile_approx(size_bytes, 0.9)") / 1024 / 1024, 2).alias("p90_mb"),
    F.round(F.max("size_bytes") / 1024 / 1024, 2).alias("max_mb"),
).show(truncate=False)
```
You want a tight-ish band, not chaos. A common rule of thumb is targeting roughly 128 MB to 512 MB Parquet files for balanced throughput and parallelism. Rule of thumb, not religion. Your workload decides final tuning.
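To turn that rule of thumb into a concrete target, you can back out a file count from total table size before compacting. A minimal sketch (pure Python; the 256 MB target is just the midpoint of the band, an assumption to tune for your workload):

```python
def target_file_count(total_bytes: int, target_file_mb: int = 256) -> int:
    """Estimate how many files a table should compact into,
    given a target file size near the middle of the 128-512 MB band."""
    target_bytes = target_file_mb * 1024 * 1024
    # Ceiling division, so small tables still get at least one file
    return max(1, -(-total_bytes // target_bytes))

# A 100 GB table at ~256 MB per file lands around 400 files
print(target_file_count(100 * 1024**3))  # → 400
```

If the measurement above reports thousands of files for a table that should need a few hundred, compaction in the curated layer is overdue.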
Then enforce a sane shape in curated tables:
```python
raw = spark.read.format("delta").load("Tables/raw_mirrored_orders")

(raw.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")  # choose columns your queries actually filter on
    .option("maxRecordsPerFile", 500000)
    .save("Tables/curated_orders"))

spark.sql("OPTIMIZE delta.`Tables/curated_orders`")
```
If your queries filter by date and region, but you partition by customer_id because it “felt right,” you built a latency trap with your own hands.
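One way to catch that trap mechanically is to scan the physical plan text for an empty `PartitionFilters` list. A sketch (the exact plan wording is Spark/Delta version dependent, so treat the string matching as an assumption to verify against your own `explain` output):

```python
import re

def partition_filters_applied(plan_text: str) -> bool:
    """Return True if the plan text shows a non-empty PartitionFilters list.

    Feed it the string from df._jdf.queryExecution().toString() or the
    captured output of df.explain(True). The 'PartitionFilters' label is
    what Delta/Parquet FileScan nodes emit in recent Spark versions, but
    confirm the spelling on your runtime.
    """
    match = re.search(r"PartitionFilters:\s*\[([^\]]*)\]", plan_text)
    if match is None:
        return False  # no partitioned file scan found at all
    return match.group(1).strip() != ""

# Example plan fragments (abbreviated)
pruned = "FileScan parquet ... PartitionFilters: [isnotnull(order_date)] ..."
full_scan = "FileScan parquet ... PartitionFilters: [] ..."
print(partition_filters_applied(pruned))     # → True
print(partition_filters_applied(full_scan))  # → False
```

Run it against the plans of your most common production queries; an empty list means Spark is scanning every partition regardless of your filters.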
2) schema drift (the 3 a.m. pager)
Open Mirroring can propagate source schema changes. That’s useful and dangerous.
Useful because your lake stays aligned. Dangerous because downstream logic often assumes a fixed shape.
A nullable column addition is usually fine. A type shift on a key metric column can quietly corrupt aggregations or explode at runtime.
Do not “notice this later.” Gate on it.
```python
import json

from pyspark.sql.types import StructType

# Store the baseline schema as JSON in Files/schemas/orders_baseline.json
with open("/lakehouse/default/Files/schemas/orders_baseline.json", "r") as f:
    baseline = StructType.fromJson(json.load(f))

current = spark.read.format("delta").load("Tables/raw_mirrored_orders").schema

base = {f.name: str(f.dataType) for f in baseline.fields}
curr = {f.name: str(f.dataType) for f in current.fields}

type_changes = [
    f"{name}: {base[name]} -> {curr[name]}"
    for name in curr
    if name in base and base[name] != curr[name]
]
new_cols = [name for name in curr if name not in base]

if type_changes:
    raise ValueError(f"Schema type changes detected: {type_changes}")

# Optional policy: allow new nullable columns but log them
if new_cols:
    print(f"New columns detected: {new_cols}")
```
Policy matters more than code here. Decide in advance what is auto-accepted versus what blocks the pipeline. Write it down. Enforce it every run.
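One lightweight way to write that policy down is as data the gate reads, rather than logic buried in a notebook. A sketch (the policy categories and helper are illustrative, not a Fabric API):

```python
# Hypothetical policy: which drift categories pass, warn, or block the run.
SCHEMA_POLICY = {
    "new_nullable_column": "warn",   # auto-accept, but log it
    "type_change": "block",          # never silently widen/narrow types
    "dropped_column": "block",       # downstream selects would fail anyway
}

def evaluate_drift(changes: dict, policy: dict) -> list:
    """Return blocking violations; an empty list means the run may proceed."""
    blocking = []
    for category, items in changes.items():
        action = policy.get(category, "block")  # unknown drift blocks by default
        if items and action == "block":
            blocking.append(f"{category}: {items}")
        elif items and action == "warn":
            print(f"[schema-gate] {category}: {items}")
    return blocking

violations = evaluate_drift(
    {"new_nullable_column": ["coupon_code"], "type_change": []},
    SCHEMA_POLICY,
)
if violations:
    raise ValueError(f"Schema policy violations: {violations}")
```

The point is that changing the policy is a reviewed config edit, not a code change someone sneaks in at 3 a.m.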
lag is real, even when everything is healthy
Mirroring pipelines are replication systems, not teleportation devices. There is always some delay between source commit and mirrored availability. Sometimes tiny. Sometimes not.
If your job blindly processes “last hour” windows without checking mirror freshness, you’ll create holes and call them “data quality issues” three weeks later.
Add a freshness gate before transformations. The metadata source is connector-specific, but the pattern is universal:
```python
from datetime import datetime, timedelta, timezone

# Example only: use the metadata table/view exposed by your mirroring setup
last_mirror_ts = spark.sql("""
    SELECT max(replication_commit_ts) AS ts
    FROM mirror_metadata.orders_status
""").collect()[0]["ts"]

required_freshness = datetime.now(timezone.utc) - timedelta(minutes=15)

if last_mirror_ts is None or last_mirror_ts < required_freshness:
    raise RuntimeError(
        f"Mirror not fresh enough. Last commit: {last_mirror_ts}, "
        f"required after: {required_freshness}"
    )
```
No freshness, no run. That one line saves you from publishing confident nonsense.
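If you'd rather wait briefly than fail outright (the mirror is often just a few minutes behind), wrap the gate in a bounded poll. A sketch with an injected fetch function so it works against any metadata source; the function and parameter names are assumptions, not a Fabric API:

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

def wait_for_freshness(
    fetch_last_commit: Callable[[], Optional[datetime]],
    max_lag: timedelta = timedelta(minutes=15),
    timeout_s: float = 300.0,
    poll_s: float = 30.0,
) -> datetime:
    """Poll the mirror's last-commit timestamp until it is within max_lag,
    or raise after timeout_s. fetch_last_commit wraps your metadata query
    (e.g. the spark.sql call above)."""
    deadline = time.monotonic() + timeout_s
    while True:
        ts = fetch_last_commit()
        cutoff = datetime.now(timezone.utc) - max_lag
        if ts is not None and ts >= cutoff:
            return ts  # fresh enough: let the run proceed
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Mirror still stale after {timeout_s}s (last commit: {ts})"
            )
        time.sleep(poll_s)
```

Keep the timeout short: a pipeline that waits forever is just a slower way of publishing nothing.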
the production checklist (use this before go-live)
Before promoting any mirrored-data Spark pipeline, run this checklist in the same capacity and schedule window as production:
- File shape check
  - Measure file count and distribution (p50, p90, max).
  - If distribution is ugly, compact and rewrite in curated.
- Partition sanity check
  - Confirm partitions match real filter predicates.
  - Use `df.explain(True)` and verify `PartitionFilters` is not empty for common queries.
- Schema gate check
  - Compare current schema to baseline.
  - Fail on type changes unless explicitly approved.
- Freshness gate check
  - Validate mirrored data is fresh enough for your downstream SLA.
  - Fail fast if not.
- Throughput reality check
  - Time representative full and filtered scans from curated tables.
  - If runtime misses SLA, fix layout first, then tune compute.
If you only do one thing from this list, do the curated layer. Direct reads from mirrored tables are the root of most performance horror stories.
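The throughput reality check is easy to automate: time a representative query and compare it against the SLA with a safety margin. A minimal sketch (`run_query` would wrap your actual Spark action, e.g. a filtered count on the curated table; the 0.8 margin is an assumption):

```python
import time
from typing import Callable

def check_scan_sla(run_query: Callable[[], object],
                   sla_s: float,
                   margin: float = 0.8) -> float:
    """Execute run_query once, return its wall-clock seconds, and fail if it
    exceeds margin * sla_s. The margin leaves headroom for prod-day variance."""
    start = time.monotonic()
    run_query()
    elapsed = time.monotonic() - start
    budget = sla_s * margin
    if elapsed > budget:
        raise RuntimeError(
            f"Scan took {elapsed:.1f}s, over the {budget:.1f}s budget ({sla_s}s SLA)"
        )
    return elapsed

# Usage against a curated table (illustrative):
# check_scan_sla(
#     lambda: spark.read.format("delta")
#         .load("Tables/curated_orders")
#         .filter("order_date >= '2024-01-01'")
#         .count(),
#     sla_s=120,
# )
```

Run it in the same capacity and schedule window as production, as the checklist says; a timing taken on an idle dev cluster proves nothing.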
architecture that holds up when volume gets ugly
Keep it simple:
- Mirror layer: Open Mirroring lands source data in OneLake Delta tables. This is your raw replica.
- Curation job: a scheduled Spark notebook validates schema, reshapes partitions, and compacts files.
- Curated layer: downstream Spark notebooks and SQL consumers read here, not from mirror tables.
- Freshness gate: every downstream run checks replication freshness before processing.
That’s it. No heroics. No mystery knobs. Just a clean boundary between “data landed” and “data is ready to serve.”
Open Mirroring is genuinely powerful, but it is not magic. If you treat mirrored tables as production-ready serving tables, latency will eventually kneecap you. If you treat them as a landing zone and curate aggressively, Spark behaves, stakeholders stay calm, and your weekends stay yours.
This post was written with help from anthropic/claude-opus-4-6
