
From CDC to lakehouse: making Fabric Eventstreams SQL survive contact with production Spark
Every data team eventually has the same bright idea: “Let’s do CDC so everything is real time.”
What follows is usually less bright.
Somebody wires up connectors, somebody else stands up Kafka, somebody definitely provisions a VM that nobody can later identify, and before long your “modern architecture” has one person who understands it, one person who is afraid of it, and one person who is on call for it. Usually the same person.
So yes, Fabric Eventstreams supporting native CDC connectors for Azure SQL, PostgreSQL, MySQL, and SQL Server sources matters. It removes a lot of plumbing work that used to be mandatory. More importantly, Eventstreams SQL can give you a place to interpret CDC events before they hit your lakehouse and Spark jobs.
That changes the shape of the problem. Not the existence of the problem. Just the shape.
And if you want this to run cleanly at 2:00 AM, the operational details matter more than the architecture diagram.
What Eventstreams SQL actually fixes
Raw CDC events are not analyst-friendly data. They are little envelopes full of intent and drama: insert, update, delete, before image, after image, metadata about the source transaction, and enough ambiguity to start arguments in code review.
If you ship those raw events downstream, every Spark notebook has to interpret them. That means duplicate merge logic and subtle differences between implementations. Two teams can read the same feed and produce slightly different answers. That is how trust in a data platform dies quietly.
Eventstreams SQL can resolve some of those semantics earlier. You can translate event-level changes into cleaner, ready-to-consume records before data lands in destinations.
That can be useful, but it is also where teams start sneaking business logic into the stream layer and then regretting it later.
The bigger question is not just where merge logic belongs. It is where CDC interpretation belongs at all.
The merge logic decision you cannot avoid
You have two broad options:
- Push CDC interpretation upstream into Eventstreams SQL before landing.
- Treat Eventstream primarily as a transport layer, land raw or minimally altered CDC into staging, and resolve table semantics in the target engine.
I think option 2 is the better default.
Why? Because once you start doing meaningful CDC interpretation in the stream layer, you now have business logic living in the place that is hardest to reason about, hardest to test, and easiest to forget. You also make it much easier for different downstream systems to drift away from each other.
A cleaner pattern is:
- use Eventstream for ingestion, routing, and maybe very light filtering
- land into a staging layer
- let the target system own merge semantics
That means Azure SQL should own MERGE logic for Azure SQL targets. Lakehouse targets should use Spark or Delta MERGE INTO. The compute engine that owns the table should own the table semantics too.
Trying to make the stream layer do more than that is how teams end up with hidden logic, debugging hell, and architecture diagrams that look cleaner than the actual system.
One important caveat: Eventstreams SQL is not a substitute for Delta MERGE INTO on a Lakehouse table.
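To make "the target engine owns merge semantics" concrete, here is a minimal sketch of target-owned Delta merge logic. The table names, key columns, and the `op` change-type column are illustrative assumptions about the staging layout, not a fixed Fabric schema; in a notebook you would run the generated statement with `spark.sql`.

```python
# Hypothetical helper: build a Delta MERGE statement that applies staged CDC
# rows to a curated table. Table names, key columns, and the `op` column
# (carrying 'insert'/'update'/'delete') are illustrative assumptions.
def build_cdc_merge_sql(target, staging, key_cols, data_cols):
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    updates = ", ".join(f"t.{c} = s.{c}" for c in data_cols)
    cols = ", ".join(key_cols + data_cols)
    vals = ", ".join(f"s.{c}" for c in key_cols + data_cols)
    return (
        f"MERGE INTO {target} t USING {staging} s ON {on} "
        f"WHEN MATCHED AND s.op = 'delete' THEN DELETE "
        f"WHEN MATCHED THEN UPDATE SET {updates} "
        f"WHEN NOT MATCHED AND s.op <> 'delete' THEN INSERT ({cols}) VALUES ({vals})"
    )

sql = build_cdc_merge_sql(
    "curated.orders", "staging.orders_cdc",
    key_cols=["order_id"], data_cols=["status", "amount"],
)
# In a Fabric notebook: spark.sql(sql)
```

Note the implicit assumption: Delta MERGE requires at most one staging row per key per batch, so in practice you deduplicate staging down to the latest change per key before merging.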
Checkpoints: boring, critical, and often broken by accident
Spark Structured Streaming checkpointing is one of those things everybody “knows” until a restart fails and nobody remembers how it works.
Checkpoint locations track stream progress. They are state, not decoration. They are tied to your query plan, and when you change schema or query structure, old checkpoint state may no longer be valid.
This is not an edge case. It is normal lifecycle behavior in evolving pipelines.
Three rules keep you out of trouble:
- Use distinct checkpoint paths per stream and per target table.
- Version checkpoint paths when query shape or schema changes.
- Watch lag between source offsets and committed checkpoint progress.
If you use one checkpoint path for multiple sinks, you are building future pain on purpose. If you change query shape without checkpoint versioning, restart failures are only a matter of timing.
Treat checkpoint migration as a cutover process. Track where old progress stopped, cut to a new checkpoint path intentionally, then retire the previous one once the new job is stable.
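One way to make those rules mechanical is a naming convention. The base path and version scheme below are assumptions (not a Fabric requirement), but encoding stream, table, and version in the path makes the first two rules hard to violate by accident:

```python
# Hypothetical convention: one checkpoint path per (stream, target table),
# with an explicit version segment bumped whenever the query shape or
# schema changes. The "Files" base path is illustrative.
def checkpoint_path(base, stream_name, target_table, version):
    return f"{base}/checkpoints/{stream_name}/{target_table}/v{version}"

old_path = checkpoint_path("Files", "orders_cdc", "staging_orders", 1)
new_path = checkpoint_path("Files", "orders_cdc", "staging_orders", 2)
# Cutover: start the new job on new_path, confirm it is healthy and has
# caught up, then retire old_path.
```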
The small files problem is not glamorous, but it will hurt you
Most CDC pipelines do not fail dramatically. They fail by becoming slower each week until everyone pretends that 90 seconds is “pretty fast.”
Small files are often the culprit.
CDC streams produce frequent, small increments. Structured Streaming writes micro-batches. Direct lakehouse writes can also produce many tiny files depending on event cadence. Over time, table reads pay the cost in file listing and metadata overhead.
People love to ignore this because compaction feels like janitorial work. It is not. It is core performance engineering.
What works in practice:
- Repartition before write based on available Spark cores.
- Partition on-disk by ingestion date, and only add other partition keys when query patterns justify it.
- Do not partition by operation type. That creates tiny partitions and extra noise.
- Run regular OPTIMIZE jobs on high-volume CDC tables.
If you are writing through Spark, control file behavior with repartitioning and trigger cadence. A `trigger(processingTime='30 seconds')` or `trigger(processingTime='2 minutes')` can reduce file explosion compared with ultra-frequent batches.
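A sketch of that write path, assuming a streaming DataFrame `df` and a `spark` session in a Fabric notebook; the table name, checkpoint path, and file-count cap are all illustrative assumptions:

```python
# Cap output partitions so each micro-batch writes a bounded number of
# files instead of one file per shuffle partition. The cap is a tunable
# assumption, not a fixed rule.
def target_partitions(default_parallelism, max_files_per_batch=8):
    return max(1, min(default_parallelism, max_files_per_batch))

# Sketch of the Spark write (names are assumptions):
#
# (df.repartition(target_partitions(spark.sparkContext.defaultParallelism))
#    .writeStream
#    .format("delta")
#    .trigger(processingTime="2 minutes")
#    .option("checkpointLocation", "Files/checkpoints/orders_cdc/staging_orders/v1")
#    .toTable("staging_orders"))
```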
If you are using direct Eventstreams-to-Lakehouse writes, accept that you are trading simplicity for less control and schedule compaction accordingly.
The exact maintenance workflow matters less than having one. One-off cleanup is fine when you are exploring, but scheduled maintenance is what keeps tables healthy over time.
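As one shape such a scheduled workflow can take: a small job that generates the Delta maintenance statements per table. The retention window here is a placeholder; align it with your time-travel and recovery requirements.

```python
# Hypothetical maintenance job: the statements Delta Lake provides for
# compaction and stale-file cleanup, generated per table. The 168-hour
# retention window is an assumption.
def maintenance_statements(table, retain_hours=168):
    return [
        f"OPTIMIZE {table}",                            # compact small files
        f"VACUUM {table} RETAIN {retain_hours} HOURS",  # remove unreferenced files
    ]

stmts = maintenance_statements("curated.orders")
# In a scheduled notebook: for s in stmts: spark.sql(s)
```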
Deletes: decide your philosophy before compliance decides for you
In CDC, inserts and updates are easy to reason about. Deletes are where architecture gets emotional.
For analytics, soft deletes are often the sane default: keep the row, mark is_deleted, add deleted_at, preserve history. This keeps downstream trend analysis and audit trails intact.
Hard deletes are different. If compliance requires physical removal, handle that intentionally, usually with batch logic that applies delete events against target Delta tables after landing.
A reliable pattern is:
- Stream all CDC events, including deletes, into staging.
- Run scheduled jobs that apply physical deletion rules to curated tables.
That keeps streaming simple and pushes irreversible operations into auditable, controllable execution windows.
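The scheduled hard-delete step can be as small as generating one statement per curated table. The `op` and `_commit_ts` columns and the watermark mechanism are assumptions about the staging layout; in production you would parameterize the timestamp rather than interpolate it into the string.

```python
# Hypothetical batch step: apply physical deletes from staged CDC events to
# a curated table. Column names (`op`, `_commit_ts`, the key column) and the
# watermark approach are assumptions.
def build_hard_delete_sql(target, staging, key_col, since_ts):
    return (
        f"DELETE FROM {target} WHERE {key_col} IN "
        f"(SELECT {key_col} FROM {staging} "
        f"WHERE op = 'delete' AND _commit_ts > '{since_ts}')"
    )

sql = build_hard_delete_sql("curated.customers", "staging.customers_cdc",
                            "customer_id", "2025-01-01T00:00:00")
# Scheduled job: spark.sql(sql), then record the new watermark for auditing.
```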
Could you do something fancier? Probably. Should you, before you need to? Probably not.
Monitoring: minimum viable or maximum regret
A CDC pipeline with no alerting is just a suspense novel written in production.
Your baseline should cover four things:
- Stream health: is each Structured Streaming query active or terminated?
- Processing lag: how far are committed offsets behind source offsets?
- File accumulation: are table file counts growing faster than compaction can handle?
- Source silence: are you receiving events at all from CDC sources?
That last one matters because “no errors” does not mean “healthy.” If CDC gets disabled during maintenance, your pipeline can fail by producing nothing, which looks calm unless you explicitly monitor for inactivity windows.
Fabric Activator-based alerts can be useful for surfacing threshold breaches. Tie thresholds to actual SLAs, not vibes.
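The silence check in particular is worth spelling out, because it is the one teams skip. A minimal sketch of the decision logic: `is_active` mirrors a Structured Streaming query's `isActive` flag, `seconds_since_last_input` would be derived from `lastProgress` timestamps, and the thresholds are placeholders to be replaced by your actual SLAs.

```python
# Hypothetical alert rules over stream state. Inputs mirror what you can
# read off a StreamingQuery (isActive, lastProgress); thresholds are
# placeholder SLAs, not recommendations.
def stream_alerts(is_active, seconds_since_last_input, lag_seconds,
                  max_silence=900, max_lag=300):
    alerts = []
    if not is_active:
        alerts.append("stream terminated")
    if seconds_since_last_input > max_silence:
        alerts.append("source silent")   # "no errors" is not "healthy"
    if lag_seconds > max_lag:
        alerts.append("processing lag")
    return alerts

assert stream_alerts(True, 60, 30) == []
assert stream_alerts(True, 1200, 30) == ["source silent"]
```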
A practical starting playbook
If you are standing this up now, keep it simple:
- Enable CDC at the source (`sys.sp_cdc_enable_db` and `sys.sp_cdc_enable_table` where applicable).
- Validate flow end to end with one real table before scaling breadth.
- Segment tables early: simple merge logic in Eventstreams SQL, complex logic in Spark.
- Define checkpoint path standards before the first production deploy.
- Pick trigger intervals that balance latency with file quality.
- Schedule OPTIMIZE from day one, not after performance complaints.
- Document merge ownership per table so changes do not become archaeology.
None of this is exotic. That is exactly the point.
Good CDC architecture is usually not a story about cleverness. It is a story about disciplined boring decisions made early, then repeated consistently.
Final take
Fabric Eventstreams plus Spark can give teams a cleaner CDC path than the old connector-plus-consumer patchwork. Native CDC connectors can reduce integration grind. But I would still keep meaningful CDC interpretation and merge behavior in the target compute engine whenever possible. Spark Structured Streaming remains a practical choice for controlled writes and advanced merge behavior.
But the real success criteria are operational.
If you manage checkpoints like real state, control file growth before it controls you, choose a deliberate delete strategy, and wire up monitoring that catches silence as well as failure, this architecture can work well in production.
If you skip those details, it still works right up until the exact moment it doesn’t, which usually happens late, loud, and at the least convenient hour in human history.
That is less a Fabric problem than a production engineering problem. Fabric can simplify parts of the workflow, but it does not remove the need for operational discipline.
This post was written with help from anthropic/claude-opus-4-6