What “Recent data” in Fabric means for Spark teams when time is the real bottleneck

At 8:07 a.m., nobody on a data engineering team is debating architecture purity. You’re trying to get back to the exact source you were fixing yesterday before another downstream notebook fails and somebody asks for an ETA.

That’s the problem Microsoft Fabric’s Recent data feature targets.

The feature landed in the February 2026 Fabric update and is currently in preview. It sounds small: Dataflow Gen2 remembers the specific items you used recently — tables, files, folders, databases, and sheets — and lets you load them directly into the editing canvas. For Spark-heavy teams, though, this is less of a UX tweak and more of a way to stop bleeding time in the first mile of work.

And yes, it’s still a preview feature. Treat it like a mountain route in unstable weather: useful, fast, and not something you trust blindly.

Why Spark teams should care about a Dataflow feature

A lot of Spark teams still frame Dataflow Gen2 as somebody else’s tool. That framing is outdated.

Dataflow Gen2 automatically creates staging Lakehouse and Warehouse items in your workspace. If your team’s workflow includes Dataflow-based ingestion and Spark-based transformation, the handoff between those steps is real. It’s your daily route.

Here’s the hard lesson: if your ingestion layer touches Dataflow Gen2, then UI friction inside Dataflow is your Spark team’s problem too.

What to do about it:

Write down your ingestion handoffs in plain language: source to Dataflow Gen2 to staging Lakehouse/Warehouse to Spark notebooks.
Mark where engineers repeatedly reconnect to the same sources. That’s where Recent data pays off first.

What Recent data changes under pressure

Recent data does one thing that matters: it remembers specific assets, not just abstract connections.

When you return to a fix, you’re not restarting the expedition from base camp. You get dropped closer to the problem. You can pull the item directly into the editing canvas and keep moving.

For teams, this changes the rhythm of incident response and iteration:

You get back to source-level corrections faster.
You reduce the chance that someone reconnects to the wrong similarly-named object while moving too fast.
You spend less team energy on navigation and more on data correctness.

None of this is glamorous. It’s also exactly where engineering throughput gets won.

Try this: during your next defect cycle, track one metric for a week — time from “issue found” to “source query/table reopened in Dataflow Gen2.” If that number drops after using Recent data, keep leaning in. If it doesn’t, your bottleneck is elsewhere.

What this feature doesn’t rescue you from

Teams love to over-credit new features. Recent data is a navigation accelerator. It’s not governance. It’s not validation. It’s not a replacement for naming discipline. And because it’s in preview, it’s not a foundation for critical operational assumptions.

If your source naming is chaotic, Recent data will surface chaos faster.

If your validation is weak, Recent data will help you ship mistakes sooner.

If your runbooks are vague, Recent data won’t magically teach new engineers what “correct” looks like.

Pair it with a minimum Spark validation pass after ingestion updates: schema check, null expectation, row-count sanity check. Keep this lightweight and repeatable. The point is fast feedback, not ceremony.

Preview discipline: run this like a survival checklist

Because Recent data is in preview, your team should operate with explicit guardrails.

Test in development first. Don’t roll workflow assumptions into production muscle memory before your team has used the feature in real edits.

Keep a source-of-truth map. Recent data is convenience. Your documented source map is control. Keep both.

Standardize names now. If a human can confuse two source objects at a glance, they will. Fix names before speed amplifies mistakes.

Define a fallback path. If the recent list doesn’t have what you need, nobody should improvise. Document the manual reconnect path and keep it current.

Review preview behavior monthly. If the feature behavior shifts while in preview, your team should notice fast and adjust intentionally. Assign one owner for “preview watch” each month. Their job: test the core flow, confirm assumptions still hold, alert the team if anything drifts.

The operating model for Spark leads

If you lead a Spark data engineering team, the decision is straightforward.

Use Recent data. Absolutely use it. But use it like a rope, not like wings.

A rope gets you through rough terrain faster when the team is clipped in, communicating, and following route discipline. Wings are what people imagine they have right before they step into empty air.

In practice:

Adopt the feature for speed.
Keep your documentation for continuity.
Keep naming conventions strict for safety.
Keep Spark-side validation for quality.
Treat preview status as a real risk signal, not legal fine print.

That combination is where this feature becomes meaningful. Not because it’s flashy. Because it removes repeated friction at exactly the point where your team loses focus, burns time, and compounds small mistakes.

In data engineering, the catastrophic failures usually start as tiny oversights repeated at scale. Recent data removes one class of those oversights — the constant re-navigation tax — but only if you wrap it in disciplined operating habits.

One less avoidable stumble on steep ground, so your team can spend its strength on the parts of the climb that actually require judgment.

This post was written with help from anthropic/claude-opus-4-6

From CDC to Lakehouse: Making Fabric Eventstreams SQL Survive Contact with Production Spark

Every data team eventually has the same bright idea: “Let’s do CDC so everything is real time.”

What follows is usually less bright.

Somebody wires up connectors, somebody else stands up Kafka, somebody definitely provisions a VM that nobody can later identify, and before long your “modern architecture” has one person who understands it, one person who is afraid of it, and one person who is on call for it. Usually the same person.

So yes, Fabric Eventstreams supporting native CDC connectors for Azure SQL, PostgreSQL, MySQL, and SQL Server sources matters. It removes a lot of plumbing work that used to be mandatory. More importantly, Eventstreams SQL can give you a place to interpret CDC events before they hit your lakehouse and Spark jobs.

That changes the shape of the problem. Not the existence of the problem. Just the shape.

And if you want this to run cleanly at 2:00 AM, the operational details matter more than the architecture diagram.

What Eventstreams SQL actually fixes

Raw CDC events are not analyst-friendly data. They are little envelopes full of intent and drama: insert, update, delete, before image, after image, metadata about the source transaction, and enough ambiguity to start arguments in code review.

If you ship those raw events downstream, every Spark notebook has to interpret them. That means duplicate merge logic and subtle differences between implementations. Two teams can read the same feed and produce slightly different answers. That is how trust in a data platform dies quietly.

Eventstreams SQL can resolve some of those semantics earlier. You can translate event-level changes into cleaner, ready-to-consume records before data lands in destinations.

That can be useful, but it is also where teams start sneaking business logic into the stream layer and then regretting it later.

The bigger question is not just where true merge logic belongs. It is where CDC interpretation belongs at all.

The merge logic decision you cannot avoid

You have two broad options:

Push CDC interpretation upstream into Eventstreams SQL before landing.
Treat Eventstream primarily as a transport layer, land raw or minimally altered CDC into staging, and resolve table semantics in the target engine.

I think option 2 is the better default.

Why? Because once you start doing meaningful CDC interpretation in the stream layer, you now have business logic living in the place that is hardest to reason about, hardest to test, and easiest to forget. You also make it much easier for different downstream systems to drift away from each other.

A cleaner pattern is:

use Eventstream for ingestion, routing, and maybe very light filtering
land into a staging layer
let the target system own merge semantics

That means Azure SQL should own MERGE logic for Azure SQL targets. Lakehouse targets should use Spark or Delta MERGE INTO. The compute engine that owns the table should own the table semantics too.

Trying to make the stream layer do more than that is how teams end up with hidden logic, debugging hell, and architecture diagrams that look cleaner than the actual system.

One important caveat: Eventstreams SQL is not a substitute for Delta MERGE INTO on a Lakehouse table.

Checkpoints: boring, critical, and often broken by accident

Spark Structured Streaming checkpointing is one of those things everybody “knows” until a restart fails and nobody remembers how it works.

Checkpoint locations track stream progress. They are state, not decoration. They are tied to your query plan, and when you change schema or query structure, old checkpoint state may no longer be valid.

This is not an edge case. It is normal lifecycle behavior in evolving pipelines.

Three rules keep you out of trouble:

Use distinct checkpoint paths per stream and per target table.
Version checkpoint paths when query shape or schema changes.
Watch lag between source offsets and committed checkpoint progress.

If you use one checkpoint path for multiple sinks, you are building future pain on purpose. If you change query shape without checkpoint versioning, restart failures are only a matter of timing.

Treat checkpoint migration as a cutover process. Track where old progress stopped, cut to a new checkpoint path intentionally, then retire the previous one once the new job is stable.

The small files problem is not glamorous, but it will hurt you

Most CDC pipelines do not fail dramatically. They fail by becoming slower each week until everyone pretends that 90 seconds is “pretty fast.”

Small files are often the culprit.

CDC streams produce frequent, small increments. Structured Streaming writes micro-batches. Direct lakehouse writes can also produce many tiny files depending on event cadence. Over time, table reads pay the cost in file listing and metadata overhead.

People love to ignore this because compaction feels like janitorial work. It is not. It is core performance engineering.

What works in practice:

Repartition before write based on available Spark cores.
Partition on-disk by ingestion date, and only add other partition keys when query patterns justify it.
Do not partition by operation type. That creates tiny partitions and extra noise.
Run regular OPTIMIZE jobs on high-volume CDC tables.

If you are writing through Spark, control file behavior with repartitioning and trigger cadence. A trigger(processingTime='30 seconds') or trigger(processingTime='2 minutes') can reduce file explosion compared with ultra-frequent batches.

If you are using direct Eventstreams-to-Lakehouse writes, accept that you are trading simplicity for less control and schedule compaction accordingly.

The exact maintenance workflow matters less than having one. One-off cleanup is fine when you are exploring, but scheduled maintenance is what keeps tables healthy over time.

Deletes: decide your philosophy before compliance decides for you

In CDC, inserts and updates are easy to reason about. Deletes are where architecture gets emotional.

For analytics, soft deletes are often the sane default: keep the row, mark is_deleted, add deleted_at, preserve history. This keeps downstream trend analysis and audit trails intact.

Hard deletes are different. If compliance requires physical removal, handle that intentionally, usually with batch logic that applies delete events against target Delta tables after landing.

A reliable pattern is:

Stream all CDC events, including deletes, into staging.
Run scheduled jobs that apply physical deletion rules to curated tables.

That keeps streaming simple and pushes irreversible operations into auditable, controllable execution windows.

Could you do something fancier? Probably. Should you, before you need to? Probably not.

Monitoring: minimum viable or maximum regret

A CDC pipeline with no alerting is just a suspense novel written in production.

Your baseline should cover four things:

Stream health: is each Structured Streaming query active or terminated?
Processing lag: how far are committed offsets behind source offsets?
File accumulation: are table file counts growing faster than compaction can handle?
Source silence: are you receiving events at all from CDC sources?

That last one matters because “no errors” does not mean “healthy.” If CDC gets disabled during maintenance, your pipeline can fail by producing nothing, which looks calm unless you explicitly monitor for inactivity windows.

Fabric Activator-based alerts can be useful for surfacing threshold breaches. Tie thresholds to actual SLAs, not vibes.

A practical starting playbook

If you are standing this up now, keep it simple:

Enable CDC at the source (sys.sp_cdc_enable_db and sys.sp_cdc_enable_table where applicable).
Validate flow end to end with one real table before scaling breadth.
Segment tables early: simple merge logic in Eventstreams SQL, complex logic in Spark.
Define checkpoint path standards before the first production deploy.
Pick trigger intervals that balance latency with file quality.
Schedule OPTIMIZE from day one, not after performance complaints.
Document merge ownership per table so changes do not become archaeology.

None of this is exotic. That is exactly the point.

Good CDC architecture is usually not a story about cleverness. It is a story about disciplined boring decisions made early, then repeated consistently.

Final take

Fabric Eventstreams plus Spark can give teams a cleaner CDC path than the old connector-plus-consumer patchwork. Native CDC connectors can reduce integration grind. But I would still keep meaningful CDC interpretation and merge behavior in the target compute engine whenever possible. Spark Structured Streaming remains a practical choice for controlled writes and advanced merge behavior.

But the real success criteria are operational.

If you manage checkpoints like real state, control file growth before it controls you, choose a deliberate delete strategy, and wire up monitoring that catches silence as well as failure, this architecture can work well in production.

If you skip those details, it still works right up until the exact moment it doesn’t, which usually happens late, loud, and at the least convenient hour in human history.

That is less a Fabric problem than a production engineering problem. Fabric can simplify parts of the workflow, but it does not remove the need for operational discipline.

This post was written with help from anthropic/claude-opus-4-6

The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

There’s a connector that ships with every Fabric Spark runtime. It’s pre-installed. It requires no setup. And it lets your Spark notebooks read from—and write to—Fabric Data Warehouse tables as naturally as they read Delta tables from a Lakehouse.

Most Fabric Spark users don’t know it exists. The ones who do often run into the same three or four surprises. Let’s fix both problems.

What the connector actually is

The Spark connector for Fabric Data Warehouse (synapsesql) is a built-in extension to the Spark DataFrame API. It uses the TDS protocol to talk directly to the SQL engine behind your Warehouse (or the SQL analytics endpoint of a Lakehouse). You get read and write access to Warehouse tables from PySpark, Scala Spark, or Spark SQL.

One line of code to read:

from com.microsoft.spark.fabric.Constants import Constants  df = spark.read.synapsesql("my_warehouse.dbo.sales_fact")

One line to write:

df.write.mode("append").synapsesql("my_warehouse.dbo.sales_fact")

No connection strings. No passwords. No JDBC driver management. Authentication flows through Microsoft Entra—same identity you’re logged into your Fabric workspace with. The connector resolves the SQL endpoint automatically based on workspace context.

That’s the happy path. Now let’s talk about what actually happens when you use it.

Reading: the part that mostly just works

Reading from a Warehouse table into a Spark DataFrame is the connector’s strength. The synapsesql() call supports the full three-part naming convention: warehouse_name.schema_name.table_or_view_name. It works for tables and views, including views with joins across schemas.

A few things that are genuinely useful:

Predicate pushdown works. When you chain .filter() or .limit() onto your DataFrame, the connector pushes those constraints to the SQL engine. You’re not pulling the full table into Spark memory and then filtering—the SQL engine handles the filter and sends back the subset. This matters when your Warehouse tables have hundreds of millions of rows and you only need a time-sliced sample.

df = spark.read.synapsesql("my_warehouse.dbo.sales_fact") \     .filter("order_date >= '2026-01-01'") \     .select("order_id", "customer_id", "amount")

Cross-workspace reads work. If your Warehouse lives in a different workspace than your notebook’s attached Lakehouse, you pass the workspace ID:

df = spark.read \     .option(Constants.WorkspaceId, "<target-workspace-id>") \     .option(Constants.DatawarehouseId, "<warehouse-item-id>") \     .synapsesql("my_warehouse.dbo.sales_fact")

This is genuinely powerful for hub-and-spoke architectures where your curated Warehouse sits in a production workspace and your data science notebooks live in a sandbox workspace.

Parallel reads are available. For large tables, you can partition the read across multiple Spark tasks, similar to spark.read.jdbc:

df = spark.read \     .option("partitionColumn", "order_id") \     .option("lowerBound", 1) \     .option("upperBound", 10000000) \     .option("numPartitions", 8) \     .synapsesql("my_warehouse.dbo.sales_fact")

This splits the query into eight parallel reads, each fetching a range of order_id. Without this, you get a single-threaded read that will bottleneck on large tables.

Security models pass through. If your Warehouse has object-level security (OLS), row-level security (RLS), or column-level security (CLS), those policies are enforced when Spark reads the data. Your notebook sees exactly what your identity is authorized to see. This is a meaningful difference from reading Delta files directly via OneLake, where security operates at the workspace or folder level.

Custom T-SQL queries work too. You’re not limited to reading tables—you can pass arbitrary T-SQL:

df = spark.read \     .option(Constants.DatabaseName, "my_warehouse") \     .synapsesql("SELECT TOP 1000 * FROM dbo.sales_fact WHERE region = 'WEST'")

This is handy for complex aggregations or when you want the SQL engine to do the heavy lifting before data enters Spark.

Writing: the part with surprises

Write support for the Spark-to-Warehouse connector became generally available with Runtime 1.3. It works, and it solves a real architectural problem—but it has mechanics you need to understand.

How writes actually work under the hood

The connector uses a two-phase process:

Stage: Spark writes your DataFrame to intermediate Parquet files in a staging location.
Load: The connector issues a COPY INTO command, telling the Warehouse SQL engine to ingest from the staged files.

This is the same COPY INTO pattern that powers bulk ingestion into Fabric Data Warehouse generally. It’s optimized for throughput. It is not optimized for latency on small writes.

If you’re writing a DataFrame with 50 rows, the overhead of staging files and issuing COPY INTO means the write takes materially longer than you’d expect. For small, frequent writes, this connector is not the right tool. Use T-SQL INSERT statements through a SQL connection instead.

For batch writes of thousands to millions of rows, the connector performs well. The COPY INTO path is what the Warehouse was designed for.

Save modes

The connector supports four save modes:

errorifexists (default): Fails if the table already exists.
ignore: Silently skips the write if the table exists.
overwrite: Drops and recreates the table with new data.
append: Adds rows to the existing table.

df.write.mode("overwrite").synapsesql("my_warehouse.dbo.daily_aggregates")

A common pattern: Spark computes daily aggregations from Lakehouse Delta tables, then writes the results to a Warehouse table that Power BI reports connect to. The Warehouse’s result set caching (now generally available as of January 2026) means subsequent Power BI refreshes hit cache instead of recomputing.

The timestamp_ntz gotcha

This is the single most common error people hit when writing to a Warehouse from Spark.

If your DataFrame contains timestamp_ntz (timestamp without time zone) columns, the write will fail. Fabric Data Warehouse expects time-zone-aware timestamps. The fix is a cast before you write:

from pyspark.sql.functions import col  for c in df.columns:     if dict(df.dtypes)[c] == "timestamp_ntz":         df = df.withColumn(c, col(c).cast("timestamp"))  df.write.mode("append").synapsesql("my_warehouse.dbo.target_table")

This is not documented prominently enough. If you see a Py4JJavaError during write that mentions type conversion, timestamps are the first thing to check.

What you can’t write to

The connector writes to Warehouse tables only. You cannot write to the SQL analytics endpoint of a Lakehouse—it’s read-only. If you try, you’ll get an error. This seems obvious but trips people up because the same synapsesql() method handles both reads from Warehouses and Lakehouse SQL endpoints.

Private Link limitations

If Private Link is enabled at the workspace level, both read and write operations through the connector are unsupported. If Private Link is enabled at the tenant level only, writes are unsupported but reads still work. This is a significant limitation for security-conscious deployments. Check your network configuration before building pipelines that depend on this connector.

Time Travel is not supported

Fabric Data Warehouse now supports Time Travel queries. However, the Spark connector does not pass through Time Travel syntax. If you need to query a table as of a specific point in time, you’ll need to use a T-SQL connection directly rather than the synapsesql() method.

When to use Warehouse vs. Lakehouse as your serving layer

This is the architectural question that the connector’s existence forces you to answer. You’ve got data in your Lakehouse. Spark has transformed it. Now where does it go?

Use Lakehouse Delta tables when:

Your consumers are other Spark notebooks or Spark-based ML pipelines.
You need schema evolution flexibility (Delta’s schema merge).
You want to use OPTIMIZE, VACUUM, and Z-ORDER for table maintenance.
Your data scientists need direct file access through OneLake APIs.

Use Warehouse tables when:

Your primary consumers are Power BI reports or T-SQL analysts.
You need the Warehouse’s result set caching for repeated query patterns.
You need fine-grained security (RLS, CLS, OLS) that passes through to all consumers.
You want to use T-SQL stored procedures, views, and MERGE statements for downstream transformations.
You need cross-database queries that join Warehouse tables with Lakehouse tables or other Warehouse tables.

Use both when:

Spark processes and stores data in the Lakehouse (bronze → silver → gold medallion layers), then the connector writes final aggregations or serving tables to the Warehouse.
The Warehouse serves as the “last mile” between your data engineering work and your business intelligence layer.

The January 2026 GA of MERGE in Fabric Data Warehouse makes the “write to Warehouse” pattern significantly more useful. You can now do incremental upserts: Spark writes a staging table, then a T-SQL MERGE reconciles it with the target. This is a common pattern in data warehousing that was previously awkward in Fabric.

A concrete pattern: Spark ETL → Warehouse serving layer

Here’s the pattern I see working well in production:

# 1. Read from Lakehouse Delta tables (Spark native) bronze = spark.read.format("delta").load("Tables/raw_orders")  # 2. Transform in Spark silver = bronze.filter(col("status") != "cancelled") \     .withColumn("order_date", col("order_ts").cast("date")) \     .withColumn("amount_usd", col("amount") * col("fx_rate"))  gold = silver.groupBy("region", "order_date") \     .agg(         count("order_id").alias("order_count"),         sum("amount_usd").alias("total_revenue")     )  # 3. Write to Warehouse for Power BI consumption gold.write.mode("overwrite").synapsesql("analytics_warehouse.dbo.daily_revenue")

The Lakehouse owns the raw and transformed data. Spark does the heavy compute. The Warehouse serves the final tables to downstream consumers with T-SQL access, caching, and fine-grained security.

The alternative—writing gold tables to the Lakehouse and having Power BI connect via the SQL analytics endpoint—also works. But the SQL analytics endpoint has a metadata sync delay after Spark writes new data. The Warehouse table is immediately consistent after the COPY INTO completes. If your reporting needs to reflect the latest pipeline run without a sync lag, the Warehouse path is more reliable.

Cross-database queries: the glue between them

Once you have data in both a Lakehouse and a Warehouse in the same workspace, you can query across them using T-SQL cross-database queries from the Warehouse:

SELECT w.customer_id, w.total_revenue, l.customer_segment FROM analytics_warehouse.dbo.daily_revenue AS w JOIN my_lakehouse.dbo.customer_dim AS l     ON w.customer_id = l.customer_id

This means your Warehouse doesn’t need to contain all the data. It can hold the curated aggregations while joining against dimension tables that live in the Lakehouse. No data movement. No duplication. The SQL engine resolves both sources through OneLake.

Performance notes from the field

A few observations from real workloads:

Reads are faster than you expect. The TDS protocol connection to the Warehouse SQL engine is efficient. For typical analytical queries returning thousands to low millions of rows, the synapsesql() read is competitive with reading Delta files directly, especially when the Warehouse has statistics and result set caching enabled.

Writes are slower than Lakehouse writes. The two-phase staging + COPY INTO process adds overhead versus a direct df.write.format("delta").save() to Lakehouse tables. For a DataFrame with 10 million rows, expect the Warehouse write to take 2-5x longer than an equivalent Lakehouse Delta write. This is the tradeoff for getting immediate T-SQL access with full Warehouse capabilities.

Use parallel reads for large tables. The default single-partition read will bottleneck. Set numPartitions to match your Spark cluster’s available cores for large reads. The performance improvement is often 4-8x.

Proactive and incremental statistics refresh. As of January 2026, Fabric Data Warehouse supports proactive statistics refresh and incremental statistics. This means the query optimizer keeps statistics up to date automatically. Your synapsesql() reads benefit from better query plans without manual UPDATE STATISTICS calls.

The honest summary

The Spark connector for Fabric Data Warehouse is a well-designed bridge between two systems that many teams use side by side. It makes the read path simple and the write path possible without leaving your Spark notebook.

It is not a replacement for writing to Lakehouse Delta tables. It is an additional output path for when your downstream consumers need T-SQL, fine-grained security, result set caching, or immediate consistency. Use it when the Warehouse is the right serving layer. Don’t use it when Lakehouse is sufficient.

The biggest wins come from combining both: Spark for compute, Lakehouse for storage, Warehouse for serving. The connector is the plumbing that makes that architecture work without data pipelines in between.

If you’re heading to FabCon Atlanta (March 16-20, 2026), both the Data Warehouse and Data Engineering teams will be there. It’s a good place to pressure-test your architecture and see what’s coming next.

This post was written with help from anthropic/claude-opus-4-6

Lakehouse Table Optimization: VACUUM, OPTIMIZE, and Z-ORDER

If your Lakehouse tables are getting slower (or more expensive) over time, it’s often not “Spark is slow.” It’s usually table layout drift: too many small files, suboptimal clustering, and old files piling up.

In Fabric Lakehouse, the three table-maintenance levers you’ll reach for most are:

OPTIMIZE: compacts many small files into fewer, larger files (and can apply clustering)
Z-ORDER: co-locates related values to improve data skipping for common filters
VACUUM: deletes old files that are no longer referenced by the Delta transaction log (after a retention window)

Practical note: in Fabric, run these as Spark SQL in a notebook or Spark job definition (or use the Lakehouse maintenance UI). Don’t try to run them in the SQL Analytics Endpoint.

1) Start with the symptom: “small files” vs “bad clustering”

Before you reach for maintenance, quickly sanity-check what you’re fighting:

Many small files → queries spend time opening/reading lots of tiny Parquet files.
Poor clustering for your most common predicates (date, tenantId, customerId, region, etc.) → queries scan more data than they need.
Heavy UPDATE/DELETE/MERGE patterns → lots of new files + tombstones + time travel files.

If you only have small files, OPTIMIZE is usually your first win.

2) OPTIMIZE: bin-packing for fewer, bigger files

Basic compaction

OPTIMIZE my_table;

Target a subset (example: recent partitions)

OPTIMIZE my_table WHERE date >= date_sub(current_date(), 7);

A useful mental model: OPTIMIZE is rewriting file layout (not changing table results). It’s maintenance, not transformation.

3) Z-ORDER: make your filters cheaper

Z-Ordering is for the case where you frequently query:

WHERE tenantId = ...
WHERE customerId = ...
WHERE deviceId = ... AND eventTime BETWEEN ...

Example:

OPTIMIZE my_table ZORDER BY (tenantId, eventDate);

Pick 1–3 columns that dominate your interactive workloads. If you try to z-order on everything, you’ll mostly burn compute for little benefit.

4) VACUUM: clean up old, unreferenced files (carefully)

VACUUM is about storage hygiene. Delta keeps old files around to support time travel and concurrent readers. VACUUM deletes files that are no longer referenced and older than the configured retention threshold.

VACUUM my_table;

Two practical rules:

Don’t VACUUM aggressively unless you understand the impact on time travel / rollback.
Treat the retention window as a governance decision (what rollback window do you want?) not just a cost optimization.

5) Fabric-specific gotchas (the ones that actually bite)

Where you can run these commands

These are Spark SQL maintenance commands. In Fabric, that means notebooks / Spark job definitions (or the Lakehouse maintenance UI), not the SQL Analytics Endpoint.

V-Order and OPTIMIZE

Fabric also has V-Order, which is a Parquet layout optimization aimed at faster reads across Fabric engines. If you’re primarily optimizing for downstream read performance (Power BI/SQL/Spark), it’s worth understanding whether V-Order is enabled for your workspace and table writes.

A lightweight maintenance pattern that scales

Nightly/weekly: OPTIMIZE high-value tables (or recent partitions)
Weekly/monthly: Z-ORDER tables with stable query patterns
Monthly: VACUUM tables where your org’s time travel policy is clear

Treat it like index maintenance: regular, boring, measurable.

References

This post was written with help from ChatGPT 5.2

OneLake Shortcuts + Spark: Practical Patterns for a Single Virtual Lakehouse

If you’ve adopted Microsoft Fabric, there’s a good chance you’re trying to reduce the number of ‘copies’ of data that exist just so different teams and engines can access it.

OneLake shortcuts are one of the core primitives Fabric provides to unify data across domains, clouds, and accounts by making OneLake a single virtual data lake namespace.

For Spark users specifically, the big win is that shortcuts appear as folders in OneLake—so Spark can read them like any other folder—and Delta-format shortcuts in the Lakehouse Tables area can be surfaced as tables.

What a OneLake shortcut is (and isn’t)

A shortcut is an object in OneLake that points to another storage location (internal or external to OneLake).

Shortcuts appear as folders and behave like symbolic links: deleting a shortcut doesn’t delete the target, but moving/renaming/deleting the target can break the shortcut.

From an engineering standpoint, that means you should treat shortcuts as a namespace mapping layer—not as a durability mechanism.

Where you can create shortcuts: Lakehouse Tables vs Files

In a Lakehouse, you create shortcuts either under the top-level Tables folder or anywhere under the Files folder.

Tables has constraints: OneLake doesn’t support shortcuts in subdirectories of the Tables folder, and shortcuts in Tables are typically meant for targets that conform to the Delta table format.

Files is flexible: there are no restrictions on where you can create shortcuts in the Files hierarchy, and table discovery does not happen there.

If a shortcut in the Tables area points to Delta-format data, the lakehouse can synchronize metadata and recognize the folder as a table.

One documented gotcha: the Delta format doesn’t support table names with space characters, and OneLake won’t recognize any shortcut containing a space in the name as a Delta table.

How Spark reads from shortcuts

In notebooks and Spark jobs, shortcuts appear as folders in OneLake, and Spark can read them like any other folder.

For table-shaped data, Fabric automatically recognizes shortcuts in the Tables section of the lakehouse that have Delta/Parquet data as tables—so you can reference them directly from Spark.

Microsoft Learn also notes you can use relative file paths to read data directly from shortcuts, and Delta shortcuts in Tables can be read via Spark SQL syntax.

Practical patterns (what I recommend in real projects)

Pattern 1: Use Tables shortcuts for shared Delta tables you want to show up consistently across Fabric engines (Spark + SQL + Direct Lake scenarios via semantic models reading from shortcuts).

Pattern 2: Use Files shortcuts when you need arbitrary formats or hierarchical layouts (CSV/JSON/images, nested partitions, etc.) and you’re fine treating it as file access.

Pattern 3: Prefer shortcuts over copying/staging when your primary goal is to eliminate edge copies and reduce latency from data duplication workflows.

Pattern 4: When you’re operationalizing Spark notebooks, make the access path explicit and stable by using the shortcut path (the place it appears) rather than hard-coding a target path that might change.

Operational gotchas and guardrails

Because moving/renaming/deleting a target path can break a shortcut, add lightweight monitoring for “broken shortcut” failures in your pipelines (and treat them like dependency failures).

For debugging, the lakehouse UI can show the ABFS path or URL for a shortcut in its Properties pane, which you can copy for inspection or troubleshooting.

Outside of Fabric, services can access OneLake through the OneLake API, which supports a subset of ADLS Gen2 and Blob storage APIs.

Summary

Shortcuts give Spark a clean way to treat OneLake like a unified namespace: read shortcuts as folders, surface Delta/Parquet data in Tables as tables, and keep your project’s logical paths stable even when physical storage locations vary.

References

Unify data sources with OneLake shortcuts (Microsoft Learn)
Access OneLake shortcuts in an Apache Spark notebook (Microsoft Learn)
OneLake access and APIs (Microsoft Learn)

This post was written with help from ChatGPT 5.2

Why Spark teams should care about a Dataflow feature

What Recent data changes under pressure

What this feature doesn’t rescue you from

Preview discipline: run this like a survival checklist

The operating model for Spark leads

Share this:

What Eventstreams SQL actually fixes

The merge logic decision you cannot avoid

Checkpoints: boring, critical, and often broken by accident

The small files problem is not glamorous, but it will hurt you

Deletes: decide your philosophy before compliance decides for you

Monitoring: minimum viable or maximum regret

A practical starting playbook

Final take

Share this:

The Spark-to-Warehouse Connector in Fabric: What It Does, How It Breaks, and When to Use It

What the connector actually is

Reading: the part that mostly just works

Writing: the part with surprises

How writes actually work under the hood

Save modes

The timestamp_ntz gotcha

What you can’t write to

Private Link limitations

Time Travel is not supported

When to use Warehouse vs. Lakehouse as your serving layer

A concrete pattern: Spark ETL → Warehouse serving layer

Cross-database queries: the glue between them

Performance notes from the field

The honest summary

Share this:

1) Start with the symptom: “small files” vs “bad clustering”

2) OPTIMIZE: bin-packing for fewer, bigger files

Basic compaction

Target a subset (example: recent partitions)

3) Z-ORDER: make your filters cheaper

4) VACUUM: clean up old, unreferenced files (carefully)

5) Fabric-specific gotchas (the ones that actually bite)

Where you can run these commands

V-Order and OPTIMIZE

A lightweight maintenance pattern that scales

References

Share this:

What a OneLake shortcut is (and isn’t)

Where you can create shortcuts: Lakehouse Tables vs Files

How Spark reads from shortcuts

Practical patterns (what I recommend in real projects)

Operational gotchas and guardrails

Summary

References

Share this: