Lakehouse Table Optimization: VACUUM, OPTIMIZE, and Z-ORDER

If your Lakehouse tables are getting slower (or more expensive) over time, it’s often not “Spark is slow.” It’s usually table layout drift: too many small files, suboptimal clustering, and old files piling up.

In Fabric Lakehouse, the three table-maintenance levers you’ll reach for most are:

  • OPTIMIZE: compacts many small files into fewer, larger files (and can apply clustering)
  • Z-ORDER: co-locates related values to improve data skipping for common filters
  • VACUUM: deletes old files that are no longer referenced by the Delta transaction log (after a retention window)

Practical note: in Fabric, run these as Spark SQL in a notebook or Spark job definition (or use the Lakehouse maintenance UI). Don’t try to run them in the SQL Analytics Endpoint.

1) Start with the symptom: “small files” vs “bad clustering”

Before you reach for maintenance, quickly sanity-check what you’re fighting:

  • Many small files → queries spend time opening/reading lots of tiny Parquet files.
  • Poor clustering for your most common predicates (date, tenantId, customerId, region, etc.) → queries scan more data than they need.
  • Heavy UPDATE/DELETE/MERGE patterns → lots of new files + tombstones + time travel files.

If you only have small files, OPTIMIZE is usually your first win.

2) OPTIMIZE: bin-packing for fewer, bigger files

Basic compaction

OPTIMIZE my_table;

Target a subset (example: recent partitions)

OPTIMIZE my_table WHERE date >= date_sub(current_date(), 7);

A useful mental model: OPTIMIZE is rewriting file layout (not changing table results). It’s maintenance, not transformation.

3) Z-ORDER: make your filters cheaper

Z-Ordering is for the case where you frequently query:

  • WHERE tenantId = ...
  • WHERE customerId = ...
  • WHERE deviceId = ... AND eventTime BETWEEN ...

Example:

OPTIMIZE my_table ZORDER BY (tenantId, eventDate);

Pick 1–3 columns that dominate your interactive workloads. If you try to z-order on everything, you’ll mostly burn compute for little benefit.

4) VACUUM: clean up old, unreferenced files (carefully)

VACUUM is about storage hygiene. Delta keeps old files around to support time travel and concurrent readers. VACUUM deletes files that are no longer referenced and older than the configured retention threshold.

VACUUM my_table;

Two practical rules:

  1. Don’t VACUUM aggressively unless you understand the impact on time travel / rollback.
  2. Treat the retention window as a governance decision (what rollback window do you want?) not just a cost optimization.

5) Fabric-specific gotchas (the ones that actually bite)

Where you can run these commands

These are Spark SQL maintenance commands. In Fabric, that means notebooks / Spark job definitions (or the Lakehouse maintenance UI), not the SQL Analytics Endpoint.

V-Order and OPTIMIZE

Fabric also has V-Order, which is a Parquet layout optimization aimed at faster reads across Fabric engines. If you’re primarily optimizing for downstream read performance (Power BI/SQL/Spark), it’s worth understanding whether V-Order is enabled for your workspace and table writes.

A lightweight maintenance pattern that scales

  • Nightly/weekly: OPTIMIZE high-value tables (or recent partitions)
  • Weekly/monthly: Z-ORDER tables with stable query patterns
  • Monthly: VACUUM tables where your org’s time travel policy is clear

Treat it like index maintenance: regular, boring, measurable.

References

This post was written with help from ChatGPT 5.2

Fabric Spark Shuffle Tuning: AQE + partitions for Faster Joins

Shuffles are where Spark jobs go to get expensive: a wide join or aggregation forces data to move across the network, materialize shuffle files, and often spill when memory pressure spikes.

In Microsoft Fabric Spark workloads, the fastest optimization is usually the boring one: avoid the shuffle when you can, and when you can’t, make it smaller and better balanced.

This post lays out a practical, repeatable approach you can apply in Fabric notebooks and Spark job definitions.

1) Start with the simplest win: avoid the shuffle

If one side of your join is genuinely small (think lookup/dimension tables), use a broadcast join so Spark ships the small table to executors and avoids a full shuffle.

In Fabric’s Spark best practices, Microsoft explicitly calls out broadcast joins for small lookup tables as a way to avoid shuffles entirely.

Example (PySpark):

from pyspark.sql.functions import broadcast

fact = spark.read.table("fact_sales")
dim  = spark.read.table("dim_product")

# If dim_product is small enough, broadcast it
joined = fact.join(broadcast(dim), on="product_id", how="left")

If you can’t broadcast safely, move to the next lever.

2) Make the shuffle less painful: tune shuffle parallelism

Spark controls the number of shuffle partitions for joins and aggregations with spark.sql.shuffle.partitions (default: 200 in Spark SQL).

  • Too few partitions → huge partitions → long tasks, spills, and stragglers.
  • Too many partitions → tiny tasks → scheduling overhead, excess shuffle metadata, and unnecessary overhead.

Example (session-level setting):

spark.conf.set("spark.sql.shuffle.partitions", "400")

A decent heuristic is to start with something proportional to total executor cores and then iterate using the Spark UI (watch stage task durations, shuffle read/write sizes, and spill metrics).

3) Let Spark fix itself (when it can): enable AQE

Adaptive Query Execution (AQE) uses runtime statistics to optimize a query as it runs.

Fabric’s Spark best practices recommend enabling AQE to dynamically optimize shuffle partitions and handle skewed data automatically.

AQE is particularly helpful when:

  • Your input data distribution changes day-to-day
  • A static spark.sql.shuffle.partitions value is right for some workloads but wrong for others
  • You hit skew where a small number of partitions do most of the work

Example:

spark.conf.set("spark.sql.adaptive.enabled", "true")

4) Diagnose like you mean it: what to look for in Spark UI

When a job is slow, treat it like a shuffle problem until proven otherwise.

Look for:

  • Stages where a handful of tasks take dramatically longer than the median (classic skew)
  • Large shuffle read/write sizes concentrated in a small number of partitions
  • Spill (memory → disk) spikes during joins/aggregations

When you see skew, your options are usually:

  • Broadcast (if feasible)
  • Repartition on a better key
  • Salt hot keys (advanced)
  • Enable AQE and confirm it’s actually taking effect

A minimal checklist for Fabric Spark teams

  1. Use DataFrame APIs (keep Catalyst in play).
  2. Broadcast small lookup tables to avoid shuffles.
  3. Set a sane baseline for spark.sql.shuffle.partitions.
  4. Enable AQE and validate in the query plan / UI.
  5. Iterate with the Spark UI: measure, change one thing, re-measure.

References

This post was written with help from ChatGPT 5.2