
Your Delta tables are drowning. Thousands of tiny Parquet files pile up after every streaming microbatch. Power BI dashboards stall on cold-cache queries. SQL analytics endpoints grind through fragmented row groups. And somewhere in the middle of the medallion architecture, a Spark job is rewriting perfectly good files because nobody told it they were already compacted.
This is the small-file problem at scale — and in Microsoft Fabric, where a single Delta table can serve Spark, SQL analytics endpoint, Power BI Direct Lake, and Warehouse simultaneously, it becomes a cross-workload survival situation. Microsoft recently published a comprehensive cross-workload table maintenance guide that provides a clear map out. Here’s how to use it.
Every Engine Wants Something Different
The core challenge is that each consumption engine has a different idea of what an “optimally sized” file looks like. Get this wrong and you optimize for one consumer while punishing another.
Here’s the terrain:
- Spark reads efficiently across a wide range — 128 MB to 1 GB depending on table size. V-Order isn’t required and adds 15–33% write overhead. Spark cares about parallelism, not VertiPaq encoding.
- SQL analytics endpoint and Warehouse want files around 400 MB with roughly 2 million rows per row group, plus V-Order enabled for an approximate 10% read improvement.
- Power BI Direct Lake is the most demanding consumer. It needs V-Order (delivering 40–60% cold-cache improvement), row groups of 8 million+ rows, and minimal file count to reduce transcoding overhead.
If you serve all three from the same Gold table, you need to make deliberate tradeoffs — or maintain multiple copies optimized for different patterns. Storage is cheap relative to compute. Compute wasted on bad file layouts is not.
The Three Commands That Keep You Alive
Table maintenance in Fabric boils down to three operations: OPTIMIZE, VACUUM, and the configuration pair of auto-compaction and optimize write. Each one addresses a different failure mode.
OPTIMIZE: Bin Compaction
OPTIMIZE consolidates small files into larger ones. It is your primary weapon against file fragmentation:
-- Basic compaction
OPTIMIZE schema_name.table_name
-- With V-Order for Power BI consumers
OPTIMIZE schema_name.table_name VORDER
-- With Z-Order for selective filter queries
OPTIMIZE schema_name.table_name ZORDER BY (region, event_date)
A critical detail: OPTIMIZE is a Spark SQL command. It runs in notebooks, Spark job definitions, and the Lakehouse Maintenance UI. You cannot run it from the SQL analytics endpoint or Warehouse SQL editor.
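From a notebook, a minimal sketch of running it and checking what it did (the table name is a placeholder, and the metrics columns assume the standard Delta Lake OPTIMIZE output schema):
# Run compaction from a Spark notebook cell and capture the returned metrics
result = spark.sql("OPTIMIZE schema_name.table_name VORDER")
# Each optimized path reports a metrics struct, including files added and removed
result.select("metrics.numFilesAdded", "metrics.numFilesRemoved").show()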
Before you optimize blindly, use the dry-run option to assess scope:
OPTIMIZE schema_name.table_name DRY RUN
This returns the files eligible for rewriting without touching the table — essential for estimating cost before committing compute.
VACUUM: Dead File Cleanup
After OPTIMIZE rewrites files, the old versions remain on disk for time travel. VACUUM removes files the Delta log no longer references:
-- Default 7-day retention
VACUUM schema_name.table_name
-- Explicit retention
VACUUM schema_name.table_name RETAIN 168 HOURS
The default seven-day retention exists for good reason: concurrent readers and writers may still reference those files. Drop below seven days and you risk reader failures or table corruption. If you must shorten retention, set spark.databricks.delta.retentionDurationCheck.enabled to false — but think carefully before you do.
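If you do shorten retention, the safety check has to be disabled explicitly. A minimal sketch, assuming you have confirmed that no concurrent reader or time-travel query needs the versions you are about to purge:
# Loosen the retention safety check only after confirming nothing reads the older versions
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
# Then vacuum with a shortened window (48 hours here, purely illustrative)
spark.sql("VACUUM schema_name.table_name RETAIN 48 HOURS")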
Auto-Compaction + Optimize Write: Prevention Over Cure
Rather than waiting for file fragmentation to become a problem, these two features prevent it during ingestion:
Optimize write performs pre-write compaction, generating fewer, larger files at write time:
spark.conf.set('spark.databricks.delta.optimizeWrite.enabled', 'true')
Auto-compaction evaluates partition health after each write and triggers synchronous compaction when fragmentation is detected:
spark.conf.set('spark.databricks.delta.autoCompact.enabled', 'true')
These two work best together. Optimize write reduces the number of small files created; auto-compaction handles whatever fragmentation remains. Microsoft’s documentation recommends auto-compaction over manually scheduled OPTIMIZE jobs for most ingestion pipelines, noting it “generally outperforms scheduled compaction jobs at maximizing read/write performance.”
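In session terms, the combination looks like this; incoming_df and the table name are placeholders for your own ingestion logic:
# Enable both features for every Delta write in this session
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
# Writes now produce fewer, larger files, and fragmented partitions are
# compacted synchronously right after the write commits
incoming_df.write.format("delta").mode("append").saveAsTable("schema_name.bronze_events")
Session settings are the quick way to try this out; the table-level properties covered at the end of this post are the durable way to keep it on.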
The Medallion Layer Checklist
The right maintenance strategy depends on where the table sits in your medallion architecture. Here is a concrete, layer-by-layer breakdown:
Bronze (Landing Zone)
- Priority: Ingestion speed
- Auto-compaction: Optional (enable it unless raw ingestion speed is the only priority)
- Optimize write: Enable
- V-Order: No (unnecessary write overhead)
- Liquid Clustering: No
- Scheduled OPTIMIZE: Optional
- Rule: Never serve Bronze tables directly to SQL analytics endpoint or Power BI Direct Lake.
Silver (Curated Zone)
- Priority: Balance ingestion and query performance
- Auto-compaction: Enable
- Optimize write: Enable
- V-Order: Optional (enable if SQL or Power BI consumers query this layer)
- Liquid Clustering or Z-Order: Recommended
- Scheduled OPTIMIZE: Aggressive
- Target file size: 128–256 MB
Gold (Serving Zone)
- Priority: Read performance for analytics
- Auto-compaction: Enable
- Optimize write: Required
- V-Order: Required for Power BI Direct Lake; beneficial for SQL
- Liquid Clustering: Required for optimal file skipping
- Scheduled OPTIMIZE: Aggressive
- Target file size: 400 MB to 1 GB
For Gold tables, tune the target based on the primary consumer:
| Consumer | V-Order | Target File Size | Row Group Size |
|---|---|---|---|
| SQL analytics endpoint | Yes | 400 MB | 2M rows |
| Power BI Direct Lake | Yes | 400 MB–1 GB | 8M+ rows |
| Spark | Optional | 128 MB–1 GB | 1–2M rows |
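The file-size targets above take effect when OPTIMIZE rewrites the table. One knob for steering them is the Delta Lake maximum file size for OPTIMIZE; the config name below comes from open-source Delta rather than Fabric-specific documentation, so treat it as an assumption to verify on your runtime:
# Cap OPTIMIZE output at roughly 400 MB per file for a SQL-endpoint-focused Gold table
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", str(400 * 1024 * 1024))
spark.sql("OPTIMIZE schema_name.gold_table VORDER")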
V-Order: Know When to Pay the Tax
V-Order applies VertiPaq-compatible sorting, encoding, and compression at write time. The performance gains for Power BI Direct Lake — 40–60% on cold-cache queries — make it indispensable for Gold-layer tables feeding dashboards. But V-Order adds 15–33% to write time and provides no inherent benefit for Spark-to-Spark pipelines.
The decision framework:
- Gold tables → Power BI or SQL consumers: V-Order on.
- Bronze/Silver tables → Spark pipelines only: V-Order off.
- Mixed consumers: Maintain separate copies — a Spark-optimized Silver table and a V-Ordered Gold table.
Starting with Fabric Runtime 1.3, V-Order is applied automatically during OPTIMIZE statements. You no longer need the deprecated spark.sql.parquet.vorder.enabled session setting. Set it at the table level for consistency:
ALTER TABLE schema_name.gold_table
SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')
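A quick check that the property took effect, using standard Spark SQL:
# List the table properties and pull out the V-Order flag
props = {r["key"]: r["value"] for r in spark.sql("SHOW TBLPROPERTIES schema_name.gold_table").collect()}
print(props.get("delta.parquet.vorder.enabled"))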
Liquid Clustering vs. Z-Order
Liquid Clustering is the recommended approach for new tables. Unlike static partitioning, it adapts to changing query patterns and provides better file skipping. Define it at table creation:
CREATE TABLE schema_name.events (
id INT,
category STRING,
event_date DATE
) CLUSTER BY (category)
Use Z-Order instead when your table is already partitioned (Liquid Clustering does not work with partitioned tables) or when queries filter on two or more columns together.
One critical gotcha: data is only clustered when OPTIMIZE runs. Regular write operations do not apply clustering. Without a compaction strategy, you get zero benefit from Liquid Clustering — the layout never materializes.
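So pair any clustered table with a compaction schedule. For a liquid-clustered table, a plain OPTIMIZE reclusters new data according to the CLUSTER BY definition; no ZORDER BY clause is needed. A minimal sketch:
# Data written since the last compaction only gets clustered when OPTIMIZE runs
spark.sql("OPTIMIZE schema_name.events")
# Optional sanity check: recent Delta versions surface clusteringColumns in DESCRIBE DETAIL
print(spark.sql("DESCRIBE DETAIL schema_name.events").collect()[0]["clusteringColumns"])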
Diagnosing Table Health
Before optimizing anything, assess where you stand:
# Table-level stats come straight from the Delta log
details = spark.sql("DESCRIBE DETAIL schema_name.table_name").collect()[0]
print(f"Table size: {details['sizeInBytes'] / (1024**3):.2f} GB")
print(f"Number of files: {details['numFiles']}")
# Average file size is the quickest fragmentation signal
avg_file_mb = (details['sizeInBytes'] / details['numFiles']) / (1024**2)
print(f"Average file size: {avg_file_mb:.2f} MB")
Healthy tables have evenly distributed file sizes within 2× of each other. Files under 25 MB signal fragmentation. Files over 2 GB reduce parallelism. Use DESCRIBE HISTORY to review write patterns and check whether auto-compaction has been running.
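For the write-pattern side, a short history query does the job; column names follow the standard Delta history schema:
# Recent operations on the table: OPTIMIZE entries, write cadence, and their metrics
history = spark.sql("DESCRIBE HISTORY schema_name.table_name LIMIT 20")
history.select("version", "timestamp", "operation", "operationParameters", "operationMetrics").show(truncate=False)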
Set It at the Table Level
A final, critical best practice: prefer table properties over session configurations. Session settings only apply to the current Spark session and disappear when the session ends. Table properties persist across sessions and ensure consistent behavior regardless of which job or notebook writes to the table:
CREATE TABLE schema_name.optimized_table (
id INT,
data STRING
) TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true',
'delta.parquet.vorder.enabled' = 'true'
)
This eliminates the class of bugs where one pipeline sets optimize write and another does not, producing an inconsistent file layout that OPTIMIZE then has to clean up.
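For tables that already exist, the same properties can be retrofitted in place; a sketch with a placeholder table name:
# Apply the same write-time behavior to an existing table
spark.sql("""
    ALTER TABLE schema_name.existing_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true',
        'delta.parquet.vorder.enabled' = 'true'
    )
""")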
The Bottom Line
Table maintenance in Fabric is not a set-it-and-forget-it operation. It is a deliberate strategy tied to your data’s lifecycle: fast ingestion at Bronze, balanced reads at Silver, and tuned-to-the-consumer performance at Gold. The tools — OPTIMIZE, VACUUM, auto-compaction, V-Order, Liquid Clustering — are all available. The question is whether you deploy them with intention.
Start by auditing your Gold tables. Check file sizes and distributions. Enable auto-compaction and optimize write at the table level. Apply V-Order where Power BI or SQL consumes the data. Schedule aggressive OPTIMIZE runs on Silver and Gold layers. And run VACUUM weekly to reclaim storage.
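If you want a starting point for that routine, a minimal scheduled-notebook sketch could look like this; the table list is a placeholder, and retention and V-Order should be tuned per layer:
# Weekly maintenance pass over the serving layer
gold_tables = ["schema_name.gold_sales", "schema_name.gold_customers"]
for table in gold_tables:
    spark.sql(f"OPTIMIZE {table} VORDER")          # compact and V-Order for Power BI / SQL consumers
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")  # reclaim storage at the default 7-day retention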
Your tables will thank you. Your dashboards will thank you faster.
—
This post was written with help from Claude Opus 4.6
