Running OpenClaw in Production: Reliability, Alerts, and Runbooks That Actually Work

Agents are fun when they’re clever. They’re useful when they’re boring.

If you’re running OpenClaw as an always-on assistant (cron jobs, health checks, publishing pipelines, internal dashboards), the failure mode isn’t usually “it breaks once.” It’s “it flakes intermittently and you can’t tell whether the problem is upstream, your network, your config, or the agent.”

This post is the operational playbook that moved my setup from “cool demo” to “production-ish”: fewer false alarms, faster debugging, clearer artifacts, and tighter cost control.

The production baseline (don’t skip this)

Before you add features, lock down the boring stuff:

  • One source of truth for cron/job definitions.
  • A consistent deliverables folder (so outputs don’t vanish into chat history).
  • A minimal runbook per job (purpose, dependencies, failure modes, disable/rollback).

Observability: prove what happened

When something fails, you want receipts — not vibes.

Minimum viable run-level observability (sketched as a log record after this list):

  • job_name, job_id, run_id
  • start/end timestamp (with timezone)
  • what the job tried to do (high level)
  • what it produced (file paths, URLs)
  • what it depended on (network/API/tool)
  • the error and the evidence (HTTP status, latency, exception type)
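
Concretely, each run can boil down to one structured record. A minimal Python sketch, assuming a local JSONL file; the field values, paths, and the run_record helper are illustrative, not an OpenClaw API:

    import json
    import uuid
    from datetime import datetime, timezone

    def run_record(job_name, started_at, action, produced, depends_on,
                   error=None, evidence=None):
        """One structured record per run, so failures come with receipts."""
        return {
            "job_name": job_name,
            "job_id": f"{job_name}@daily",              # however you key job definitions
            "run_id": uuid.uuid4().hex,
            "started_at": started_at.isoformat(),       # keep the timezone
            "ended_at": datetime.now(timezone.utc).isoformat(),
            "action": action,                           # what the job tried to do, high level
            "produced": produced,                       # file paths / URLs
            "depends_on": depends_on,                   # network / API / tool dependencies
            "error": error,                             # None on success
            "evidence": evidence or {},                 # HTTP status, latency, exception type
        }

    # One JSONL line per run is enough to reconstruct what happened later.
    with open("runs.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(run_record(
            "health-rollup",
            started_at=datetime.now(timezone.utc),
            action="probe telegram and publish the rollup",
            produced=["deliverables/health-rollup/output.md"],
            depends_on=["api.telegram.org"],
        )) + "\n")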

Split latency: upstream vs internal

If Telegram is “slow,” is that Telegram API RTT/network jitter, internal queueing, or a slow tool call? Instrument enough to separate those — otherwise you’ll waste hours fixing the wrong layer.
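
A sketch of that split: time the upstream call and your own processing with separate clocks, so the log shows which side of the wire ate the seconds. The URL and the requests call are just an illustrative probe:

    import time
    import requests  # any HTTP client works; used here only for illustration

    def timed_probe(url: str, timeout: float = 10.0) -> dict:
        """Measure the upstream round-trip separately from local processing."""
        t0 = time.monotonic()
        resp = requests.get(url, timeout=timeout)
        upstream_s = time.monotonic() - t0       # network + remote API latency

        t1 = time.monotonic()
        size = len(resp.content)                 # stand-in for parsing / tool calls / queueing
        internal_s = time.monotonic() - t1       # time spent on your side of the wire

        return {"status": resp.status_code, "bytes": size,
                "upstream_s": round(upstream_s, 3), "internal_s": round(internal_s, 3)}

    print(timed_probe("https://api.telegram.org/"))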

Alert-only health checks (silence is success)

If a health check is healthy 99.9% of the time, it should not message you 99.9% of the time. A well-behaved check:

  • prints NO_REPLY when healthy
  • emits one high-signal alert line when broken
  • includes evidence (what failed, how, and where to look)

Example alert shape:

⚠️ health-rollup: telegram_rtt_p95=3.2s (threshold=2.0s) curl=https://api.telegram.org/ ts=2026-02-10T03:12:00-08:00
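
Wired together, a check like that stays silent when healthy and prints one evidence-bearing line when it isn’t. A minimal sketch, assuming your scheduler suppresses NO_REPLY output; the threshold, sample count, and probe URL are placeholders:

    import time
    import requests
    from datetime import datetime, timezone

    URL = "https://api.telegram.org/"
    THRESHOLD_S = 2.0   # round-trip budget; tune per channel
    SAMPLES = 5

    def rtt(url: str) -> float:
        t0 = time.monotonic()
        requests.get(url, timeout=10)
        return time.monotonic() - t0

    worst = max(rtt(URL) for _ in range(SAMPLES))  # with a few samples, the worst stands in for p95

    if worst <= THRESHOLD_S:
        print("NO_REPLY")  # healthy: nothing for a human to read
    else:
        ts = datetime.now(timezone.utc).astimezone().isoformat(timespec="seconds")
        print(f"⚠️ health-rollup: telegram_rtt_p95={worst:.1f}s "
              f"(threshold={THRESHOLD_S:.1f}s) curl={URL} ts={ts}")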

Cron hygiene: stop self-inflicted outages

  • Idempotency: re-runs don’t duplicate deliverables.
  • Concurrency control: don’t let overlapping runs pile up (see the lockfile sketch after this list).
  • Deterministic first phase: validate dependencies before doing expensive work.
  • Deadman checks: alert if a job hasn’t run (or hasn’t delivered) in N hours.
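
For the concurrency bullet, a minimal lockfile sketch (POSIX flock; the paths and job name are made up): a second copy of the job exits instead of piling up, and the heartbeat file gives a deadman check a timestamp to compare against N hours.

    import fcntl
    import sys
    from pathlib import Path

    LOCK = Path("/tmp/weekly-digest.lock")          # one lock per job definition
    HEARTBEAT = Path("/tmp/weekly-digest.last_ok")  # deadman checks read this file's mtime

    lock_file = LOCK.open("w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(0)  # a previous run still holds the lock; don't pile up

    # ... validate dependencies first, then do the expensive work ...

    HEARTBEAT.touch()  # a separate deadman job alerts if this is older than N hours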

Evidence-based alerts: pages should come with receipts

A useful alert answers: (1) what failed, (2) where is the evidence (log path / file path / URL), and (3) what’s the next action. Anything else is notification spam.
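
One way to enforce that is to refuse to build an alert unless all three parts are present. A small sketch with hypothetical message contents:

    def alert(what_failed: str, evidence: str, next_action: str) -> str:
        """An alert is only worth sending if it answers all three questions."""
        for name, value in [("what_failed", what_failed),
                            ("evidence", evidence),
                            ("next_action", next_action)]:
            if not value.strip():
                raise ValueError(f"refusing to page without {name}")
        return f"⚠️ {what_failed} | evidence: {evidence} | next: {next_action}"

    print(alert(
        "weekly-digest: publish step returned HTTP 502",
        "runs.jsonl (latest run_id); deliverables/weekly-digest/output.md missing",
        "re-run the publish step; if 502 persists, check the upstream status page",
    ))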

Cost visibility: make it measurable

  • batch work; avoid polling
  • cap retries
  • route routine work to cheaper models
  • log model selection per run
  • track token usage from local transcripts, not just the “current session model” (a parsing sketch follows this list)
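
For the transcript-based tracking, a sketch that assumes transcripts are local JSONL files whose entries carry a model name and a token count; the folder and field names are assumptions about your setup, not OpenClaw’s documented schema:

    import json
    from collections import Counter
    from pathlib import Path

    totals = Counter()
    for path in Path("transcripts").glob("*.jsonl"):
        for line in path.read_text(encoding="utf-8").splitlines():
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            model = entry.get("model")
            tokens = (entry.get("usage") or {}).get("total_tokens", 0)
            if model and tokens:
                totals[model] += tokens

    # Per-model token totals across every logged run, not just the live session.
    for model, tokens in totals.most_common():
        print(f"{model}: {tokens} tokens")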

Deliverables: put outputs somewhere that syncs

Chat is not a file system. Every meaningful workflow should write artifacts to a synced folder (e.g., OneDrive): primary output, supporting evidence, and run metadata.
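
A sketch of that habit, with a made-up folder layout under a synced root: the primary output and a small metadata file land together in one dated folder per run.

    import json
    from datetime import datetime, timezone
    from pathlib import Path

    SYNC_ROOT = Path.home() / "OneDrive" / "deliverables"  # any synced folder works

    def write_deliverable(job_name: str, body: str, evidence: dict) -> Path:
        """Primary output plus run metadata in one dated folder, outside chat history."""
        run_dir = SYNC_ROOT / job_name / datetime.now(timezone.utc).strftime("%Y-%m-%d")
        run_dir.mkdir(parents=True, exist_ok=True)
        (run_dir / "output.md").write_text(body, encoding="utf-8")
        (run_dir / "run.json").write_text(json.dumps({
            "job_name": job_name,
            "written_at": datetime.now(timezone.utc).isoformat(),
            "evidence": evidence,  # URLs, statuses, source paths
        }, indent=2), encoding="utf-8")
        return run_dir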

Secure-by-default: treat inputs as hostile

  • Separate read (summarize) from act (send/delete/post).
  • Require explicit confirmation for destructive/external actions.
  • Prefer allowlists over arbitrary shell (sketched after this list).
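
A sketch of the allowlist and the read/act split; the command sets and the confirmation flag are placeholders for however your setup asks the human:

    import shlex
    import subprocess

    READ_ONLY = {"ls", "cat", "grep", "head"}  # summarize / inspect only
    NEEDS_CONFIRM = {"rm", "mv", "curl"}       # destructive or external side effects

    def run_tool(command: str, confirmed: bool = False) -> str:
        argv = shlex.split(command)
        prog = argv[0] if argv else ""
        if prog in READ_ONLY:
            pass                               # read path: no confirmation needed
        elif prog in NEEDS_CONFIRM:
            if not confirmed:
                raise PermissionError(f"{prog} requires explicit confirmation")
        else:
            raise PermissionError(f"{prog} is not on the allowlist")
        return subprocess.run(argv, capture_output=True, text=True, check=True).stdout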

Runbooks: make 2am fixes boring

  • purpose
  • schedule
  • dependencies
  • what “healthy” looks like
  • what “broken” looks like
  • how to disable
  • how to recover

What we changed (the short version)

  • Consolidated multiple probes into one evidence-based rollup.
  • Converted recurring checks to alert-only.
  • Standardized artifacts into a synced deliverables folder.
  • Added a lightweight incident runbook.
  • Put internal dashboards behind Tailscale on separate ports.

This post was written with help from ChatGPT 5.2