What “Recent data” in Fabric means for Spark teams when time is the real bottleneck

At 8:07 a.m., nobody on a data engineering team is debating architecture purity. You’re trying to get back to the exact source you were fixing yesterday before another downstream notebook fails and somebody asks for an ETA.

That’s the problem Microsoft Fabric’s Recent data feature targets.

The feature landed in the February 2026 Fabric update and is currently in preview. It sounds small: Dataflow Gen2 remembers the specific items you used recently — tables, files, folders, databases, and sheets — and lets you load them directly into the editing canvas. For Spark-heavy teams, though, this is less of a UX tweak and more of a way to stop bleeding time in the first mile of work.

And yes, it’s still a preview feature. Treat it like a mountain route in unstable weather: useful, fast, and not something you trust blindly.

Why Spark teams should care about a Dataflow feature

A lot of Spark teams still frame Dataflow Gen2 as somebody else’s tool. That framing is outdated.

Dataflow Gen2 automatically creates staging Lakehouse and Warehouse items in your workspace. If your team’s workflow includes Dataflow-based ingestion and Spark-based transformation, the handoff between those steps is real. It’s your daily route.

Here’s the hard lesson: if your ingestion layer touches Dataflow Gen2, then UI friction inside Dataflow is your Spark team’s problem too.

What to do about it:

  • Write down your ingestion handoffs in plain language: source to Dataflow Gen2 to staging Lakehouse/Warehouse to Spark notebooks.
  • Mark where engineers repeatedly reconnect to the same sources. That’s where Recent data pays off first.

What Recent data changes under pressure

Recent data does one thing that matters: it remembers specific assets, not just abstract connections.

When you return to a fix, you’re not restarting the expedition from base camp. You get dropped closer to the problem. You can pull the item directly into the editing canvas and keep moving.

For teams, this changes the rhythm of incident response and iteration:

  • You get back to source-level corrections faster.
  • You reduce the chance that someone reconnects to the wrong, similarly named object while moving too fast.
  • You spend less team energy on navigation and more on data correctness.

None of this is glamorous. It’s also exactly where engineering throughput gets won.

Try this: during your next defect cycle, track one metric for a week — time from “issue found” to “source query/table reopened in Dataflow Gen2.” If that number drops after using Recent data, keep leaning in. If it doesn’t, your bottleneck is elsewhere.
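A minimal sketch of how to compute that metric from a week of logged timestamps; the event pairs below are invented for illustration, and in practice you would pull them from wherever your team records incident notes:

```python
from datetime import datetime
from statistics import median

# Hypothetical log: (issue_found, source_reopened) timestamp pairs
# collected over the week, e.g. from a shared sheet or incident notes.
events = [
    (datetime(2026, 3, 2, 8, 7), datetime(2026, 3, 2, 8, 21)),
    (datetime(2026, 3, 3, 9, 15), datetime(2026, 3, 3, 9, 24)),
    (datetime(2026, 3, 5, 14, 2), datetime(2026, 3, 5, 14, 30)),
]

# Minutes from "issue found" to "source reopened in Dataflow Gen2".
minutes = [(reopened - found).total_seconds() / 60 for found, reopened in events]
print(f"median time-to-reopen: {median(minutes):.0f} min")  # prints 14 min here
```

Compare the median before and after adopting Recent data; a stable number tells you the bottleneck is elsewhere.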

What this feature doesn’t rescue you from

Teams love to over-credit new features. Recent data is a navigation accelerator. It’s not governance. It’s not validation. It’s not a replacement for naming discipline. And because it’s in preview, it’s not a foundation for critical operational assumptions.

If your source naming is chaotic, Recent data will surface chaos faster.

If your validation is weak, Recent data will help you ship mistakes sooner.

If your runbooks are vague, Recent data won’t magically teach new engineers what “correct” looks like.

Pair it with a minimum Spark validation pass after ingestion updates: schema check, null expectation, row-count sanity check. Keep this lightweight and repeatable. The point is fast feedback, not ceremony.
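A minimal sketch of such a pass, written in plain Python over row dicts so it stays self-contained; in a notebook you would run the same three checks against the Spark DataFrame's schema and counts. The function name, column names, and thresholds are illustrative:

```python
def validate_batch(rows, expected_columns, non_null_columns, min_rows):
    """Lightweight post-ingestion checks: row count, schema, null expectations.

    Returns a list of human-readable problems; an empty list means the batch passed.
    """
    problems = []

    # Row-count sanity check: catch empty or suspiciously small loads.
    if len(rows) < min_rows:
        problems.append(f"row count {len(rows)} below expected minimum {min_rows}")

    # Schema check: every expected column must appear in every row.
    for i, row in enumerate(rows):
        missing = expected_columns - row.keys()
        if missing:
            problems.append(f"row {i} missing columns: {sorted(missing)}")
            break  # one schema failure is enough to flag the batch

    # Null expectation: required business keys must not be null.
    for col in non_null_columns:
        nulls = sum(1 for row in rows if row.get(col) is None)
        if nulls:
            problems.append(f"column {col!r} has {nulls} null value(s)")

    return problems
```

For example, `validate_batch(batch, {"order_id", "amount"}, ["order_id"], min_rows=100)` after each ingestion update gives fast feedback without ceremony.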

Preview discipline: run this like a survival checklist

Because Recent data is in preview, your team should operate with explicit guardrails.

Test in development first. Don’t roll workflow assumptions into production muscle memory before your team has used the feature in real edits.

Keep a source-of-truth map. Recent data is convenience. Your documented source map is control. Keep both.

Standardize names now. If a human can confuse two source objects at a glance, they will. Fix names before speed amplifies mistakes.

Define a fallback path. If the recent list doesn’t have what you need, nobody should improvise. Document the manual reconnect path and keep it current.

Review preview behavior monthly. If the feature behavior shifts while in preview, your team should notice fast and adjust intentionally. Assign one owner for “preview watch” each month. Their job: test the core flow, confirm assumptions still hold, alert the team if anything drifts.

The operating model for Spark leads

If you lead a Spark data engineering team, the decision is straightforward.

Use Recent data. Absolutely use it. But use it like a rope, not like wings.

A rope gets you through rough terrain faster when the team is clipped in, communicating, and following route discipline. Wings are what people imagine they have right before they step into empty air.

In practice:

  • Adopt the feature for speed.
  • Keep your documentation for continuity.
  • Keep naming conventions strict for safety.
  • Keep Spark-side validation for quality.
  • Treat preview status as a real risk signal, not legal fine print.

That combination is where this feature becomes meaningful. Not because it’s flashy. Because it removes repeated friction at exactly the point where your team loses focus, burns time, and compounds small mistakes.

In data engineering, the catastrophic failures usually start as tiny oversights repeated at scale. Recent data removes one class of those oversights — the constant re-navigation tax — but only if you wrap it in disciplined operating habits.

One less avoidable stumble on steep ground, so your team can spend its strength on the parts of the climb that actually require judgment.


This post was written with help from anthropic/claude-opus-4-6

What “Execute Power Query Programmatically” Means for Fabric Spark Teams

Somewhere in a Fabric workspace right now, two teams are maintaining the same transformation twice.

The BI team owns it in Power Query. The Spark team rewrote it in PySpark so a notebook could run it on demand. Both versions work. Both versions drift. Both versions break at different times.

That was normal.

Microsoft’s new Execute Query API (preview) is the first real shot at ending that duplication. It lets you execute Power Query (M) through a public REST API from notebooks, pipelines, or any HTTP client, then stream results back in Apache Arrow format.

For Spark teams, this isn’t a minor feature. It changes where transformation logic can live.

What actually shipped

At a technical level, the API is simple:

  • Endpoint: POST /v1/workspaces/{workspaceId}/dataflows/{dataflowId}/executeQuery
  • Input: a queryName, with optional customMashupDocument (full M script)
  • Output: Arrow stream (application/vnd.apache.arrow.stream)

The execution context comes from a Dataflow Gen2 artifact in your workspace. Its configured connections determine what data sources the query can access and which credentials are used.

That single detail matters more than it looks. You’re not just “calling M from Spark.” You’re running M under dataflow-governed connectivity and permissions.
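Concretely, a request built from the fields listed above might look like the sketch below. The base URL assumes the standard Fabric REST host, and the GUIDs, query name, and M fragment are placeholders, not values from the docs:

```python
# Hypothetical IDs for illustration; substitute your own GUIDs.
workspace_id = "11111111-1111-1111-1111-111111111111"
dataflow_id = "22222222-2222-2222-2222-222222222222"

# Assumes the standard Fabric REST base URL.
url = (
    "https://api.fabric.microsoft.com/v1/"
    f"workspaces/{workspace_id}/dataflows/{dataflow_id}/executeQuery"
)

# Minimal body: just the name of a query defined in the dataflow.
request_body = {"queryName": "Customers"}

# Optional: supply a full M script instead of using the stored definition.
# Note the native-query restriction on customMashupDocument.
request_body_with_mashup = {
    "queryName": "Customers",
    "customMashupDocument": "section Section1; shared Customers = ...;",
}
```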

Why Spark engineers should care

Before this API, Spark teams usually had two options:

  • Rewrite M logic in PySpark
  • Or wait for a dataflow refresh and consume the output later

Neither is great. Rewrites create long-term maintenance debt. Refresh handoffs add latency and orchestration fragility.

Now you can execute the transformation inline and keep moving.

A minimal call path looks like this:

import requests
import pyarrow as pa

# url, headers, and request_body are assumed to be built beforehand:
# url targets the executeQuery endpoint, headers carry the bearer token and
# Accept: application/vnd.apache.arrow.stream, and request_body names the query.
response = requests.post(url, headers=headers, json=request_body, stream=True)
response.raise_for_status()
response.raw.decode_content = True  # undo transport compression before Arrow parsing

with pa.ipc.open_stream(response.raw) as reader:
    pandas_df = reader.read_pandas()

spark_df = spark.createDataFrame(pandas_df)


No CSV hop. No JSON schema drift. No custom parsing layer.

The non-negotiable constraints

This feature is useful, but it is not magic. There are hard boundaries.

  1. 90-second timeout
    – Query evaluations must complete within 90 seconds.
    – This is ideal for fast lookups, enrichment, and reference joins, not heavy batch reshaping.

  2. Read-only execution
    – The API executes queries only. It doesn’t support write actions.
    – If your notebook flow assumes “query + write” in one API step, redesign it.

  3. Native query rule for custom mashups
    – customMashupDocument does not allow native database queries.
    – But if a query defined inside the dataflow itself uses native queries, that query can be executed.
    – This distinction will trip people up if they treat inline M and stored dataflow queries as equivalent.

  4. Performance depends on folding and query complexity
    – Bad folding or expensive transformations can burn your 90-second window quickly.
    – You need folding-aware query reviews before production rollout.

Practical rollout plan for Spark teams

If you lead a Fabric Spark team, do this in order.

1) Inventory duplication first

Build a short list of transformations currently duplicated between M and PySpark. Start with transformations that are stable, reused often, and mostly read-oriented.

2) Stand up a dedicated execution dataflow

Create one Dataflow Gen2 artifact specifically for API-backed execution contexts.

  • Keep connections explicit and reviewed
  • Restrict who can modify those connections
  • Treat the artifact like infrastructure, not ad hoc workspace clutter

3) Wrap Execute Query behind one notebook utility

Don’t let every notebook hand-roll HTTP logic. Create one shared helper that handles:

  • token acquisition
  • request construction
  • Arrow stream parsing
  • error handling
  • timeout/response logging

If the API returns 202 (long-running operation), honor Location and Retry-After instead of guessing polling behavior.
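A sketch of such a helper under stated assumptions: the endpoint shape and Arrow content type follow the API surface described earlier, while the function names, the caller-supplied `get_token` callable, and the polling defaults are this sketch's own inventions:

```python
import time

FABRIC_BASE = "https://api.fabric.microsoft.com/v1"  # assumed Fabric REST base URL

def build_execute_query_url(workspace_id, dataflow_id, base=FABRIC_BASE):
    """Build the Execute Query endpoint URL for a dataflow."""
    return f"{base}/workspaces/{workspace_id}/dataflows/{dataflow_id}/executeQuery"

def execute_dataflow_query(workspace_id, dataflow_id, query_name,
                           get_token, timeout_s=120):
    """Execute a stored dataflow query and return the result as a pandas DataFrame.

    get_token is a caller-supplied callable returning a bearer token string.
    """
    # Deferred imports keep this module importable where these aren't installed.
    import requests
    import pyarrow as pa

    headers = {
        "Authorization": f"Bearer {get_token()}",
        "Accept": "application/vnd.apache.arrow.stream",
    }
    response = requests.post(
        build_execute_query_url(workspace_id, dataflow_id),
        headers=headers, json={"queryName": query_name},
        stream=True, timeout=timeout_s,
    )

    # 202 signals a long-running operation: follow Location, wait Retry-After.
    while response.status_code == 202:
        poll_url = response.headers["Location"]
        time.sleep(int(response.headers.get("Retry-After", "5")))
        response = requests.get(poll_url, headers=headers,
                                stream=True, timeout=timeout_s)

    response.raise_for_status()
    response.raw.decode_content = True  # undo transport compression
    with pa.ipc.open_stream(response.raw) as reader:
        return reader.read_pandas()
```

In a notebook, `spark.createDataFrame(execute_dataflow_query(...))` gives the Spark-side handoff; centralizing retries, parsing, and logging here means every caller inherits the same diagnostics.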

4) Add governance checks before scale

Because execution runs under dataflow connection scope, validate:

  • who can execute
  • what connections they indirectly inherit
  • which data sources become reachable through that path

If your governance model assumes notebook identity is the only control plane, this API changes that assumption.

5) Monitor capacity from day one

Microsoft surfaces this usage in Capacity Metrics as “Dataflows Gen2 Run Query API”, billed on the same meter family as Dataflow Gen2 refresh operations. Watch this early so you don’t discover new spend after adoption is already wide.

Where this fits (and where it doesn’t)

Use it when you need:

  • shared transformation logic between BI and engineering
  • fast, read-oriented query execution from Spark/pipelines/apps
  • connector and gateway reach already configured in dataflows

Avoid it when you need:

  • long-running transformations
  • write-heavy jobs
  • mission-critical production paths with zero preview risk tolerance

The REST API docs still mark this as preview and “not recommended for production use.” Treat that warning as real, not ceremonial.

The organizational shift hiding behind the API

The technical win is straightforward: fewer rewrites, faster integration, cleaner data handoffs.

The harder change is social.

When Spark notebooks can directly execute M, ownership lines between BI and data engineering need to be explicit. Who owns business logic? Who owns runtime reliability? Who approves connection scope?

Teams that answer those questions early will move fast.

Teams that don’t will just reinvent the same duplication problem with a new endpoint.


Source notes

This post was written with help from anthropic/claude-opus-4-6.