What “Execute Power Query Programmatically” Means for Fabric Spark Teams

Somewhere in a Fabric workspace right now, two teams are maintaining the same transformation twice.

The BI team owns it in Power Query. The Spark team rewrote it in PySpark so a notebook could run it on demand. Both versions work. Both versions drift. Both versions break at different times.

That was normal.

Microsoft’s new Execute Query API (preview) is the first real shot at ending that duplication. It lets you execute Power Query (M) through a public REST API from notebooks, pipelines, or any HTTP client, then stream results back in Apache Arrow format.

For Spark teams, this isn’t a minor feature. It changes where transformation logic can live.

What actually shipped

At a technical level, the API is simple:

  • Endpoint: POST /v1/workspaces/{workspaceId}/dataflows/{dataflowId}/executeQuery
  • Input: a queryName, with optional customMashupDocument (full M script)
  • Output: Arrow stream (application/vnd.apache.arrow.stream)
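Concretely, a minimal request body is just the query name; the name below is illustrative, not from the docs:

```python
import json

# queryName must match a query defined in the target Dataflow Gen2 artifact.
# customMashupDocument is optional and supplies a full M script inline instead.
# "CustomerLookup" is a hypothetical query name used for illustration.
request_body = {
    "queryName": "CustomerLookup",
}

print(json.dumps(request_body))
```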

The execution context comes from a Dataflow Gen2 artifact in your workspace. Its configured connections determine what data sources the query can access and which credentials are used.

That single detail matters more than it looks. You’re not just “calling M from Spark.” You’re running M under dataflow-governed connectivity and permissions.

Why Spark engineers should care

Before this API, Spark teams usually had two options:

  • Rewrite M logic in PySpark
  • Or wait for a dataflow refresh and consume the output later

Neither is great. Rewrites create long-term maintenance debt. Refresh handoffs add latency and orchestration fragility.

Now you can execute the transformation inline and keep moving.

A minimal call path looks like this:

import requests
import pyarrow as pa

# url, headers, and request_body are constructed as described above;
# stream=True lets pyarrow read the Arrow IPC stream straight off the wire.
response = requests.post(url, headers=headers, json=request_body, stream=True)
response.raise_for_status()

# Deserialize the Arrow stream into pandas, then promote to a Spark DataFrame.
with pa.ipc.open_stream(response.raw) as reader:
    pandas_df = reader.read_pandas()

spark_df = spark.createDataFrame(pandas_df)

No CSV hop. No JSON schema drift. No custom parsing layer.

The non-negotiable constraints

This feature is useful, but it is not magic. There are hard boundaries.

  1. 90-second timeout
    – Query evaluations must complete within 90 seconds.
    – This is ideal for fast lookups, enrichment, and reference joins—not heavy batch reshaping.

  2. Read-only execution
    – The API executes queries only. It doesn’t support write actions.
    – If your notebook flow assumes “query + write” in one API step, redesign it.

  3. Native query rule for custom mashups
    – customMashupDocument does not allow native database queries.
    – But if a query defined inside the dataflow itself uses native queries, that query can be executed.
    – This distinction will trip people if they treat inline M and stored dataflow queries as equivalent.

  4. Performance depends on folding and query complexity
    – Bad folding or expensive transformations can burn your 90-second window quickly.
    – You need folding-aware query reviews before production rollout.
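One practical consequence of constraint 1: set a client-side timeout just above the server's 90-second window so a network stall fails fast instead of hanging the notebook. A small sketch (the slack value is a judgment call, not from the docs):

```python
import requests

EVALUATION_LIMIT_S = 90  # server-side evaluation window from the docs


def post_with_budget(url, headers, body, slack_s=15):
    """POST with a client timeout slightly above the 90 s server window.

    Server-side timeouts surface as HTTP errors; requests.Timeout here
    signals a network stall, and either way the caller should consider
    moving the transformation to a refresh + read path instead.
    """
    return requests.post(
        url,
        headers=headers,
        json=body,
        stream=True,
        timeout=EVALUATION_LIMIT_S + slack_s,
    )
```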

Practical rollout plan for Spark teams

If you lead a Fabric Spark team, do this in order.

1) Inventory duplication first

Build a short list of transformations currently duplicated between M and PySpark. Start with transformations that are stable, reused often, and mostly read-oriented.

2) Stand up a dedicated execution dataflow

Create one Dataflow Gen2 artifact specifically for API-backed execution contexts.

  • Keep connections explicit and reviewed
  • Restrict who can modify those connections
  • Treat the artifact like infrastructure, not ad hoc workspace clutter

3) Wrap Execute Query behind one notebook utility

Don’t let every notebook hand-roll HTTP logic. Create one shared helper that handles:

  • token acquisition
  • request construction
  • Arrow stream parsing
  • error handling
  • timeout/response logging

If the API returns 202 (long-running operation), honor Location and Retry-After instead of guessing polling behavior.

4) Add governance checks before scale

Because execution runs under dataflow connection scope, validate:

  • who can execute
  • what connections they indirectly inherit
  • which data sources become reachable through that path

If your governance model assumes notebook identity is the only control plane, this API changes that assumption.

5) Monitor capacity from day one

Microsoft surfaces this usage in Capacity Metrics as “Dataflows Gen2 Run Query API”, billed on the same meter family as Dataflow Gen2 refresh operations. Watch this early so you don’t discover new spend after adoption is already wide.

Where this fits (and where it doesn’t)

Use it when you need:

  • shared transformation logic between BI and engineering
  • fast, read-oriented query execution from Spark/pipelines/apps
  • connector and gateway reach already configured in dataflows

Avoid it when you need:

  • long-running transformations
  • write-heavy jobs
  • mission-critical production paths with zero preview risk tolerance

The REST API docs still mark this as preview and “not recommended for production use.” Treat that warning as real, not ceremonial.

The organizational shift hiding behind the API

The technical win is straightforward: fewer rewrites, faster integration, cleaner data handoffs.

The harder change is social.

When Spark notebooks can directly execute M, ownership lines between BI and data engineering need to be explicit. Who owns business logic? Who owns runtime reliability? Who approves connection scope?

Teams that answer those questions early will move fast.

Teams that don’t will just reinvent the same duplication problem with a new endpoint.


Source notes

This post was written with help from anthropic/claude-opus-4-6.
