ExtractLabel just changed how your Spark pipelines should handle unstructured data

Every data engineer eventually inherits the same cursed pipeline.

Upstream sends you a blob of human text. Somewhere in that blob are the exact facts your downstream systems need: product name, issue category, requested resolution, timeline, who did what, and when. The facts are there. They are just buried in prose written by sleep-deprived humans, copied from emails, and occasionally typed from a phone in an airport parking lot.

For years, we handled this with a pile of hacks:

  • Regex that works until one user adds a comma
  • Hand-rolled NER that drifts quietly into uselessness
  • LLM prompts that return valid JSON on Monday, improv theater on Tuesday

Then we pretend this is fine by writing 300 lines of “normalization” code downstream, plus defensive checks, plus retry logic, plus enough if statements to make your future self hate your past self.

That is the old world.

ExtractLabel is the first Fabric AI Functions primitive that treats extraction like a contract instead of a vibe. You define the shape once in JSON Schema. The extraction step returns that shape. Your pipeline gets predictable structure instead of model improv.

If you run Spark workloads in Fabric, this matters immediately.

What AI Functions already gave you (and where it fell short)

Before ExtractLabel, the quick path looked like this:

df["text"].ai.extract("name", "profession", "city")

For exploration, that is great. For production, it is a trap.

Prototype extraction asks, “Can the model find useful fields?”
Production extraction asks, “Can every downstream consumer trust type, shape, and vocabulary every single run?”

Those are different questions.

The basic label call is lightweight and convenient, but it leaves the hardest part unsolved: schema discipline. If your routing logic expects one of four categories, free-form output creates entropy. If your analytics expect arrays and extraction returns comma-separated strings, you are writing cleanup code forever. If optional fields are not explicitly nullable, models tend to fill blanks with plausible nonsense.

Model understanding was never the bottleneck. Contract reliability was.
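
To make the entropy problem concrete, here is a toy sketch (plain Python, not Fabric code; all names invented) of the normalization layer that free-form category output forces you to maintain:

```python
# Toy illustration: without an enum contract, the same category arrives in
# many free-form spellings, and downstream code grows a normalization map
# that is never quite complete.
RAW_CATEGORIES = ["Defect", "defective unit", "broken on arrival", "DEFECT "]

NORMALIZE = {
    "defect": "defect",
    "defective unit": "defect",
    "broken on arrival": "damage_in_transit",
    # ...every new model phrasing needs another entry here
}

def normalize(raw: str) -> str:
    # Anything the map has not seen yet silently falls through to "other".
    return NORMALIZE.get(raw.strip().lower(), "other")

print([normalize(c) for c in RAW_CATEGORIES])
# → ['defect', 'defect', 'damage_in_transit', 'defect']
```

With a schema-enforced enum, this entire mapping layer disappears.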

ExtractLabel: the schema contract your pipeline needs

ExtractLabel gives you an explicit schema boundary between unstructured input and structured output. In pandas you import from synapse.ml.aifunc; in PySpark you import from synapse.ml.spark.aifunc. The core pattern is the same: define one label with object properties, requirements, and constraints.

Concrete example, using warranty claims:

from synapse.ml.aifunc import ExtractLabel

claim_schema = ExtractLabel(
    label="claim",
    max_items=1,
    type="object",
    description="Extract structured warranty claim information",
    properties={
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "problem_category": {
                "type": "string",
                "enum": ["defect", "damage_in_transit", "missing_part", "other"],
                "description": "defect=stopped working or malfunctioning, damage_in_transit=arrived damaged, missing_part=something not included"
            },
            "problem_summary": {
                "type": "string",
                "description": "Max 20 words. Summarize the core issue."
            },
            "time_owned": {"type": ["string", "null"]},
            "troubleshooting_tried": {
                "type": "array",
                "items": {"type": "string"}
            },
            "requested_resolution": {
                "type": "string",
                "enum": ["replacement", "refund", "repair", "other"]
            }
        },
        "required": ["product_name", "problem_category", "problem_summary",
                     "time_owned", "troubleshooting_tried", "requested_resolution"],
        "additionalProperties": False
    }
)

df[["claim"]] = df["text"].ai.extract(claim_schema)

Input text:

“The smart thermostat stopped turning on after 12 days. I tried a reset and new batteries. Please replace it.”

Structured output:

{
    "product_name": "smart thermostat",
    "problem_category": "defect",
    "problem_summary": "Thermostat stopped turning on after 12 days",
    "time_owned": "12 days",
    "troubleshooting_tried": ["reset", "new batteries"],
    "requested_resolution": "replacement"
}

That is the difference: you are no longer extracting “some fields.” You are producing an object your systems can rely on.
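
One way to see what the contract buys you is to write down the check it implies. The sketch below is a minimal, dependency-free stand-in (a real pipeline would validate with a proper JSON Schema library such as jsonschema); field names mirror the claim schema above:

```python
# Minimal contract check (sketch only; use a real JSON Schema validator in
# production). Mirrors the required fields and enums of the claim schema.
REQUIRED = {"product_name", "problem_category", "problem_summary",
            "time_owned", "troubleshooting_tried", "requested_resolution"}
CATEGORY_ENUM = {"defect", "damage_in_transit", "missing_part", "other"}
RESOLUTION_ENUM = {"replacement", "refund", "repair", "other"}

def check_claim(claim: dict) -> list[str]:
    errors = []
    missing = REQUIRED - claim.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if claim.get("problem_category") not in CATEGORY_ENUM:
        errors.append("problem_category outside enum")
    if claim.get("requested_resolution") not in RESOLUTION_ENUM:
        errors.append("requested_resolution outside enum")
    if not isinstance(claim.get("troubleshooting_tried"), list):
        errors.append("troubleshooting_tried must be an array")
    return errors

claim = {
    "product_name": "smart thermostat",
    "problem_category": "defect",
    "problem_summary": "Thermostat stopped turning on after 12 days",
    "time_owned": "12 days",
    "troubleshooting_tried": ["reset", "new batteries"],
    "requested_resolution": "replacement",
}
print(check_claim(claim))  # → []
```

Because ExtractLabel enforces the shape at extraction time, checks like this become assertions rather than repair code.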

The five schema features that actually matter

Most teams will over-focus on “LLM extraction” and under-focus on schema design. That is backwards. The model is only half the system. The schema is what makes it production-safe.

1) Nullable types

Use explicit nullable definitions for fields that may not exist in the source text:

"time_owned": {"type": ["string", "null"]}

If you do not allow null, the model is pressured to invent. Nullable fields reduce that pressure.

2) Enums for category control

When downstream logic expects bounded values, enforce them with enum.

That turns category assignment from fuzzy language output into controlled vocabulary. If your pipeline routes by problem_category, this is non-negotiable.
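
A bounded category makes routing a total dictionary dispatch. Here is a hedged sketch (team names are invented) of what that looks like downstream:

```python
# Sketch: because problem_category is enum-constrained, routing is a total
# dictionary lookup instead of string matching with a fallback mess.
# Team names below are hypothetical.
ROUTES = {
    "defect": "warranty_team",
    "damage_in_transit": "logistics_team",
    "missing_part": "fulfillment_team",
    "other": "triage_queue",
}

def route(claim: dict) -> str:
    # The schema guarantees problem_category is one of the four keys above.
    return ROUTES[claim["problem_category"]]

print(route({"problem_category": "missing_part"}))  # → fulfillment_team
```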

3) Arrays for true multi-value extraction

If a claim can include multiple troubleshooting actions, represent it as an array. Do not accept packed strings and split later.

Array semantics belong in extraction, not in cleanup jobs.
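
A small example of why split-later is lossy: once items are packed into one string, an item that itself contains a comma cannot be recovered.

```python
# Sketch: packed strings destroy item boundaries. Here "powered off, then on"
# is one troubleshooting action, but str.split cannot know that.
packed = "reset, powered off, then on, new batteries"
print(packed.split(", "))
# → ['reset', 'powered off', 'then on', 'new batteries']  (4 items, wrong)

# With array semantics in the schema, the boundary is explicit:
extracted = ["reset", "powered off, then on", "new batteries"]
print(len(extracted))  # → 3
```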

4) Descriptions as extraction instructions

Descriptions are not decorative comments. They are guidance for the extraction step.

Use them to define edge behavior, clarify enum intent, and enforce concise summaries. Most quality gains come from this field, not from prompt wording elsewhere.

5) Nested objects for real-world structure

Complex payloads are rarely flat. If your domain includes sub-entities, model them as nested objects now. Flattening everything into top-level strings feels easier in week one and becomes technical debt by week six.
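
As a hypothetical illustration (this sub-entity is not part of the claim schema above), a purchase record modeled as a nested object might look like this:

```python
# Hypothetical schema fragment: a purchase sub-entity modeled as a nested
# object rather than flattened into top-level strings.
purchase_property = {
    "purchase": {
        "type": "object",
        "properties": {
            "retailer": {"type": ["string", "null"]},
            "purchase_date": {
                "type": ["string", "null"],
                "description": "ISO 8601 date if stated, else null",
            },
            "order_number": {"type": ["string", "null"]},
        },
        "required": ["retailer", "purchase_date", "order_number"],
        "additionalProperties": False,
    }
}
```

Dropping this into the top-level properties keeps purchase facts grouped, typed, and individually nullable instead of smeared across flat string fields.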

What this means for your Spark pipelines right now

If your team already runs text extraction in Fabric pipelines, ExtractLabel gives you a clean migration path with immediate payback in reliability.

Practical rollout plan:

  1. Find the pain first. Audit extraction steps where downstream code spends time repairing output shape, casing, and categories. Those are your highest-ROI migrations.

  2. Version schemas like code. Store schema definitions in source control with explicit version tags. Treat schema changes as contract changes, not casual edits.

  3. Use one extraction contract per domain task. Do not build one giant universal schema. Warranty claims, support tickets, and contract clauses deserve separate schemas with domain-specific enums and guidance.

  4. Prefer model-based schema authoring as complexity grows. Once schemas get large, hand-editing JSON gets brittle. Define structures in typed Python models and generate JSON Schema from there. You get stronger review discipline and fewer silent mistakes.

  5. Build an evaluation harness before broad rollout. ExtractLabel enforces structure; it does not guarantee semantic correctness. Keep a labeled sample set, score extraction quality regularly, and review drift.

  6. Tune operational settings with real workload telemetry. Concurrency, retry behavior, and throughput limits should be validated in your environment, not assumed from defaults. Measure error columns and latency under realistic load before declaring victory.
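
To make step 4 concrete: in practice most teams reach for Pydantic's model_json_schema(), but the idea fits in a dependency-free sketch. The hand-rolled generator below (covering only two field shapes, names invented) shows how typed models become JSON Schema:

```python
# Dependency-free sketch of "schemas from typed models". Pydantic's
# model_json_schema() does this properly; this toy generator only handles
# plain scalars and Optional[...] fields, to show the idea.
from dataclasses import dataclass, fields
from typing import Optional, Union, get_args, get_origin

PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

@dataclass
class Claim:
    product_name: str
    time_owned: Optional[str]  # becomes ["string", "null"]

def to_json_schema(cls) -> dict:
    props = {}
    for f in fields(cls):
        if get_origin(f.type) is Union:  # Optional[X] is Union[X, None]
            inner = [a for a in get_args(f.type) if a is not type(None)][0]
            props[f.name] = {"type": [PY_TO_JSON[inner], "null"]}
        else:
            props[f.name] = {"type": PY_TO_JSON[f.type]}
    return {
        "type": "object",
        "properties": props,
        "required": [f.name for f in fields(cls)],
        "additionalProperties": False,
    }

print(to_json_schema(Claim)["properties"]["time_owned"])
# → {'type': ['string', 'null']}
```

The payoff is review discipline: a schema change is now a typed-model diff, not a hand-edit deep inside nested JSON.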

Verify runtime, capacity, and governance prerequisites against current Fabric documentation in your tenant before rollout. Platform details move. Your production runbooks should not rely on stale assumptions.

Migration risks worth thinking about

ExtractLabel is strong, but this is still LLM-powered extraction. You need grown-up operating discipline.

Model behavior drift

Even with stable schema shape, semantic interpretation can shift over time. A phrase that mapped to defect last month might map to other after a model update.

Mitigation: maintain a regression set and run periodic quality checks. Contract shape is necessary. Accuracy monitoring is mandatory.
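
A regression set can be as simple as re-extracting labeled rows and scoring enum fields. The sketch below uses a stand-in labeled sample and a stub extractor; in a real harness, extract would call the actual pipeline:

```python
# Sketch of a minimal regression check. "labeled_sample" stands in for a
# hand-labeled set kept in source control; "extract" would invoke the real
# extraction step.
labeled_sample = [
    {"text": "thermostat won't power on", "expected_category": "defect"},
    {"text": "box arrived crushed",       "expected_category": "damage_in_transit"},
]

def score(sample, extract) -> float:
    hits = sum(extract(row["text"]) == row["expected_category"] for row in sample)
    return hits / len(sample)

# With a stub extractor that always answers "defect":
print(score(labeled_sample, lambda text: "defect"))  # → 0.5
```

Run this on a schedule and alert on drops; a score dip after a model update is your drift signal.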

Cost surprises at volume

Row-wise AI extraction scales linearly with data volume. Teams underestimate this, then panic when ingestion spikes.

Mitigation: test on representative daily volume, not a toy sample. Budget for peak days, not median days.

Schema evolution pain

You will add fields. You will split categories. You will regret one enum name. That is normal.

Mitigation: include schema version metadata in outputs and plan how downstream consumers handle mixed historical versions.
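
One lightweight way to do this (version string and field name are illustrative, not a Fabric convention):

```python
# Sketch: stamp every extracted record with the schema version so mixed
# historical outputs stay distinguishable downstream.
SCHEMA_VERSION = "1.0.0"  # hypothetical tag, versioned alongside the schema

def with_version(claim: dict) -> dict:
    return {"schema_version": SCHEMA_VERSION, **claim}

record = with_version({"product_name": "smart thermostat"})
print(record["schema_version"])  # → 1.0.0
```

Downstream consumers can then branch on schema_version instead of sniffing field shapes.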

False confidence from “valid JSON”

Teams see valid typed output and stop questioning semantics. That is how bad extractions get into trusted dashboards.

Mitigation: sample manually, review periodically, and keep humans in the QA loop for high-impact fields.

When to use ExtractLabel vs. other approaches

Use ExtractLabel when all of these are true:

  • Input is unstructured text
  • Output must be typed and schema-conforming
  • You need extraction embedded in Fabric data workflows

Keep regex when the task is deterministic and mechanical (IDs, fixed-format dates, known token patterns).
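
For instance, a fixed-format identifier (the "ORD-" plus eight digits pattern below is an invented example) is a regex job, not an LLM job:

```python
import re

# Sketch: a deterministic, fixed-format token. Regex here is free, exact,
# and trivially testable; no model call needed.
ORDER_ID = re.compile(r"\bORD-\d{8}\b")

text = "Customer referenced ORD-12345678 in the claim email."
match = ORDER_ID.search(text)
print(match.group() if match else None)  # → ORD-12345678
```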

Keep specialized NER pipelines when domain vocabulary is unusual, latency requirements are strict, or inference cost constraints are severe.

Use document-native extraction tools when layout matters (forms, scans, tables in images/PDFs). Text-column extraction will not recover geometry it never saw.

If your instinct is “we can just prompt harder,” stop. That is how you build a fragile system that passes demos and fails operations.

The bottom line

ExtractLabel moves Fabric extraction from improvisation to contracts.

The shiny part is one line of code:

df[["claim"]] = df["text"].ai.extract(claim_schema)

The valuable part is everything you encode in the schema: allowed values, nullability, nested structure, and descriptive guidance for edge cases.

Do that work once, and your downstream pipeline stops behaving like a cleanup crew.

Less duct tape, more reliable data.


This post was written with help from anthropic/claude-opus-4-6