Spark tuning has a way of chewing up time: you start with something that “should be fine,” performance is off, costs creep up, and suddenly you’re deep in configs, Spark UI, and tribal knowledge trying to figure out what actually matters.
That’s why I’m excited to highlight sparkwise, an open-source Python package created by Santhosh Kumar Ravindran, one of my direct reports here at Microsoft. Santhosh built sparkwise to make Spark optimization in Microsoft Fabric less like folklore and more like a repeatable workflow: automated diagnostics, session profiling, and actionable recommendations to help teams drive better price-performance without turning every run into an investigation.
If you’ve ever thought, “I know something’s wrong, but I can’t quickly prove what to change,” sparkwise is aimed squarely at that gap. (PyPI)
As of January 5, 2026, the latest release is sparkwise 1.4.2 on PyPI. (PyPI)
The core idea: stop guessing, start diagnosing
Spark tuning often fails for two reasons:
Too many knobs (Spark, Delta, Fabric-specific settings, runtime behavior).
Not enough feedback (it’s hard to translate symptoms into the few changes that actually matter).
sparkwise attacks both.
It positions itself as an “automated Data Engineering specialist for Apache Spark on Microsoft Fabric,” offering:
Intelligent diagnostics
Configuration recommendations
Comprehensive session profiling
…so you can get to the best price/performance outcome without turning every notebook run into a science project. (PyPI)
Why sparkwise exists (and the problems it explicitly targets)
From the project description, sparkwise focuses on the stuff that reliably burns time and money in real Fabric Spark work:
Cost optimization: detect configurations that waste capacity and extend runtime (PyPI)
Performance optimization: validate and enable Fabric-specific acceleration paths like Native Engine and resource profiles (PyPI)
Faster iteration: detect Starter Pool blockers that force slower cold starts (3–5 minutes is called out directly) (PyPI)
Learning & clarity: interactive Q&A across 133 Spark/Delta/Fabric configurations (PyPI)
Workload understanding: profiling across sessions, executors, jobs, and resources (PyPI)
Decision support: priority-ranked recommendations with impact analysis (PyPI)
If you’ve ever thought “I know something is off, but I can’t prove which change matters,” this is aimed squarely at you.
What you get: a feature tour that maps to real-world Spark pain
sparkwise’s feature set is broad, but it’s not random. It clusters nicely into a few “jobs to be done.”
1) Automated diagnostics (the fast “what’s wrong?” pass)
The diagnostics layer checks a bunch of high-impact areas, including:
Native Execution Engine: verifies Velox usage and detects fallbacks to row-based processing (PyPI)
Spark compute: analyzes Starter vs Custom Pool usage and flags immutable configs (PyPI)
Data skew detection: identifies imbalanced task distributions (PyPI)
Paired with the storage analysis covered below, that’s a clean “ops loop” for keeping your Spark workloads and Delta tables healthy.
A realistic “first hour” workflow I’d recommend
If you’re trying sparkwise on a real Fabric notebook today, here’s a practical order of operations:
Run diagnose.analyze() first. Use it as your “triage” pass to catch the high-impact misconfigs (Native Engine fallback, AQE off, Starter Pool blockers). (PyPI)
Use ask.config() for any red/yellow item you don’t fully understand. The point is speed: read the explanation in context and decide. (PyPI)
Profile the session. If the job is still slow or expensive after the obvious fixes, profile it and look for the real culprit: skew, shuffle pressure, poor parallelism, memory pressure. (PyPI)
If the job smells like skew, use the advanced skew detection, especially for joins and wide aggregations. (PyPI)
If your tables are growing, run storage analysis early. Small files and weak partitioning quietly tax everything downstream. (PyPI)
That flow is how you turn “tuning” from an art project into a checklist.
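In notebook form, that first hour is only a few cells. Here’s a minimal sketch, assuming the module layout matches the calls named above (check the sparkwise docs for the exact imports and signatures; the config name passed to ask.config() is just an example):
%pip install sparkwise

from sparkwise import diagnose, ask   # assumed import path

# Triage pass: Native Engine fallback, AQE, Starter Pool blockers, and so on.
diagnose.analyze()

# Explain any flagged setting in context (argument shown here is illustrative).
ask.config("spark.sql.adaptive.enabled")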
Closing: why this matters for Fabric teams
I’m amplifying sparkwise because it’s the kind of contribution that scales beyond the person who wrote it. Santhosh took hard-earned, real-world Fabric Spark tuning experience and turned it into something other engineers can use immediately — a practical way to spot waste, unblock faster iteration, and make smarter performance tradeoffs.
If your team runs Fabric Spark workloads regularly, treat sparkwise like a lightweight tuning partner:
install it,
run the diagnostics,
act on one recommendation,
measure the improvement,
repeat.
And if you end up with feedback or feature ideas, even better — that’s how tools like this get sharper and more broadly useful.
Microsoft Fabric makes it incredibly easy to spin up Spark workloads: notebooks, Lakehouse pipelines, dataflows, SQL + Spark hybrid architectures—the whole buffet.
What’s still hard? Knowing why a given Spark job is slow, expensive, or flaky.
A Lakehouse pipeline starts timing out.
A notebook that used to finish in 5 minutes is now taking 25.
Costs spike because one model training job is shuffling half the lake.
You open the Spark UI, click around a few stages, stare at shuffle graphs, and say the traditional words of Spark debugging:
“Huh.”
This is where an AI assistant should exist.
In this post, we’ll walk through how to build exactly that for Fabric Spark: a Job Doctor that:
Reads Spark telemetry from your Fabric environment
Detects issues like skew, large shuffles, spill, and bad configuration
Uses a large language model (LLM) to explain what went wrong
Produces copy-pasteable fixes in Fabric notebooks / pipelines
Runs inside Fabric using Lakehouses, notebooks, and Azure AI models
This is not a fake product announcement. This is a blueprint you can actually build.
What Is the Fabric “Job Doctor”?
At a high level, the Job Doctor is:
A Fabric-native analytics + AI layer that continuously reads Spark job history, detects common performance anti-patterns, and generates human-readable, prescriptive recommendations.
Under the hood, each raw telemetry record is either a metrics record or an EventLog record with a payload that looks like the corresponding Spark listener event.
To build a Job Doctor, you’ll:
Read the JSON lines into Fabric Spark
Explode / parse the properties payload
Aggregate per-task metrics into per-stage metrics for each application
We’ll skip the exact parsing details (they depend on how you set up the emitter and which events/metrics you enable) and assume that after a normalization job, you have a table with one row per (applicationId, stageId, taskId).
That’s what the next sections use.
3. Capturing Query Plans in Fabric (Optional, but Powerful)
Spark query plans are gold when you’re trying to answer why a stage created a huge shuffle or why a broadcast join didn’t happen.
There isn’t yet a first-class “export query plan as JSON” API in PySpark, but in Fabric notebooks you can use a (semi-internal) trick that works today:
import json
df = ... # some DataFrame you care about
# Advanced / internal: this reaches into the JVM objects, so it isn't a public, stable API
# and the exact accessors can shift between Spark versions.
plan_json = json.loads(df._jdf.queryExecution().optimizedPlan().toJSON())
You can also log the human-readable plan:
df.explain(mode="formatted") # documented mode, prints a detailed plan
To persist the JSON plan for the Job Doctor, tie it to the Spark application ID:
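A minimal sketch of that persistence step, assuming the job_doctor schema used throughout this post already exists in your Lakehouse:
from pyspark.sql import Row

app_id = spark.sparkContext.applicationId   # ties the plan to this Spark application

spark.createDataFrame(
    [Row(applicationId=app_id, plan_json=json.dumps(plan_json))]
).write.mode("append").saveAsTable("job_doctor.query_plans")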
Back in the raw diagnostics, each record also carries a properties payload (nested JSON with stage/task/metric detail).
The normalization step (which you can run as a scheduled pipeline) should:
Filter down to metrics/events relevant for performance (e.g. task / stage metrics)
Extract stageId, taskId, executorRunTime, shuffleReadBytes, etc., into top-level columns
Persist the result as job_doctor.task_metrics (or similar)
For the rest of this post, we’ll assume you’ve already done that and have a table with columns:
applicationId
stageId
taskId
executorRunTime
shuffleReadBytes
shuffleWriteBytes
memoryBytesSpilled
diskBytesSpilled
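If you haven’t built that normalization job yet, a rough sketch looks like this. The input path and the field names inside properties are placeholders; they depend entirely on how your emitter is configured:
from pyspark.sql import functions as F

# Placeholder path: wherever your Lakehouse shortcut exposes the raw emitter output.
raw = spark.read.json("Files/spark-diagnostics/")

task_metrics = (
    raw
    # Keep only task-level metric records; the filter column/value depend on your emitter.
    # If properties arrives as a JSON string rather than a struct, parse it with from_json first.
    .where(F.col("category") == "TaskMetrics")
    .select(
        "applicationId",
        F.col("properties.stageId").alias("stageId"),
        F.col("properties.taskId").alias("taskId"),
        F.col("properties.executorRunTime").cast("long").alias("executorRunTime"),
        F.col("properties.shuffleReadBytes").cast("long").alias("shuffleReadBytes"),
        F.col("properties.shuffleWriteBytes").cast("long").alias("shuffleWriteBytes"),
        F.col("properties.memoryBytesSpilled").cast("long").alias("memoryBytesSpilled"),
        F.col("properties.diskBytesSpilled").cast("long").alias("diskBytesSpilled"),
    )
)

task_metrics.write.mode("overwrite").saveAsTable("job_doctor.task_metrics")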
Aggregating Stage Metrics in Fabric
Now we want to collapse per-task metrics into per-stage metrics per application.
In a Fabric notebook:
from pyspark.sql import functions as F
task_metrics = spark.table("job_doctor.task_metrics")
stage_metrics = (
task_metrics
.groupBy("applicationId", "stageId")
.agg(
F.countDistinct("taskId").alias("num_tasks"),
F.sum("executorRunTime").alias("total_task_runtime_ms"),
# Depending on Spark version, you may need percentile_approx instead
F.expr("percentile(executorRunTime, 0.95)").alias("p95_task_runtime_ms"),
F.max("executorRunTime").alias("max_task_runtime_ms"),
F.sum("shuffleReadBytes").alias("shuffle_read_bytes"),
F.sum("shuffleWriteBytes").alias("shuffle_write_bytes"),
F.sum("memoryBytesSpilled").alias("memory_spill_bytes"),
F.sum("diskBytesSpilled").alias("disk_spill_bytes"),
)
.withColumn(
"skew_ratio",
F.col("max_task_runtime_ms") /
F.when(F.col("p95_task_runtime_ms") == 0, 1).otherwise(F.col("p95_task_runtime_ms"))
)
.withColumn("shuffle_read_mb", F.col("shuffle_read_bytes") / (1024**2))
.withColumn("shuffle_write_mb", F.col("shuffle_write_bytes") / (1024**2))
.withColumn(
"spill_mb",
(F.col("memory_spill_bytes") + F.col("disk_spill_bytes")) / (1024**2)
)
)
stage_metrics.write.mode("overwrite").saveAsTable("job_doctor.stage_metrics")
This gives you a Fabric Lakehouse table with:
skew_ratio
shuffle_read_mb
shuffle_write_mb
spill_mb
p95_task_runtime_ms
num_tasks, total_task_runtime_ms, etc.
You can run this notebook:
On a schedule via a Data Pipeline
Or as a Data Engineering job configured in the workspace
Part 3: Adding a Rule Engine Inside Fabric
Now that the metrics are in a Lakehouse table, let’s add a simple rule engine in Python.
This will run in a Fabric notebook (or job) and write out issues per stage.
from pyspark.sql import Row, functions as F
stage_metrics = spark.table("job_doctor.stage_metrics")
# For simplicity, we'll collect to the driver here.
# This is fine if you don't have thousands of stages.
# For very large workloads, you'd instead do this via a UDF / mapInPandas / explode.
stage_rows = stage_metrics.collect()
Define some basic rules:
def detect_issues(stage_row):
issues = []
# 1. Skew detection
if stage_row.skew_ratio and stage_row.skew_ratio > 5:
issues.append({
"issue_id": "SKEWED_STAGE",
"severity": "High",
"details": f"Skew ratio {stage_row.skew_ratio:.1f}"
})
# 2. Large shuffle
total_shuffle_mb = (stage_row.shuffle_read_mb or 0) + (stage_row.shuffle_write_mb or 0)
if total_shuffle_mb > 10_000: # > 10 GB
issues.append({
"issue_id": "LARGE_SHUFFLE",
"severity": "High",
"details": f"Total shuffle {total_shuffle_mb:.1f} MB"
})
# 3. Excessive spill
if (stage_row.spill_mb or 0) > 1_000: # > 1 GB
issues.append({
"issue_id": "EXCESSIVE_SPILL",
"severity": "Medium",
"details": f"Spill {stage_row.spill_mb:.1f} MB"
})
return issues
Apply the rules and persist the output:
issue_rows = []
for r in stage_rows:
for issue in detect_issues(r):
issue_rows.append(Row(
applicationId=r.applicationId,
stageId=r.stageId,
issue_id=issue["issue_id"],
severity=issue["severity"],
details=issue["details"]
))
issues_df = spark.createDataFrame(issue_rows)
issues_df.write.mode("overwrite").saveAsTable("job_doctor.stage_issues")
Now you have a table of Spark issues detected per run inside your Lakehouse.
Later, the LLM will use these as structured hints.
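And if you do end up with thousands of stages, the same rules translate directly into DataFrame logic so nothing is collected to the driver. A rough sketch (severity and details columns omitted for brevity):
from pyspark.sql import functions as F

total_shuffle_mb = (
    F.coalesce(F.col("shuffle_read_mb"), F.lit(0.0)) +
    F.coalesce(F.col("shuffle_write_mb"), F.lit(0.0))
)

issues_df = (
    stage_metrics
    # Build one array of candidate issue ids per stage, then explode and drop the nulls.
    .withColumn("issue_id", F.explode(F.array(
        F.when(F.col("skew_ratio") > 5, F.lit("SKEWED_STAGE")),
        F.when(total_shuffle_mb > 10_000, F.lit("LARGE_SHUFFLE")),
        F.when(F.coalesce(F.col("spill_mb"), F.lit(0.0)) > 1_000, F.lit("EXCESSIVE_SPILL")),
    )))
    .where(F.col("issue_id").isNotNull())
    .select("applicationId", "stageId", "issue_id")
)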
Part 4: Bringing in the LLM — Turning Metrics into Diagnosis
So far, everything has been pure Spark in Fabric.
Now we want a model (e.g., Azure AI “Models as a Service” endpoint or Azure OpenAI) to turn:
job_doctor.stage_metrics
job_doctor.stage_issues
job_doctor.spark_conf
job_doctor.query_plans
into an actual diagnosis sheet a human can act on.
In Fabric, this is simplest from a Spark notebook using a Python HTTP client.
Below, I’ll show the pattern using an Azure AI serverless model endpoint (the one that uses model: "gpt-4.1" in the body).
1. Prepare the Prompt Payload
First, fetch the data for a single Spark application:
import json
from pyspark.sql import functions as F
app_id = "app-20240501123456-0001" # however you pick which run to diagnose
stages_df = spark.table("job_doctor.stage_metrics").where(F.col("applicationId") == app_id)
issues_df = spark.table("job_doctor.stage_issues").where(F.col("applicationId") == app_id)
conf_df = spark.table("job_doctor.spark_conf").where(F.col("applicationId") == app_id)
plans_df = spark.table("job_doctor.query_plans").where(F.col("applicationId") == app_id)
stages_json = stages_df.toPandas().to_dict(orient="records")
issues_json = issues_df.toPandas().to_dict(orient="records")
conf_json = conf_df.toPandas().to_dict(orient="records")
plans_json = plans_df.toPandas().to_dict(orient="records") # likely 0 or 1 row
Then build a compact but informative prompt:
prompt = f"""
You are an expert in optimizing Apache Spark jobs running on Microsoft Fabric.
Here is summarized telemetry for one Spark application (applicationId={app_id}):
Stage metrics (JSON):
{json.dumps(stages_json, indent=2)}
Detected issues (JSON):
{json.dumps(issues_json, indent=2)}
Spark configuration (key/value list):
{json.dumps(conf_json, indent=2)}
Query plans (optional, may be empty):
{json.dumps(plans_json, indent=2)}
Your tasks:
1. Identify the top 3–5 performance issues for this run.
2. For each, explain the root cause in plain language.
3. Provide concrete fixes tailored for Fabric Spark, including:
- spark.conf settings (for notebooks/jobs)
- suggestions for pipeline settings where relevant
- SQL/DataFrame code snippets
4. Estimate likely performance impact (e.g., "30–50% reduction in runtime").
5. Call out any risky or unsafe changes that should be tested carefully.
Return your answer as markdown.
"""
2. Call an Azure AI Model from Fabric Spark
For the serverless “Models as a Service” endpoint, the pattern looks like this:
import os
import requests
# Example: using Azure AI Models as a Service
# AZURE_AI_ENDPOINT might look like: https://models.inference.ai.azure.com
AZURE_AI_ENDPOINT = os.environ["AZURE_AI_ENDPOINT"]
AZURE_AI_KEY = os.environ["AZURE_AI_KEY"]
MODEL = "gpt-4.1" # or whatever model you've enabled
headers = {
"Content-Type": "application/json",
"api-key": AZURE_AI_KEY,
}
body = {
"model": MODEL,
"messages": [
{"role": "system", "content": "You are a helpful assistant for optimizing Spark jobs on Microsoft Fabric."},
{"role": "user", "content": prompt},
],
}
resp = requests.post(
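    # The exact path and auth header vary by endpoint type (some endpoints expect
    # "Authorization: Bearer <key>" and/or an api-version query string), so check
    # your endpoint's docs if you get a 401 or 404 here.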
f"{AZURE_AI_ENDPOINT}/openai/chat/completions",
headers=headers,
json=body,
)
resp.raise_for_status()
diagnosis = resp.json()["choices"][0]["message"]["content"]
If you instead use a provisioned Azure OpenAI resource, the URL shape is slightly different (you call /openai/deployments/<deploymentName>/chat/completions with an api-version query parameter and omit the model field), but the rest of the logic is identical.
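For reference, the provisioned Azure OpenAI shape looks roughly like this; the environment variable names, deployment name, and api-version below are placeholders for whatever you’ve actually deployed:
AOAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]   # e.g. https://<resource>.openai.azure.com
DEPLOYMENT = "gpt-4.1"                                # placeholder: your deployment name

resp = requests.post(
    f"{AOAI_ENDPOINT}/openai/deployments/{DEPLOYMENT}/chat/completions",
    params={"api-version": "2024-06-01"},             # check the currently supported versions
    headers={"Content-Type": "application/json", "api-key": os.environ["AZURE_OPENAI_KEY"]},
    json={"messages": body["messages"]},              # note: no "model" field here
)
resp.raise_for_status()
# Response parsing is the same as above.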
At this point, diagnosis is markdown you can:
Render inline in the notebook with displayHTML
Save into a Lakehouse table
Feed into a Fabric semantic model for reporting
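For the quick inline view, a minimal sketch (this just wraps the raw markdown in a pre block; rendering it as real HTML with a markdown library is left as an exercise):
import html

# displayHTML is available in Fabric notebooks; escape the text so markdown
# characters aren't interpreted as HTML.
displayHTML(f"<pre style='white-space: pre-wrap'>{html.escape(diagnosis)}</pre>")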
Part 5: What the Job Doctor’s Output Looks Like in Fabric
A good Job Doctor output for Fabric Spark might look like this (simplified):
🔎 Issue 1: Skewed Stage 4 (skew ratio 12.3)
What I see
Stage 4 has a skew ratio of 12.3 (max task runtime vs. p95).
This stage also reads ~18.2 GB via shuffle, which amplifies the imbalance.
Likely root cause
A join or aggregation keyed on a column where a few values dominate (e.g. a “default” ID, nulls, or a small set of hot keys). One partition ends up doing far more work than the others.
Fabric-specific fixes
In your notebook or job settings, enable Adaptive Query Execution and skew join handling:
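For example, in a notebook cell (these are standard Spark settings; recent runtimes typically enable AQE by default, so this mostly makes the intent explicit):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")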
If the query is in SQL (Lakehouse SQL endpoint), enable AQE at the session/job level through Spark configuration.
If one side of the join is a small dimension table, add a broadcast hint:
SELECT /*+ BROADCAST(dim) */ f.*
FROM fact f
JOIN dim
ON f.key = dim.key;
Estimated impact: 30–50% reduction in total job runtime, depending on how skewed the key distribution is.
📦 Issue 2: Large Shuffle in Stage 2 (~19.7 GB)
What I see
Stage 2 reads ~19.7 GB via shuffle.
Shuffle partitions are set to 200 (Spark default).
Likely root cause
A join or aggregation is shuffling nearly the full dataset, but parallelism is low given the data volume. That leads to heavy tasks and increased risk of spill.
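Fabric-specific fixes
In the notebook, raise the shuffle partition count for this workload. The right value depends on data volume and available cores, so the 800 below is only an illustrative number:
spark.conf.set("spark.sql.shuffle.partitions", "800")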
For pipelines, set this at the Spark activity level under Spark configuration, or through your Fabric environment’s resource profile if you want a new default.
Also consider partitioning by the join key earlier in the pipeline:
df = df.repartition("customer_id")
Estimated impact: More stable runtimes and reduced likelihood of spill; wall-clock improvements if your underlying capacity has enough cores.
💾 Issue 3: Spill to Disk (~1.8 GB) in Stage 3
What I see
Stage 3 spills ~1.8 GB to disk.
This correlates with under-parallelism or memory pressure.
Fabric-specific fixes
Adjust cluster sizing via Fabric capacity / resource profiles (enough cores + memory per core).
Increase spark.sql.shuffle.partitions as above.
Avoid wide transformations producing huge intermediate rows early in the job; materialize smaller, more selective intermediates first.
You can persist the diagnosis text into a table:
from pyspark.sql import Row
spark.createDataFrame(
[Row(applicationId=app_id, diagnosis_markdown=diagnosis)]
).write.mode("append").saveAsTable("job_doctor.diagnoses")
Then you can build a Power BI report in Fabric bound to:
job_doctor.diagnoses
job_doctor.stage_metrics
job_doctor.stage_issues
to create a “Spark Job Health” dashboard where:
Rows = recent Spark runs
Columns = severity, duration, shuffle size, spill, etc.
A click opens the AI-generated diagnosis for that run
All inside the same workspace.
Part 6: Stitching It All Together in Fabric
Let’s recap the full Fabric-native architecture.
1. Telemetry Ingestion (Environment / Emitter)
Configure a Fabric environment for your Spark workloads.
Add a Fabric Apache Spark diagnostic emitter to send logs/metrics to:
Azure Storage (for Lakehouse shortcuts), or
Log Analytics / Event Hubs if you prefer KQL or streaming paths.
(Optional) From notebooks/pipelines, capture:
Spark configs → job_doctor.spark_conf
Query plans → job_doctor.query_plans
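Capturing the configuration, for example, takes only a couple of lines from any running session. A minimal sketch:
from pyspark.sql import Row

app_id = spark.sparkContext.applicationId
conf_rows = [Row(applicationId=app_id, key=k, value=v)
             for k, v in spark.sparkContext.getConf().getAll()]

spark.createDataFrame(conf_rows).write.mode("append").saveAsTable("job_doctor.spark_conf")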
2. Normalization Job (Spark / Data Pipeline)
Read raw diagnostics from Storage via a Lakehouse shortcut.
Parse and flatten the records into per-task metrics.
3. Diagnosis Job (LLM)
For each new (or most expensive / slowest) application:
Pull stage metrics, issues, configs, and query plans from Lakehouse.
Construct a structured prompt.
Call your Azure AI / Azure OpenAI endpoint from a Fabric Spark notebook.
Store the markdown diagnosis in job_doctor.diagnoses.
4. User Experience
Fabric Notebook
A “Run Job Doctor” cell or button that takes applicationId, calls the model, and displays the markdown inline.
Data Pipeline / Job
Scheduled daily to scan all runs from yesterday and generate diagnoses automatically.
Power BI Report in Fabric
“Spark Job Health” dashboard showing:
Top slowest/most expensive jobs
Detected issues (skew, large shuffle, spill, config problems)
AI recommendations, side-by-side with raw metrics
Everything lives in one Fabric workspace, using:
Lakehouses for data
Spark notebooks / pipelines for processing
Azure AI models for reasoning
Power BI for visualization
Why a Fabric-Specific Job Doctor Is Worth Building
Spark is Spark, but in Fabric the story is different:
Spark jobs are tied closely to Lakehouses, Pipelines, Dataflows, and Power BI.
You already have a single control plane for capacity, governance, cost, and monitoring.
Logs, metrics, and reports can live right next to the workloads they describe.
That makes Fabric an ideal home for a Job Doctor:
No extra infrastructure to stand up
No random side services to glue together
The telemetry you need is already flowing; you just have to catch and shape it
AI can sit directly on top of your Lakehouse + monitoring data
With some Spark, a few Lakehouse tables, and an LLM, you can give every data engineer and analyst in your organization a “Spark performance expert” that’s always on call.
I’ve included a sample notebook you can use to get started on your Job Doctor today!
This post was created with help from (and was suggested to me by) ChatGPT Pro using the 5.1 Thinking model.
Microsoft Fabric notebooks are a versatile tool for developing Apache Spark jobs and machine learning experiments. They provide a web-based interactive surface for writing code with rich visualizations and Markdown text support.
In this blog post, we’ll walk through how to call the OpenAI API from a Microsoft Fabric notebook.
Preparing the Notebook
Start by creating a new notebook in Microsoft Fabric. Notebooks in Fabric consist of cells, which are individual blocks of code or text that can be run independently or as a group. You can add a new cell by hovering over the space between two cells and selecting ‘Code’ or ‘Markdown’.
Microsoft Fabric notebooks support four Apache Spark languages: PySpark (Python), Spark (Scala), Spark SQL, and SparkR. For this guide, we’ll use PySpark (Python) as the primary language.
You can specify the language for each cell using magic commands. For example, you can write a PySpark query using the %%pyspark magic command in a Scala notebook. But since our primary language is PySpark, we won’t need a magic command for Python cells.
Microsoft Fabric notebooks are integrated with the Monaco editor, which provides IDE-style IntelliSense for code editing, including syntax highlighting, error marking, and automatic code completions.
Calling the OpenAI API
To call the OpenAI API, we’ll first need to install the OpenAI Python client in our notebook. Add a new cell to your notebook and run the following command (pinning a pre-1.0 version, since the legacy Completion API used below was removed in openai 1.0):
!pip install openai==0.28
Next, in a new cell, write the Python code to call the OpenAI API:
import openai
openai.api_key = 'your-api-key'
# Hypothetical example input for the translation prompt
text_to_translate = "Hello, how are you today?"
response = openai.Completion.create(
    engine="text-davinci-002",
    prompt=f"Translate the following English text to French: '{text_to_translate}'",
    max_tokens=60
)
print(response.choices[0].text.strip())
Replace 'your-api-key' with your actual OpenAI API key and text_to_translate with the text you want translated. The prompt parameter is the instruction you send to the model, and max_tokens caps the length of the generated completion.
You can run the code in a cell by hovering over the cell and selecting the ‘Run Cell’ button or by pressing Ctrl+Enter. You can also run all cells in sequence by selecting the ‘Run All’ button.
Wrapping Up
That’s it! You’ve now called the OpenAI API from a Microsoft Fabric notebook. You can use this method to leverage the powerful AI models of OpenAI in your data science and machine learning experiments.
Always remember that if a cell is running for a longer time than expected, or you wish to stop execution for any reason, you can select the ‘Cancel All’ button to cancel the running cells or cells waiting in the queue.
I hope this guide has been helpful. Happy coding!
Please note that OpenAI’s usage policies apply when using their API. Be sure to understand these policies before using the API in your projects. Also, keep in mind that OpenAI’s API is a paid service, so remember to manage your usage to control costs.
Finally, it’s essential to keep your API key secure. Do not share it publicly or commit it in your code repositories. If you suspect that your API key has been compromised, generate a new one through the OpenAI platform.
This blogpost was created with help from ChatGPT Pro
Microsoft Fabric is a powerful tool for data engineers, enabling them to build out a lakehouse architecture for their organizational data. In this blog post, we will walk through the key experiences that Microsoft Fabric offers data engineers, using a marketing-data scenario as the running example.
Creating a Lakehouse
A lakehouse is a new experience that combines the power of a data lake and a data warehouse. It serves as a central repository for all Fabric data. To create a lakehouse, you start by creating a new lakehouse artifact and giving it a name. Once created, you land in the empty Lakehouse Explorer.
Importing Data into the Lakehouse
There are several ways to bring data into the lakehouse. You can upload files and folders from your local machine, use data flows (a low-code tool with hundreds of connectors), or leverage the pipeline copy activity to bring in petabytes of data at scale. Most of the marketing data in the lakehouse is in Delta tables, which are automatically created with no additional effort. You can easily explore the tables, see their schema, and even view the underlying files.
Adding Unstructured Data
In addition to structured data, you might want to add some unstructured customer reviews to accompany your campaign data. If this data already exists in storage, you can simply point to it with no data movement necessary. This is done by adding a new shortcut, which allows you to create a virtual table and virtual files inside your lakehouse. Shortcuts enable you to select from a variety of sources, including lakehouses and warehouses in Fabric, but also external storage like ADLS Gen 2 and even Amazon S3.
Leveraging the Data
Once all your data is ready in the lakehouse, there are many ways to use it. As a data engineer or data scientist, you can open up the lakehouse in a notebook and leverage Spark to continue transforming the data or build a machine learning model. As a SQL professional, you can navigate to the SQL endpoint of the lakehouse where you can write SQL queries, create views and functions, all on top of the same Delta tables. As a business analyst, you can navigate to the built-in modeling view and start developing your BI data model directly in the same warehouse experience.
Configuring your Spark Environment
As an administrator, you can configure the Spark environment for your data engineers. This is done in the capacity admin portal, where you can access the Spark compute settings for data engineers and data scientists. You can set a default runtime and default Spark properties, and also turn on the ability for workspace admins to configure their own custom Spark pools.
Collaborative Data Development
Microsoft Fabric also provides a rich developer experience, enabling users to collaborate easily, work with their lakehouse data, and leverage the power of Spark. You can view your colleagues’ code updates in real time, install ML libraries for your project, and use the built-in charting capabilities to explore your data. The notebook has a built-in resource folder which makes it easy to store scripts or other code files you might need for the project.
In conclusion, Microsoft Fabric provides a frictionless experience for data engineers building out their enterprise data lakehouse and can easily democratize this data for all users in an organization. It’s a powerful tool that combines the power of a data lake and a data warehouse, providing a comprehensive solution for data engineering tasks.
This blogpost was created with help from ChatGPT Pro
Spark Compute is a key component of Microsoft Fabric, the end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Spark Compute enables data engineering and data science scenarios on a fully managed Spark compute platform that delivers unparalleled speed and efficiency.
What is Spark Compute?
Spark Compute is a way of telling Spark what kind of resources you need for your data analysis tasks. You can give your Spark pool a name, and choose how many and how big the nodes (the machines that do the work) are. You can also tell Spark how to adjust the number of nodes depending on how much work you have.
Spark Compute operates on OneLake, the data lake service that powers Microsoft Fabric. OneLake provides a single place to store and access all your data, whether it is structured, semi-structured, or unstructured. OneLake also supports data from other sources, such as Amazon S3 and (soon) Google Cloud Platform.
Spark Compute supports both batch and streaming scenarios, and integrates with various tools and frameworks, such as Azure OpenAI Service, Azure Machine Learning, Databricks, Delta Lake, and more. You can use Spark Compute to perform data ingestion, transformation, exploration, analysis, machine learning, and AI tasks on your data.
How to use Spark Compute?
There are two ways to use Spark Compute in Microsoft Fabric: starter pools and custom pools.
Starter pools
Starter pools are a fast and easy way to get a Spark session on the Microsoft Fabric platform within seconds. You can use Spark sessions right away, instead of waiting for Spark to set up the nodes for you. This helps you do more with data and get insights quicker.
Starter pools have Spark clusters that are always on and ready for your requests. They use medium nodes that dynamically scale up based on your Spark job needs. Starter pools also have default settings that let you install libraries quickly without slowing down the session start time.
You only pay for starter pools when you are using Spark sessions to run queries. You don’t pay for the time when Spark is keeping the nodes ready for you.
Custom pools
A custom pool is a way of creating a tailored Spark pool according to your specific data engineering and data science requirements. You can customize various aspects of your custom pool, such as:
Node size: You can choose from different node sizes that offer different combinations of CPU cores, memory, and storage.
Node count: You can specify the minimum and maximum number of nodes you want in your custom pool.
Autoscale: You can enable autoscale to let Spark automatically adjust the number of nodes based on the workload demand.
Dynamic allocation: You can enable dynamic allocation to let Spark dynamically allocate executors (the processes that run tasks) based on the workload demand.
Libraries: You can install libraries from various sources, such as Maven, PyPI, CRAN, or your workspace.
Properties: You can configure custom properties for your custom pool, such as spark.executor.memory or spark.sql.shuffle.partitions.
Creating a custom pool is free; you only pay when you run a Spark job on the pool. If you don’t use your custom pool for 2 minutes after your job is done, Spark will automatically delete it. This is called the “time to live” property, and you can change it if you want.
If you are a workspace admin, you can also create default custom pools for your workspace, and make them the default option for other users. This way, you can save time and avoid setting up a new custom pool every time you run a notebook or a Spark job.
Custom pools take about 3 minutes to start, because Spark has to get the nodes from Azure.
Conclusion
Spark Compute is a powerful and flexible way of using Spark on Microsoft Fabric. It enables you to perform various data engineering and data science tasks on your data stored in OneLake or other sources. It also offers different options for creating and managing your Spark pools according to your needs and preferences.
If you want to learn more about Spark Compute in Microsoft Fabric, check out these resources:
Have questions about Microsoft Fabric? Here’s a quick FAQ to help you out:
Q: What is Microsoft Fabric? A: Microsoft Fabric is an end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Fabric integrates technologies like Azure Data Factory, Azure Synapse Analytics, and Power BI into a single unified product, empowering data and business professionals alike to unlock the potential of their data and lay the foundation for the era of AI.
Q: What are the benefits of using Microsoft Fabric? A: Some of the benefits of using Microsoft Fabric are:
It simplifies analytics by providing a single product with a unified experience and architecture that provides all the capabilities required for a developer to extract insights from data and present it to the business user.
It enables faster innovation by helping every person in your organization act on insights from within Microsoft 365 apps, such as Microsoft Excel and Microsoft Teams.
It reduces costs by eliminating data sprawl and creating custom views for everyone.
It supports open and scalable solutions that give data stewards additional control with built-in security, governance, and compliance.
It accelerates analysis by developing AI models on a single foundation without data movement, reducing the time data scientists need to deliver value.
Q: How can I get started with Microsoft Fabric? A: You can get started with Microsoft Fabric by signing up for a free trial here: https://www.microsoft.com/microsoft-fabric/try-for-free. You will get a fixed Fabric trial capacity for each business user, which may be used for any feature or capability.
Q: What are the main components of Microsoft Fabric? A: The main components of Microsoft Fabric are:
Unified data foundation: A data lake-centric hub that helps data engineers connect and curate data from different sources—eliminating sprawl and creating custom views for everyone.
Role-tailored tools: A set of tools that cater to different roles in the analytics process, such as data engineering, data warehousing, data science, real-time analytics, and business intelligence.
AI-powered capabilities: A set of capabilities that leverage generative AI and language model services, such as Azure OpenAI Service, to enable customers to use and create everyday AI experiences that are reinventing how employees spend their time.
Open, governed foundation: A foundation that supports open standards and formats, such as Apache Spark, SQL, Python, R, and Parquet, and provides robust data security, governance, and compliance features.
Cost management: A feature that helps customers optimize their spending on Fabric by providing visibility into their usage and costs across different services and resources.
Q: How does Microsoft Fabric integrate with other Microsoft products? A: Microsoft Fabric integrates seamlessly with other Microsoft products, such as:
Microsoft 365: Users can access insights from Fabric within Microsoft 365 apps, such as Excel and Teams, using natural language queries or pre-built templates.
Azure OpenAI Service: Users can leverage generative AI and language model services from Azure OpenAI Service to create everyday AI experiences within Fabric.
Azure Data Explorer: Users can ingest, store, analyze, and visualize massive amounts of streaming data from various sources using Azure Data Explorer within Fabric.
Azure IoT Hub: Users can connect millions of devices and stream real-time data to Fabric using Azure IoT Hub.
Q: How does Microsoft Fabric compare with other analytics platforms? A: Microsoft Fabric differs from other analytics platforms in several ways:
It is an end-to-end analytics product that addresses every aspect of an organization’s analytics needs with a single product and a unified experience.
It is a SaaS product that is automatically integrated and optimized, and users can sign up within seconds and get real business value within minutes.
It is an AI-powered platform that leverages generative AI and language model services to enable customers to use and create everyday AI experiences.
It is an open and scalable platform that supports open standards and formats, and provides robust data security, governance, and compliance features.
Q: Who are the target users of Microsoft Fabric? A: Microsoft Fabric is designed for enterprises that want to transform their data into a competitive advantage. It caters to different roles in the analytics process, such as:
Data engineers: They can use Fabric to connect and curate data from different sources, create custom views for everyone, and manage powerful AI models without data movement.
Data warehousing professionals: They can use Fabric to build scalable data warehouses using SQL or Apache Spark, perform complex queries across structured and unstructured data sources, and optimize performance using intelligent caching.
Data scientists: They can use Fabric to develop AI models using Python or R on a single foundation without data movement, leverage generative AI and language model services from Azure OpenAI Service, and deploy models as web services or APIs.
Data analysts: They can use Fabric to explore and analyze data using SQL or Apache Spark notebooks or Power BI Desktop within Fabric, create rich visualizations using Power BI Embedded within Fabric or Power BI Online outside of Fabric.
Business users: They can use Fabric to access insights from within Microsoft 365 apps using natural language queries or pre-built templates, or use Power BI Online outside of Fabric to consume reports or dashboards created by analysts.
Data science is the process of extracting insights from data using various methods and techniques, such as statistics, machine learning, and artificial intelligence. Data science can help organizations solve complex problems, optimize processes, and create new opportunities.
However, data science is not an easy task. It involves multiple steps and challenges, such as:
Finding and accessing relevant data sources
Exploring and understanding the data
Cleaning and transforming the data
Experimenting and building machine learning models
Deploying and operationalizing the models
Communicating and presenting the results
To perform these steps effectively, data scientists need a powerful and flexible platform that can support their end-to-end workflow and enable them to collaborate with other roles, such as data engineers, analysts, and business users.
This is where Microsoft Fabric comes in.
Microsoft Fabric is an end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Fabric integrates technologies like Azure Data Factory, Azure Synapse Analytics, and Power BI into a single unified product, empowering data and business professionals alike to unlock the potential of their data and lay the foundation for the era of AI.
In this blogpost, I will focus on how Microsoft Fabric offers a rich and comprehensive Data Science experience that can help data scientists complete their tasks faster and more easily.
The Data Science experience in Microsoft Fabric
The Data Science experience in Microsoft Fabric consists of multiple native-built features that enable collaboration, data acquisition, sharing, and consumption in a seamless way. In this section, I will describe some of these features and how they can help data scientists in each step of their workflow.
Data discovery and pre-processing
The first step in any data science project is to find and access relevant data sources. Microsoft Fabric users can interact with data in OneLake using the Lakehouse item. A Lakehouse attaches easily to a notebook so you can browse and interact with the data, and you can read data from a Lakehouse directly into a pandas DataFrame.
For exploration, this makes seamless data reads from OneLake possible. A powerful set of tools is available for data ingestion and orchestration through data integration pipelines, a natively integrated part of Microsoft Fabric. Easy-to-build pipelines can access the data and transform it into a format that machine learning can consume.
An important part of the machine learning process is to understand data through exploration and visualization. Depending on the data storage location, Microsoft Fabric offers a set of different tools to explore and prepare the data for analytics and machine learning.
For example, users can use SQL or Apache Spark notebooks to query and analyze data using familiar languages like SQL, Python, R, or Scala. They can also use Data Wrangler to perform common data cleansing and transformation tasks using a graphical interface.
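As a small example, with a Lakehouse attached to the notebook, pulling a table into pandas for exploration is a one-liner (the table name below is a placeholder):
# "sales" is a placeholder table name in the attached Lakehouse.
pdf = spark.read.table("sales").limit(10_000).toPandas()
pdf.describe()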
Experimentation and modeling
The next step in the data science workflow is to experiment with different algorithms and techniques to build machine learning models that can address the problem at hand. Microsoft Fabric supports various ways to develop and train machine learning models using Python or R on a single foundation without data movement.
For example, users can use the Azure Machine Learning SDK within notebooks to access features such as automated machine learning, hyperparameter tuning, model explainability, and model management. They can also leverage generative AI and language model services from Azure OpenAI Service to create everyday AI experiences within Fabric.
Microsoft Fabric also provides an Experimentation item that allows users to create experiments that track various metrics and outputs of their machine learning runs. Users can compare different runs within an experiment or across experiments using interactive charts and tables.
Enrichment and operationalization
The final step in the data science workflow is to deploy and operationalize the machine learning models so that they can be consumed by other applications or users. Microsoft Fabric makes this step easy by providing various options to deploy models as web services or APIs.
For example, one option is to use the Azure Machine Learning SDK within notebooks to register models in an Azure Machine Learning workspace and deploy them as web services on Azure Container Instances or Azure Kubernetes Service.
Insights and communication
The ultimate goal of any data science project is to communicate and present the results and insights to stakeholders or customers. Microsoft Fabric enables this by integrating with Power BI, the leading business intelligence tool from Microsoft.
Users can create rich visualizations using Power BI Embedded within Fabric or Power BI Online outside of Fabric. They can also consume reports or dashboards created by analysts using Power BI Online outside of Fabric. Moreover, they can access insights from Fabric within Microsoft 365 apps using natural language queries or pre-built templates.
Conclusion
In this blogpost, I have shown how Microsoft Fabric offers a comprehensive Data Science experience that can help data scientists complete their end-to-end workflow faster and more easily. Microsoft Fabric is an end-to-end analytics product that addresses every aspect of an organization’s analytics needs with a single product and a unified experience. It is an AI-powered platform that leverages generative AI and language model services to enable customers to use and create everyday AI experiences, and it is an open, scalable platform that supports open standards and formats while providing robust data security, governance, and compliance features.
In the world of data analytics, the choice between a data warehouse and a lakehouse can be a critical decision. Both have their strengths and are suited to different types of workloads. Microsoft Fabric, a comprehensive analytics solution, offers both options. This blog post will help you understand the differences between a lakehouse and a warehouse in Microsoft Fabric and guide you in making the right choice for your needs.
What is a Lakehouse in Microsoft Fabric?
A lakehouse in Microsoft Fabric is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location. It is a flexible and scalable solution that allows organizations to handle large volumes of data using a variety of tools and frameworks to process and analyze that data. It integrates with other data management and analytics tools to provide a comprehensive solution for data engineering and analytics.
The Lakehouse creates a serving layer by auto-generating an SQL endpoint and a default dataset during creation. This new see-through functionality allows users to work directly on top of the delta tables in the lake to provide a frictionless and performant experience all the way from data ingestion to reporting.
An important distinction from a full warehouse is that this default SQL endpoint is a read-only experience and doesn’t support the full T-SQL surface area of a transactional data warehouse. It is also important to note that only tables in Delta format are available in the SQL endpoint.
Lakehouse vs Warehouse: A Decision Guide
When deciding between a lakehouse and a warehouse in Microsoft Fabric, there are several factors to consider:
Data Volume: Both lakehouses and warehouses can handle unlimited data volumes.
Type of Data: Lakehouses can handle unstructured, semi-structured, and structured data, while warehouses are best suited to structured data.
Developer Persona: Lakehouses are best suited to data engineers and data scientists, while warehouses are more suited to data warehouse developers and SQL engineers.
Developer Skill Set: Lakehouses require knowledge of Spark (Scala, PySpark, Spark SQL, R), while warehouses primarily require SQL skills.
Data Organization: Lakehouses organize data by folders and files, databases and tables, while warehouses use databases, schemas, and tables.
Read Operations: Both lakehouses and warehouses support Spark and T-SQL read operations.
Write Operations: Lakehouses use Spark (Scala, PySpark, Spark SQL, R) for write operations, while warehouses use T-SQL.
Conclusion
The choice between a lakehouse and a warehouse in Microsoft Fabric depends on your specific needs and circumstances. If you’re dealing with large volumes of unstructured or semi-structured data and have developers skilled in Spark, a lakehouse may be the best choice. On the other hand, if you’re primarily dealing with structured data and your developers are more comfortable with SQL, a warehouse might be more suitable.
Remember, with the flexibility offered by Fabric, you can implement either lakehouse or data warehouse architectures, or combine the two to get the best of both with a simple implementation.
This blogpost was created with help from ChatGPT Pro
Today at Microsoft Build 2023, a new era in data analytics was ushered in with the announcement of Microsoft Fabric, a powerful unified platform designed to handle all analytics workloads in the cloud. The event marked a significant evolution in Microsoft’s analytics solutions, with Fabric promising a range of features that will undoubtedly transform the way enterprises approach data analytics.
Unifying Capacities: A Groundbreaking Approach
One of the standout features of Microsoft Fabric is the unified capacity model it brings to data analytics. Traditional analytics systems, which often combine products from multiple vendors, suffer from significant wastage due to the inability to utilize idle computing capacity across different systems. Fabric addresses this issue head-on by allowing customers to purchase a single pool of computing power that can fuel all Fabric workloads.
By significantly reducing costs and simplifying resource management, Fabric enables businesses to create solutions that leverage all workloads freely. This all-inclusive approach minimizes friction in the user experience, ensuring that any unused compute capacity in one workload can be utilized by any other, thereby maximizing efficiency and cost-effectiveness.
Early Adoption: Industry Leaders Share Their Experiences
Many industry leaders are already leveraging Microsoft Fabric to streamline their analytics workflows. Plumbing, HVAC, and waterworks supplies distributor Ferguson, for instance, hopes to reduce their delivery time and improve efficiency by using Fabric to consolidate their analytics stack into a unified solution.
Similarly, T-Mobile, a leading provider of wireless communications services in the United States, is looking to Fabric to take their platform and data-driven decision-making to the next level. The ability to query across the lakehouse and warehouse from a single engine, along with the improved speed of Spark compute, are among the Fabric features T-Mobile anticipates will significantly enhance their operations.
Professional services provider Aon also sees significant potential in Fabric, particularly in terms of simplifying their existing analytics stack. By reducing the time spent on building infrastructure, Aon expects to dedicate more resources to adding value to their business.
Integrating Existing Microsoft Solutions
Existing Microsoft analytics solutions such as Azure Synapse Analytics, Azure Data Factory, and Azure Data Explorer will continue to provide a robust, enterprise-grade platform as a service (PaaS) solution for data analytics. However, Fabric represents an evolution of these offerings into a simplified Software as a Service (SaaS) solution that can connect to existing PaaS offerings. Customers will be able to upgrade from their current products to Fabric at their own pace, ensuring a smooth transition to the new system.
Getting Started with Microsoft Fabric
Microsoft Fabric is currently in preview, but you can try out everything it has to offer by signing up for the free trial. No credit card information is required, and everyone who signs up gets a fixed Fabric trial capacity, which can be used for any feature or capability, from integrating data to creating machine learning models. Existing Power BI Premium customers can simply turn on Fabric through the Power BI admin portal. After July 1, 2023, Fabric will be enabled for all Power BI tenants.
There are several resources available for those interested in learning more about Microsoft Fabric, including the Microsoft Fabric website, in-depth Fabric experience announcement blogs, technical documentation, a free e-book on getting started with Fabric, and a guided tour. You can also join the Fabric community to post your questions, share your feedback, and learn from others.
Conclusion
The announcement of Microsoft Fabric at Microsoft Build 2023 marks a pivotal moment in data analytics. By unifying capacities, reducing costs, and simplifying the overall analytics process, Fabric is set to revolutionize the way businesses handle their analytics workloads. As more and more businesses embrace this innovative platform, it will be exciting to see the transformative impact of Microsoft Fabric unfold in the world of data analytics.
This blogpost was created with help from ChatGPT Pro and the new web browser plug-in.
Azure Synapse Analytics is an integrated analytics service that brings together big data and data warehousing. It offers an effective way to ingest, process, and analyze massive amounts of structured and unstructured data. One of the core components of Azure Synapse Analytics is the Spark engine, which enables distributed data processing at scale. In this blog post, we will delve into the best practices for managing and monitoring Spark workloads in Azure Synapse Analytics.
Properly configure Spark clusters:
Azure Synapse Analytics offers managed Spark clusters that can be configured based on workload requirements. To optimize performance, ensure you:
Choose the right VM size for your Spark cluster, considering factors like CPU, memory, and storage.
Configure the number of nodes in the cluster based on the scale of your workload.
Use auto-pause and auto-scale features to optimize resource usage and reduce costs.
Optimize data partitioning:
Data partitioning is crucial for efficiently distributing data across Spark tasks. To optimize partitioning:
Choose an appropriate partitioning key, based on data distribution and query patterns.
Avoid data skew by ensuring that partitions are evenly sized.
Use adaptive query execution to enable dynamic partitioning adjustments during query execution.
Leverage caching:
Caching is an effective strategy for optimizing iterative or repeated Spark workloads. To leverage caching:
Cache intermediate datasets to avoid recomputing expensive transformations.
Use the ‘unpersist()’ method to free memory when cached data is no longer needed.
Monitor cache usage and adjust the storage level as needed.
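As a quick illustration of the caching pattern (the table name and filter are placeholders):
# Cache an intermediate dataset that several downstream queries reuse.
emea = spark.read.table("sales").filter("region = 'EMEA'")   # placeholder table/filter
emea.cache()
emea.count()          # materializes the cache

# ... run the iterative queries against `emea` ...

emea.unpersist()      # free the memory once the cached data is no longer needed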
Monitor Spark workloads:
Azure Synapse Analytics provides various monitoring tools to track Spark workload performance:
Use Synapse Studio for real-time monitoring and visualization of Spark job execution.
Leverage Azure Monitor for gathering metrics and setting up alerts.
Analyze Spark application logs for insights into potential performance bottlenecks.
Optimize Spark SQL:
To optimize Spark SQL performance:
Use the ‘EXPLAIN’ command to understand query execution plans and identify potential optimizations.
Leverage Spark’s built-in cost-based optimizer (CBO) to improve query execution.
Use data partitioning and bucketing techniques to reduce data shuffling.
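For instance, to inspect a query plan before tuning (table and columns are placeholders):
spark.sql("""
    EXPLAIN FORMATTED
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""").show(truncate=False)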
Use Delta Lake for reliable data storage:
Delta Lake is an open-source storage layer that brings ACID transactions and scalable metadata handling to Spark. Using Delta Lake can help:
Improve data reliability and consistency with transactional operations.
Enhance query performance by leveraging Delta Lake’s optimized file layout and indexing capabilities.
Simplify data management with features like schema evolution and time-travel queries.
Optimize data ingestion:
To optimize data ingestion in Azure Synapse Analytics:
Use Azure Data Factory or Azure Logic Apps for orchestrating and automating data ingestion pipelines.
Leverage PolyBase for efficient data loading from external sources into Synapse Analytics.
Use the COPY statement to efficiently ingest large volumes of data.
Conclusion:
Managing and monitoring Spark workloads in Azure Synapse Analytics is essential for ensuring optimal performance and resource utilization. By following the best practices outlined in this blog post, you can optimize your Spark applications and extract valuable insights from your data.
This blogpost was created with help from ChatGPT Pro.