How Spark Compute Works in Microsoft Fabric

Spark Compute is a key component of Microsoft Fabric, the end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Spark Compute enables data engineering and data science scenarios on a fully managed Spark compute platform that delivers unparalleled speed and efficiency.

What is Spark Compute?

Spark Compute is a way of telling Spark what kind of resources you need for your data analysis tasks. You can give your Spark pool a name, choose how many nodes (the machines that do the work) it has, and pick how big those nodes should be. You can also tell Spark how to adjust the number of nodes depending on how much work you have.

Spark Compute operates on OneLake, the data lake service that powers Microsoft Fabric. OneLake provides a single place to store and access all your data, whether it is structured, semi-structured, or unstructured. OneLake also supports data from other sources, such as Amazon S3 and (soon) Google Cloud Platform.
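
To make this concrete, here is a minimal PySpark sketch of reading a file from OneLake in a Fabric notebook. The workspace name, lakehouse name, and file path are hypothetical placeholders, and the exact URI format may differ for your tenant; Fabric notebooks provide a ready-made SparkSession called spark.

```python
# A minimal sketch: read a CSV file from OneLake in a Fabric notebook.
# "MyWorkspace", "MyLakehouse", and the file path are placeholders.
# Fabric notebooks provide a ready-made SparkSession called `spark`.

onelake_path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "MyLakehouse.Lakehouse/Files/raw/sales.csv"
)

df = (
    spark.read
    .option("header", "true")       # first row holds column names
    .option("inferSchema", "true")  # let Spark guess column types
    .csv(onelake_path)
)

df.show(5)  # preview the first five rows
```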

Spark Compute supports both batch and streaming scenarios, and integrates with various tools and frameworks, such as Azure OpenAI Service, Azure Machine Learning, Databricks, Delta Lake, and more. You can use Spark Compute to perform data ingestion, transformation, exploration, analysis, machine learning, and AI tasks on your data.
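
As an illustration of the batch and streaming sides, here is a hedged sketch using a Delta table; the table names and checkpoint path are made up for the example.

```python
# A sketch of batch and streaming reads on a Delta table.
# "sales", "sales_cleaned", and "sales_live" are hypothetical table names
# in the notebook's default lakehouse.
from pyspark.sql import functions as F

# Batch: read, transform, and write back as a new Delta table.
batch_df = spark.read.table("sales")
cleaned = (
    batch_df
    .filter(F.col("amount") > 0)                      # drop invalid rows
    .withColumn("order_date", F.to_date("order_ts"))  # derive a date column
)
cleaned.write.mode("overwrite").saveAsTable("sales_cleaned")

# Streaming: treat the same Delta table as a streaming source and
# continuously append new rows to a target table.
query = (
    spark.readStream.table("sales")
    .filter(F.col("amount") > 0)
    .writeStream
    .option("checkpointLocation", "Files/checkpoints/sales")  # placeholder path
    .toTable("sales_live")
)
```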

How do you use Spark Compute?

There are two ways to use Spark Compute in Microsoft Fabric: starter pools and custom pools.

Starter pools

Starter pools let you use Spark on the Microsoft Fabric platform within seconds. You get a Spark session right away, instead of waiting for Spark to set up the nodes for you. This helps you do more with your data and get insights faster.

Starter pools have Spark clusters that are always on and ready for your requests. They use medium-sized nodes that scale up dynamically based on the needs of your Spark job. Starter pools also come with default settings that let you install libraries quickly without slowing down the session start time.
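
For example, you can pull in a session-scoped library from PyPI directly in a notebook cell; the package name here is just an illustration.

```python
# Install a session-scoped library in a Fabric notebook cell.
# The %pip magic makes the package available to the current session
# without changing the pool itself; the package choice is illustrative.
%pip install great-expectations
```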

You only pay for starter pools when you are using Spark sessions to run queries. You don’t pay for the time when Spark is keeping the nodes ready for you.

Custom pools

A custom pool is a way of creating a tailored Spark pool according to your specific data engineering and data science requirements. You can customize various aspects of your custom pool, such as:

  • Node size: You can choose from different node sizes that offer different combinations of CPU cores, memory, and storage.
  • Node count: You can specify the minimum and maximum number of nodes you want in your custom pool.
  • Autoscale: You can enable autoscale to let Spark automatically adjust the number of nodes based on the workload demand.
  • Dynamic allocation: You can enable dynamic allocation to let Spark dynamically allocate executors (the processes that run tasks) based on the workload demand.
  • Libraries: You can install libraries from various sources, such as Maven, PyPI, CRAN, or your workspace.
  • Properties: You can configure custom properties for your custom pool, such as spark.executor.memory or spark.sql.shuffle.partitions (see the sketch after this list).
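
As a small illustration of the last bullet, here is a sketch of changing a runtime property from a notebook. Note the distinction: executor sizing settings such as spark.executor.memory generally have to be set on the pool (or when the session starts), while runtime settings such as spark.sql.shuffle.partitions can be changed on the fly.

```python
# A minimal sketch: adjust a runtime Spark property from a notebook.
# spark.sql.shuffle.partitions can be changed in a running session;
# executor sizing (e.g. spark.executor.memory) must be set before the
# session starts, for example through the pool's properties.

spark.conf.set("spark.sql.shuffle.partitions", "64")  # fewer partitions for small data

# Read the value back to confirm the change took effect.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```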

Creating a custom pool is free; you only pay when you run a Spark job on the pool. If your custom pool sits idle for 2 minutes after your job is done, Spark will automatically delete it. This is called the “time to live” property, and you can change it if you want.

If you are a workspace admin, you can also create custom pools for your workspace and make one of them the default option for other users. This way, you save time and avoid setting up a new custom pool every time you run a notebook or a Spark job.

Custom pools take about 3 minutes to start, because Spark has to provision the nodes from Azure.

Conclusion

Spark Compute is a powerful and flexible way of using Spark on Microsoft Fabric. It enables you to perform various data engineering and data science tasks on your data stored in OneLake or other sources. It also offers different options for creating and managing your Spark pools according to your needs and preferences.

If you want to learn more about Spark Compute in Microsoft Fabric, check out the official Microsoft Fabric documentation.

This blog post was created with help from ChatGPT Pro and Bing.