Microsoft recently introduced OneLake, a part of Microsoft Fabric designed to accelerate data potential for the era of AI. OneLake provides a unified, intelligent data foundation for all analytic workloads, integrating Power BI, Data Factory, and the next generation of Synapse. This gives customers a high-performing, easy-to-manage modern analytics solution.
OneLake: The OneDrive for All Your Data
OneLake provides a single data lake for your entire organization. Every Fabric tenant has exactly one OneLake: never two, never zero. There is no infrastructure to set up or manage. The concept of a tenant is a unique benefit of a SaaS service: it allows Microsoft to automatically provide a single management and governance boundary for the entire organization, ultimately under the control of a tenant admin.
Breaking down Data Silos with OneLake
OneLake aims to provide a data lake as a service, without you needing to build it yourself. Workspaces let different business groups work independently, without going through a central gatekeeper, while still contributing to the same data lake. Each workspace can have its own administrator, access control, region, and capacity for billing.
OneLake: Spanning the Globe
OneLake addresses data residency by spanning the globe. Different workspaces can reside in different regions, which means that any data stored in those workspaces also resides in those regions. Under the covers, OneLake is built on Azure Data Lake Storage Gen2 and uses multiple storage accounts in different regions, but it virtualizes them into one logical lake.
OneLake: Open Data Lake
OneLake is not just a Fabric data lake or a Microsoft data lake; it is an open data lake. In addition to being built on ADLS Gen2, OneLake supports the same ADLS Gen2 APIs and SDKs, making it compatible with existing ADLS applications, including Azure Databricks and Azure HDInsight.
OneLake: One Copy
One Copy aims to get the most value possible out of a single copy of data. It allows data to be virtualized into a single data product without data movement, data duplication, or a change in the ownership of the data.
OneLake: One Security
One Security is a feature in active development that aims to let you secure the data once and use it anywhere. One Security will bring a shared, universal security model that you define in OneLake, with the security definitions living alongside the data itself. This is an important detail: security lives with the data rather than downstream in the serving or presentation layers.
OneLake Data Hub
The OneLake Data Hub is the central location within Fabric to discover, manage, and reuse data. It serves all users from data engineer to business user. Data can easily be discovered by its domain, for example, Finance, HR, or Sales, so users find what actually matters to them.
In conclusion, OneLake is a game-changer in the world of data management and analytics. It provides a unified, intelligent data foundation that breaks down data silos, enabling organizations to harness the full potential of their data in the era of AI.
This blog post was created with help from ChatGPT Pro.
Microsoft Fabric is a powerful tool for data engineers, enabling them to build out a lakehouse architecture for their organizational data. In this blog post, we will walk you through the key experiences that Microsoft Fabric offers data engineers.
Creating a Lakehouse
A lakehouse is a new experience that combines the power of a data lake and a data warehouse. It serves as a central repository for all Fabric data. To create a lakehouse, you start by creating a new lakehouse artifact and giving it a name. Once created, you land in the empty Lakehouse Explorer.
Importing Data into the Lakehouse
There are several ways to bring data into the lakehouse. You can upload files and folders from your local machine, use dataflows (a low-code tool with hundreds of connectors), or leverage the pipeline copy activity to bring in petabytes of data at scale. Tabular data landed in the lakehouse is stored as Delta tables, which are created automatically with no additional effort. You can easily explore the tables, see their schema, and even view the underlying files.
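For the code-first path, here is a minimal PySpark sketch of landing a file as a Delta table from a Fabric notebook. It assumes a default lakehouse is attached to the notebook, and the file path and table name are hypothetical placeholders.

```python
# Minimal sketch: land a raw CSV as a Delta table from a Fabric notebook.
# Assumes a default lakehouse is attached; path and table name are placeholders.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("Files/raw/campaigns.csv")
)

# Saving in Delta format makes the table appear in the Lakehouse Explorer.
df.write.format("delta").mode("overwrite").saveAsTable("campaigns")
```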
Adding Unstructured Data
In addition to structured data, you might want to add some unstructured customer reviews to accompany your campaign data. If this data already exists in storage, you can simply point to it with no data movement necessary. This is done by adding a new shortcut, which allows you to create a virtual table and virtual files inside your lakehouse. Shortcuts enable you to select from a variety of sources, including lakehouses and warehouses in Fabric, but also external storage like ADLS Gen2 and even Amazon S3.
Leveraging the Data
Once all your data is ready in the lakehouse, there are many ways to use it. As a data engineer or data scientist, you can open the lakehouse in a notebook and leverage Spark to continue transforming the data or build a machine learning model. As a SQL professional, you can navigate to the SQL endpoint of the lakehouse, where you can write SQL queries and create views and functions, all on top of the same Delta tables. As a business analyst, you can navigate to the built-in modeling view and start developing your BI data model directly in the same SQL endpoint experience.
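For instance, the Spark route might look like the following sketch, which queries the same Delta table from a notebook session; the table and column names are hypothetical placeholders.

```python
# Query the Delta table with Spark SQL from a notebook session.
# Table and column names are hypothetical placeholders.
top_campaigns = spark.sql("""
    SELECT campaign_id, SUM(revenue) AS total_revenue
    FROM campaigns
    GROUP BY campaign_id
    ORDER BY total_revenue DESC
    LIMIT 10
""")
top_campaigns.show()
```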
Configuring your Spark Environment
As an administrator, you can configure the Spark environment for your data engineers. This is done in the capacity admin portal, where you can access the Spark compute settings for data engineers and data scientists. You can set a default runtime and default Spark properties, and also turn on the ability for workspace admins to configure their own custom Spark pools.
Collaborative Data Development
Microsoft Fabric also provides a rich developer experience, enabling users to collaborate easily, work with their lakehouse data, and leverage the power of Spark. You can view your colleagues’ code updates in real time, install ML libraries for your project, and use the built-in charting capabilities to explore your data. The notebook has a built-in resource folder which makes it easy to store scripts or other code files you might need for the project.
In conclusion, Microsoft Fabric provides a frictionless experience for data engineers building out their enterprise data lakehouse and can easily democratize this data for all users in an organization. It’s a powerful tool that combines the power of a data lake and a data warehouse, providing a comprehensive solution for data engineering tasks.
This blog post was created with help from ChatGPT Pro.
Spark Compute is a key component of Microsoft Fabric, the end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Spark Compute enables data engineering and data science scenarios on a fully managed Spark compute platform that delivers unparalleled speed and efficiency.
What is Spark Compute?
Spark Compute is a way of telling Spark what kind of resources you need for your data analysis tasks. You can give your Spark pool a name, and choose how many and how big the nodes (the machines that do the work) are. You can also tell Spark how to adjust the number of nodes depending on how much work you have.
Spark Compute operates on OneLake, the data lake service that powers Microsoft Fabric. OneLake provides a single place to store and access all your data, whether it is structured, semi-structured, or unstructured. OneLake also supports data from other sources, such as Amazon S3 and (soon) Google Cloud Platform.
Spark Compute supports both batch and streaming scenarios, and integrates with various tools and frameworks, such as Azure OpenAI Service, Azure Machine Learning, Databricks, Delta Lake, and more. You can use Spark Compute to perform data ingestion, transformation, exploration, analysis, machine learning, and AI tasks on your data.
How to use Spark Compute?
There are two ways to use Spark Compute in Microsoft Fabric: starter pools and custom pools.
Starter pools
Starter pools are a fast and easy way to use Spark on the Microsoft Fabric platform, with sessions available within seconds. You can start running Spark sessions right away instead of waiting for Spark to set up the nodes for you. This helps you do more with your data and get insights quicker.
Starter pools have Spark clusters that are always on and ready for your requests. They use medium nodes that dynamically scale up based on your Spark job needs. Starter pools also have default settings that let you install libraries quickly without slowing down the session start time.
You only pay for starter pools when you are using Spark sessions to run queries. You don’t pay for the time when Spark is keeping the nodes ready for you.
Custom pools
A custom pool is a way of creating a tailored Spark pool according to your specific data engineering and data science requirements. You can customize various aspects of your custom pool, such as:
Node size: You can choose from different node sizes that offer different combinations of CPU cores, memory, and storage.
Node count: You can specify the minimum and maximum number of nodes you want in your custom pool.
Autoscale: You can enable autoscale to let Spark automatically adjust the number of nodes based on the workload demand.
Dynamic allocation: You can enable dynamic allocation to let Spark dynamically allocate executors (the processes that run tasks) based on the workload demand.
Libraries: You can install libraries from various sources, such as Maven, PyPI, CRAN, or your workspace.
Properties: You can configure custom properties for your custom pool, such as spark.executor.memory or spark.sql.shuffle.partitions (see the sketch after this list).
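As a small sketch of what property tuning can look like at the session level (pool-level defaults are configured in the Fabric UI), the following assumes a running Spark session in a Fabric notebook. Note that some properties can only take effect before the session starts.

```python
# Inspect the current shuffle partition count for this session.
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Override it for this session only. Runtime-mutable SQL properties like
# this one can be changed on the fly, while properties such as
# spark.executor.memory must be set before the session starts.
spark.conf.set("spark.sql.shuffle.partitions", "64")
```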
Creating a custom pool is free; you only pay when you run a Spark job on the pool. If you don’t use your custom pool for 2 minutes after your job is done, Spark will automatically delete it. This is called the “time to live” property, and you can change it if you want.
If you are a workspace admin, you can also create default custom pools for your workspace, and make them the default option for other users. This way, you can save time and avoid setting up a new custom pool every time you run a notebook or a Spark job.
Custom pools take about 3 minutes to start, because Spark has to get the nodes from Azure.
Conclusion
Spark Compute is a powerful and flexible way of using Spark on Microsoft Fabric. It enables you to perform various data engineering and data science tasks on your data stored in OneLake or other sources. It also offers different options for creating and managing your Spark pools according to your needs and preferences.
Have questions about Microsoft Fabric? Here’s a quick FAQ to help you out:
Q: What is Microsoft Fabric? A: Microsoft Fabric is an end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Fabric integrates technologies like Azure Data Factory, Azure Synapse Analytics, and Power BI into a single unified product, empowering data and business professionals alike to unlock the potential of their data and lay the foundation for the era of AI.
Q: What are the benefits of using Microsoft Fabric? A: Some of the benefits of using Microsoft Fabric are:
It simplifies analytics by providing a single product with a unified experience and architecture, covering all the capabilities a developer needs to extract insights from data and present them to the business user.
It enables faster innovation by helping every person in your organization act on insights from within Microsoft 365 apps, such as Microsoft Excel and Microsoft Teams.
It reduces costs by eliminating data sprawl and creating custom views for everyone.
It supports open and scalable solutions that give data stewards additional control with built-in security, governance, and compliance.
It accelerates analysis by developing AI models on a single foundation without data movement, reducing the time data scientists need to deliver value.
Q: How can I get started with Microsoft Fabric? A: You can get started with Microsoft Fabric by signing up for a free trial here: https://www.microsoft.com/microsoft-fabric/try-for-free. You will get a fixed Fabric trial capacity for each business user, which may be used for any feature or capability.
Q: What are the main components of Microsoft Fabric? A: The main components of Microsoft Fabric are:
Unified data foundation: A data lake-centric hub that helps data engineers connect and curate data from different sources, eliminating sprawl and creating custom views for everyone.
Role-tailored tools: A set of tools that cater to different roles in the analytics process, such as data engineering, data warehousing, data science, real-time analytics, and business intelligence.
AI-powered capabilities: A set of capabilities that leverage generative AI and language model services, such as Azure OpenAI Service, to enable customers to use and create everyday AI experiences that are reinventing how employees spend their time.
Open, governed foundation: A foundation that supports open standards and formats, such as Apache Spark, SQL, Python, R, and Parquet, and provides robust data security, governance, and compliance features.
Cost management: A feature that helps customers optimize their spending on Fabric by providing visibility into their usage and costs across different services and resources.
Q: How does Microsoft Fabric integrate with other Microsoft products? A: Microsoft Fabric integrates seamlessly with other Microsoft products, such as:
Microsoft 365: Users can access insights from Fabric within Microsoft 365 apps, such as Excel and Teams, using natural language queries or pre-built templates.
Azure OpenAI Service: Users can leverage generative AI and language model services from Azure OpenAI Service to create everyday AI experiences within Fabric.
Azure Data Explorer: Users can ingest, store, analyze, and visualize massive amounts of streaming data from various sources using Azure Data Explorer within Fabric.
Azure IoT Hub: Users can connect millions of devices and stream real-time data to Fabric using Azure IoT Hub.
Q: How does Microsoft Fabric compare with other analytics platforms? A: Microsoft Fabric differs from other analytics platforms in several ways:
It is an end-to-end analytics product that addresses every aspect of an organization’s analytics needs with a single product and a unified experience.
It is a SaaS product that is automatically integrated and optimized, and users can sign up within seconds and get real business value within minutes.
It is an AI-powered platform that leverages generative AI and language model services to enable customers to use and create everyday AI experiences.
It is an open and scalable platform that supports open standards and formats, and provides robust data security, governance, and compliance features.
Q: Who are the target users of Microsoft Fabric? A: Microsoft Fabric is designed for enterprises that want to transform their data into a competitive advantage. It caters to different roles in the analytics process, such as:
Data engineers: They can use Fabric to connect and curate data from different sources, create custom views for everyone, and manage powerful AI models without data movement.
Data warehousing professionals: They can use Fabric to build scalable data warehouses using SQL or Apache Spark, perform complex queries across structured and unstructured data sources, and optimize performance using intelligent caching.
Data scientists: They can use Fabric to develop AI models using Python or R on a single foundation without data movement, leverage generative AI and language model services from Azure OpenAI Service, and deploy models as web services or APIs.
Data analysts: They can use Fabric to explore and analyze data using SQL or Apache Spark notebooks or Power BI Desktop within Fabric, create rich visualizations using Power BI Embedded within Fabric or Power BI Online outside of Fabric.
Business users: They can use Fabric to access insights from within Microsoft 365 apps using natural language queries or pre-built templates, or use Power BI Online outside of Fabric to consume reports or dashboards created by analysts.
Data science is the process of extracting insights from data using various methods and techniques, such as statistics, machine learning, and artificial intelligence. Data science can help organizations solve complex problems, optimize processes, and create new opportunities.
However, data science is not an easy task. It involves multiple steps and challenges, such as:
Finding and accessing relevant data sources
Exploring and understanding the data
Cleaning and transforming the data
Experimenting and building machine learning models
Deploying and operationalizing the models
Communicating and presenting the results
To perform these steps effectively, data scientists need a powerful and flexible platform that can support their end-to-end workflow and enable them to collaborate with other roles, such as data engineers, analysts, and business users.
This is where Microsoft Fabric comes in.
Microsoft Fabric is an end-to-end, unified analytics platform that brings together all the data and analytics tools that organizations need. Fabric integrates technologies like Azure Data Factory, Azure Synapse Analytics, and Power BI into a single unified product, empowering data and business professionals alike to unlock the potential of their data and lay the foundation for the era of AI.
In this blog post, I will focus on how Microsoft Fabric offers a rich and comprehensive Data Science experience that can help data scientists complete their tasks faster and more easily.
The Data Science experience in Microsoft Fabric
The Data Science experience in Microsoft Fabric consists of multiple natively built features that enable collaboration, data acquisition, sharing, and consumption in a seamless way. In this section, I will describe some of these features and how they can help data scientists in each step of their workflow.
Data discovery and pre-processing
The first step in any data science project is to find and access relevant data sources. Microsoft Fabric users can interact with data in OneLake using the Lakehouse item. A Lakehouse easily attaches to a notebook, letting you browse and interact with data, and you can read data from a Lakehouse directly into a pandas DataFrame.
This makes seamless data reads from OneLake possible during exploration. A powerful set of tools is also available for data ingestion and orchestration through data integration pipelines, a natively integrated part of Microsoft Fabric. Easy-to-build data pipelines can access and transform the data into a format that machine learning can consume.
An important part of the machine learning process is to understand data through exploration and visualization. Depending on the data storage location, Microsoft Fabric offers a set of different tools to explore and prepare the data for analytics and machine learning.
For example, users can use SQL or Apache Spark notebooks to query and analyze data using familiar languages like SQL, Python, R, or Scala. They can also use Data Wrangler to perform common data cleansing and transformation tasks using a graphical interface.
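As a hedged sketch of that exploration step, the snippet below reads lakehouse data into pandas from a Fabric notebook. The mount path follows the default-lakehouse convention, and the file and table names are hypothetical placeholders.

```python
import pandas as pd

# Read a file through the default lakehouse mount point
# (file name is a hypothetical placeholder).
reviews = pd.read_csv("/lakehouse/default/Files/reviews.csv")
print(reviews.describe())

# Or read a Delta table through Spark and convert it to pandas.
campaigns = spark.read.table("campaigns").toPandas()
```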
Experimentation and modeling
The next step in the data science workflow is to experiment with different algorithms and techniques to build machine learning models that can address the problem at hand. Microsoft Fabric supports various ways to develop and train machine learning models using Python or R on a single foundation without data movement.
For example, users can use the Azure Machine Learning SDK within notebooks to access features such as automated machine learning, hyperparameter tuning, model explainability, and model management. They can also leverage generative AI and language model services from Azure OpenAI Service to create everyday AI experiences within Fabric.
Microsoft Fabric also provides an experiment item that allows users to create experiments that track the metrics and outputs of their machine learning runs. Users can compare different runs within an experiment, or across experiments, using interactive charts and tables.
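A minimal tracking sketch, assuming the experiment item exposes MLflow-style APIs from the notebook; the experiment, parameter, and metric names are placeholders.

```python
import mlflow

# Point runs at a named experiment (placeholder name).
mlflow.set_experiment("campaign-response-model")

with mlflow.start_run():
    # Log whatever parameters and metrics matter for the comparison.
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("auc", 0.87)
```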
Enrichment and operationalization
The final step in the data science workflow is to deploy and operationalize the machine learning models so that they can be consumed by other applications or users. Microsoft Fabric makes this step easy by providing various options to deploy models as web services or APIs.
For example, users can use the Azure Machine Learning SDK within notebooks to register their models in an Azure Machine Learning workspace and deploy them as web services on Azure Container Instances or Azure Kubernetes Service.
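A hedged sketch with the Azure Machine Learning SDK (v1): the workspace details, file names, and endpoint names are hypothetical placeholders, and the scoring script and environment file are assumed to exist alongside the notebook.

```python
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

# Workspace details are hypothetical placeholders.
ws = Workspace.get(
    name="my-aml-workspace",
    subscription_id="<subscription-id>",
    resource_group="my-resource-group",
)

# Register a trained model file (placeholder path and name).
model = Model.register(workspace=ws, model_path="model.pkl", model_name="campaign-model")

# score.py and env.yml are assumed to exist alongside this notebook.
inference_config = InferenceConfig(
    entry_script="score.py",
    environment=Environment.from_conda_specification("inference-env", "env.yml"),
)

# Deploy as a web service on Azure Container Instances.
service = Model.deploy(
    ws, "campaign-endpoint", [model], inference_config,
    AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1),
)
service.wait_for_deployment(show_output=True)
```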
Insights and communication
The ultimate goal of any data science project is to communicate and present the results and insights to stakeholders or customers. Microsoft Fabric enables this by integrating with Power BI, the leading business intelligence tool from Microsoft.
Users can create rich visualizations using Power BI Embedded within Fabric or Power BI Online outside of Fabric. They can also consume reports or dashboards created by analysts using Power BI Online outside of Fabric. Moreover, they can access insights from Fabric within Microsoft 365 apps using natural language queries or pre-built templates.
Conclusion
In this blog post, I have shown how Microsoft Fabric offers a comprehensive Data Science experience that can help data scientists complete their end-to-end workflow faster and more easily. Microsoft Fabric is an end-to-end analytics product that addresses every aspect of an organization’s analytics needs with a single product and a unified experience. It is an AI-powered platform that leverages generative AI and language model services to enable customers to use and create everyday AI experiences, and an open, scalable platform that supports open standards and formats while providing robust data security, governance, and compliance features.
In the world of data analytics, the choice between a data warehouse and a lakehouse can be a critical decision. Both have their strengths and are suited to different types of workloads. Microsoft Fabric, a comprehensive analytics solution, offers both options. This blog post will help you understand the differences between a lakehouse and a warehouse in Microsoft Fabric and guide you in making the right choice for your needs.
What is a Lakehouse in Microsoft Fabric?
A lakehouse in Microsoft Fabric is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location. It is a flexible and scalable solution that allows organizations to handle large volumes of data using a variety of tools and frameworks to process and analyze that data. It integrates with other data management and analytics tools to provide a comprehensive solution for data engineering and analytics.
The Lakehouse creates a serving layer by auto-generating a SQL endpoint and a default dataset during creation. This see-through functionality allows users to work directly on top of the Delta tables in the lake, providing a frictionless and performant experience all the way from data ingestion to reporting.
An important distinction from a full warehouse is that the SQL endpoint is a read-only experience and doesn’t support the full T-SQL surface area of a transactional data warehouse. It is also important to note that only tables in Delta format are available through the SQL endpoint.
Lakehouse vs Warehouse: A Decision Guide
When deciding between a lakehouse and a warehouse in Microsoft Fabric, there are several factors to consider:
Data Volume: Both lakehouses and warehouses can handle unlimited data volumes.
Type of Data: Lakehouses can handle unstructured, semi-structured, and structured data, while warehouses are best suited to structured data.
Developer Persona: Lakehouses are best suited to data engineers and data scientists, while warehouses are more suited to data warehouse developers and SQL engineers.
Developer Skill Set: Lakehouses require knowledge of Spark (Scala, PySpark, Spark SQL, R), while warehouses primarily require SQL skills.
Data Organization: Lakehouses organize data by folders and files, databases and tables, while warehouses use databases, schemas, and tables.
Read Operations: Both lakehouses and warehouses support Spark and T-SQL read operations.
Write Operations: Lakehouses use Spark (Scala, PySpark, Spark SQL, R) for write operations, while warehouses use T-SQL.
Conclusion
The choice between a lakehouse and a warehouse in Microsoft Fabric depends on your specific needs and circumstances. If you’re dealing with large volumes of unstructured or semi-structured data and have developers skilled in Spark, a lakehouse may be the best choice. On the other hand, if you’re primarily dealing with structured data and your developers are more comfortable with SQL, a warehouse might be more suitable.
Remember, with the flexibility offered by Fabric, you can implement either lakehouse or data warehouse architectures, or combine the two to get the best of both with a simple implementation.
This blog post was created with help from ChatGPT Pro.
Data engineering plays a crucial role in the modern data-driven world. It involves designing, building, and maintaining infrastructures and systems that enable organizations to collect, store, process, and analyze large volumes of data. Microsoft Fabric, a comprehensive analytics solution, offers a robust platform for data engineering. This blog post will provide a detailed overview of data engineering in Microsoft Fabric.
What is Data Engineering in Microsoft Fabric?
Data engineering in Microsoft Fabric enables users to design, build, and maintain infrastructures and systems that allow their organizations to collect, store, process, and analyze large volumes of data. Microsoft Fabric provides various data engineering capabilities to ensure that your data is easily accessible, well-organized, and of high-quality.
From the data engineering homepage, users can perform a variety of tasks:
Create and manage your data using a lakehouse
Design pipelines to copy data into your lakehouse
Use Spark Job definitions to submit batch/streaming jobs to Spark clusters
Use notebooks to write code for data ingestion, preparation, and transformation (see the sketch after this list)
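As a small sketch of the notebook path, here is a hypothetical preparation step in PySpark; the table names and columns are placeholders, and a default lakehouse is assumed to be attached.

```python
# A hypothetical cleanup step: deduplicate and filter raw data,
# then write a curated Delta table (names are placeholders).
raw = spark.read.table("campaigns_raw")

curated = (
    raw.dropDuplicates(["campaign_id"])
       .filter("revenue IS NOT NULL")
)

curated.write.format("delta").mode("overwrite").saveAsTable("campaigns_curated")
```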
Lakehouse Architecture
Lakehouses are data architectures that allow organizations to store and manage structured and unstructured data in a single location. They use various tools and frameworks to process and analyze that data. This can include SQL-based queries and analytics, as well as machine learning and other advanced analytics techniques.
Microsoft Fabric: An All-in-One Analytics Solution
Microsoft Fabric is an all-in-one analytics solution for enterprises that covers everything from data movement to data science, real-time analytics, and business intelligence. It offers a comprehensive suite of services, including data lake, data engineering, and data integration, all in one place.
Traditionally, organizations have been building modern data warehouses for their transactional and structured data analytics needs and data lakehouses for big data (semi-structured and unstructured) analytics needs. These two systems ran in parallel, creating silos, data duplication, and increased total cost of ownership.
Fabric, with its unification of data stores and standardization on the Delta Lake format, allows you to eliminate silos, remove data duplication, and drastically reduce total cost of ownership. With the flexibility offered by Fabric, you can implement either lakehouse or data warehouse architectures, or combine the two to get the best of both with a simple implementation.
Data Engineering Capabilities in Microsoft Fabric
Fabric makes it quick and easy to connect to Azure Data Services, as well as other cloud-based platforms and on-premises data sources, for streamlined data ingestion. You can quickly build insights for your organization using more than 200 native connectors, which are integrated into Fabric pipelines and complemented by user-friendly drag-and-drop data transformation with dataflows.
Fabric standardizes on the Delta Lake format, which means all the Fabric engines can access and manipulate the same dataset stored in OneLake without duplicating data. This storage system provides the flexibility to build lakehouses using a medallion architecture or a data mesh, depending on your organizational requirements. You can choose between a low-code or no-code experience for data transformation, using pipelines and dataflows, or a code-first experience with notebooks and Spark.
Power BI can consume data from the Lakehouse for reporting and visualization. Each Lakehouse also has a built-in TDS/SQL endpoint for easy connectivity and querying of data in the Lakehouse tables from other reporting tools.
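As a hedged example of that connectivity, the sketch below queries a Lakehouse SQL endpoint over TDS with pyodbc. The server name, database, and table are hypothetical placeholders; the real connection string should be copied from the SQL endpoint's settings in Fabric.

```python
import pyodbc

# Server, database, and table names are hypothetical placeholders;
# copy the real connection string from the SQL endpoint settings in Fabric.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-endpoint.datawarehouse.fabric.microsoft.com;"
    "Database=MarketingLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
)

cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM dbo.campaigns")
for row in cursor.fetchall():
    print(row)
```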
Conclusion
Microsoft Fabric is a powerful tool for data engineering, providing a comprehensive suite of services and capabilities for data collection, storage, processing, and analysis. Whether you’re looking to implement a lakehouse or data warehouse architecture, or a combination of both, Fabric offers the flexibility and functionality to meet your data engineering needs.
This blog post was created with help from ChatGPT Pro.