Azure Databricks (part 1/2)

Tihana Britvic

DATA ENGINEER

WHAT IS AZURE DATABRICKS?

Azure Databricks is a fast, easy and collaborative Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. It was designed in collaboration with the Apache Spark founders and is integrated with Azure providing one-click setup, streamlined workflows, and an interactive workspace that enables data scientists, data engineers, and business analysts to work together while boosting their productivity.

Azure Databricks supports Python, Scala, R, Java, and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch, and scikit-learn. For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches or streamed near real-time using Kafka, Event Hub, or IoT Hub. This data lands in a data lake for long term persisted storage, in Azure Blob Storage or Azure Data Lake Storage.

Apache Spark-based analytics platform

Azure Databricks comprises the complete open-source Apache Spark cluster technologies and capabilities. Spark in Azure Databricks includes Spark SQL, DataFrames, Streaming, MLib, GraphX, and Spark Core API.

In this blog post, we will demonstrate the use of Spark SQL and DataFrames.

AZURE DATABRICKS AND AZURE DATA FACTORY INTEGRATION

The integration of Azure Databricks and Azure Data Factory (ADF) comes in terms of conducting the support for doing ETL/ELT through Azure Data Factory. This is often done through pipelines which are usually contained out three steps:

Ingest data
Prepare and transform the ingested data in Azure Databricks as a Notebook activity step in data factory pipelines
Monitor and manage created workflows

These steps can be seen in the picture below.

Example

To show basic concepts of integration of Azure Data Bricks into Azure Data Factory we are going to create an example of data transformation followed by the execution of Python notebook on the gained data just like in the previous picture with one change: we will write transformed data into the same storage.

Steps of our implementation are:

Create Azure Data Storage
Create Azure Data Factory
Create a new cluster

Create a new notebook
Schedule job
Integrate Azure Data Factory and Azure Databrick

DATASETS

The data which used in the example are the events of storms happening across the USA collected by the National Oceanic and Atmospheric Administration (NOAA). The list of all the available datasets can be found here.

AZURE DATA STORAGE

Create a new Storage account by choosing to create a new resource in the Azure portal under the Storage options. In order to create a new account, several information is needed:

Subscription
Resource group – choose the option to create a new resource group
Storage account name
In the example, the Performance, Account kind, Replication, and Access were set to standard settings. If needed, these values can be changed and modeled to fix the use case

By clicking on the Review + create button the new resource group and the new storage will be created.
After the deployment is succeeded, go to created resource and under Blob service click on a Blobs icon.
This will enable you to create a new container. Since the data consists of information about storm events, we will be creating the storm container.

After the container has been created, it can be accessed, and the data can be uploaded. You can upload the data by simply clicking on the upload your dataset into the blob container.

The result should be as can be seen in the picture below.

The data can also be upload using the data using the Microsoft SQL Server Management (MSSM). The instructions for that operation can be found here.

Note: the connection of the Azure Data Factory and Data Storage will be explained be later, however you will be needing a storage account name,key and an end point suffix which can be found under access keys in the storage settings.

AZURE DATA FACTORY

Create an Azure Data Factory resource for data transformation with the following values: name, subscription, resource group, version and location.

The data can also be upload using the data using the Microsoft SQL Server Management (MSSM). The instructions for that operation can be found here.

AZURE DATA FACTORY

Create an Azure Data Factory resource for data transformation with the following values: name, subscription, resource group, version and location.

Next step is to connect to the container from the connected storage and browse the wanted dataset. In our case it is done as in the picture below.

With this step, the new source named StormEvents table will be imported.

3 steps of data transformation:

Generating the subset of the dataset using Select option on the EVENT_ID, STATE, YEAR, MONTH_NAME, EVENT_TYPE, INJURIES_DIRECT, INJURIES_INDIRECT, DEATHS_DIRECT, DEATHS_INDIRECT, DAMAGE_PROPERTY, TOR_F_SCALE, TOR_LENGTH and TOR_WIDTH columns.

Using Conditional If to split the data in two tables

toInteger(INJURIES_DIRECT) != 0 || toInteger(INJURIES_INDIRECT) != 0 ||
toInteger(DEATHS_DIRECT) != 0 || toInteger(DEATHS_INDIRECT) != 0

Sinking the created tables into the storage.

Note: make sure the writing permissions on storage are enabled

After the data flow is validated and published the created tables can be seen under Factory Resources→Datasets.

The same result should be visible in the connected Data Storage.

AZURE DATABRICKS

Add a new Analytics resource Azure Databricks Service with the following values: workspace name, subscription, resource group, location, and the pricing tier.

Newly created Azure Databricks Service must be accessed by clicking on Launch Workspace icon.

After the workspace has been created choose among the Common tasks the New cluster icon to create the new cluster. Azure Databricks clusters provide an unified platform for various use cases.

For example, one basic cluster will be created with the values as below.

Schedule job

To demonstrate the Azure Databricks scheduling, a job will be scheduled for the previously created notebook will be running the notebook every hour. This is easily done by clicking on the Schedule icon in Azure Databricks notebook. The scheduling is done in three steps, all of which are visible on the picture below.

AZURE DATABRICKS IN AZURE DATA FACTORY PIPELINE

Create the new pipeline in Azure Data Factory by clicking on the add new pipeline icon under Factory Resources → Pipeline.

Stages of example pipeline:

Move & Transform actions – Data Flow
Databricks Notebooks
Validation of data
Deleting the created tables

Inject the previously created Data Flow as a first step in the new pipeline.

Import of the notebook in the Azure Data Factory has two steps:

1. Link Azure Databricks Service:

A token is required which can be generated by clicking the Databricks workspace Account icon under user settings.
Note: store the generated token since there is no way of recovering it. If the generation is done successfully, the Token ID will be listed in the user settings.

2. Browse the notebook you want to import

Under Settings find the Browse option and find the Notebook path

To check the processed done in Notebook, the Validation and the Delete steps are added in the pipeline. However, since they do not affect our Azure Databricks integration into Azure Data Factory they will not be explained in this blog post.

CLEARING THE RESOURCES

After the Azure Data Factory and Azure Databricks integration is done, remember to delete the remaining resources. The simplest way to accomplish this is to delete the entire resource group. However, some of the resources are still needed, delete the surplus resources.