Overview of Azure Data Factory Version 2

Ivan Celikovic

PRINCIPAL DATA ENGINEER

An easy-to-use, cloud-based data integration service that lets you securely lift data to the cloud and perform rudimentary transformations.

Azure Data Factory

In one of our previous blogs, "Azure Data Factory – Transfer data between on-premises and Azure cloud", we introduced the Azure Data Factory tool, its main components and its capabilities. In short, it is an easy-to-use, cloud-based data integration service that lets you securely lift data to the cloud and perform rudimentary transformations. While it has many advantages, plenty of data engineers who are used to traditional ETL tools didn't use it very often because of some basic drawbacks. Notably, there was no GUI (everything was done in JSON), changing a schedule was a major pain, scheduling worked only on dataset-based time slices, chaining several activities wasn't solved in the best possible way, and so on.

Azure Data Factory version 2 (ADFv2) has been available in public preview for some time now, and while it is still not officially released or available in all regions, it has been well received in the community. The general direction it is going in seems promising. So, without further ado, here is a basic overview of the new features.

Next Generation of Integration

A key component of version 2 is the Integration Runtime (IR), which allows you to combine data from different environments, integrate it using various Microsoft compute services and ADF activities, and even natively execute SQL Server Integration Services (SSIS) packages. The Azure-SSIS Integration Runtime is particularly interesting because it provides advanced integration and transformation options from Microsoft's flagship ETL tool. These are official examples of different uses of IR:

Your main Data Factory contains the metadata and orchestrates your workflows. You can set up IRs in different Azure locations around the world to move and combine data and to call compute services in those locations. In the following example the Data Factory is located in the East US region, with two IRs located in UK South and North Europe. That means you can trigger a pipeline in Virginia that moves data in London and runs, for example, a machine learning algorithm there, then use that output to run an SSIS package on the IR in Ireland:

GUI and activities

The biggest deterrent for new users was most definitely the user interface, or rather the lack of one: everything was configured in JSON files. ADFv2 offers a pleasant web-based developer user interface. You can easily create new activities by dragging and dropping Azure Resource Manager templates and configuring their settings. Activities now support branching via Control flow; they can be connected on success, failure and completion into long integration chains.
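To make the branching a bit more concrete, here is a minimal sketch of what such a chain looks like in the underlying pipeline JSON, built in Python so it can be printed and inspected. The `dependsOn` and `dependencyConditions` properties follow the pipeline JSON format as I understand it from the public documentation; the pipeline name, the activity names and the `activity` helper are made up for illustration, and the `typeProperties` each activity would need are left out.

```python
import json


def activity(name, activity_type, depends_on=None):
    """Build a bare activity definition (hypothetical helper).

    depends_on maps an upstream activity name to the condition
    (Succeeded, Failed or Completed) under which this activity runs.
    """
    return {
        "name": name,
        "type": activity_type,
        "dependsOn": [
            {"activity": upstream, "dependencyConditions": [condition]}
            for upstream, condition in (depends_on or {}).items()
        ],
    }


# Hypothetical pipeline: one copy step with three differently connected followers.
pipeline = {
    "name": "CopyThenBranch",
    "properties": {
        "activities": [
            activity("CopySales", "Copy"),
            # Runs only if the copy succeeded.
            activity("LoadWarehouse", "SqlServerStoredProcedure", {"CopySales": "Succeeded"}),
            # Runs only if the copy failed.
            activity("AlertOnFailure", "WebActivity", {"CopySales": "Failed"}),
            # Runs once the copy has finished, whatever the outcome.
            activity("WriteAuditLog", "SqlServerStoredProcedure", {"CopySales": "Completed"}),
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

In the GUI the same thing is done by dragging the green, red or blue connector from one activity to the next.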

It is also worth mentioning the Lookup and HTTP activities (the latter is named Web activity in ADFv2). The former lets us look up a list to be used in downstream activities, and with the latter you can call web services directly from your pipeline. We'll use both of them in an example, so you get a clearer understanding. Some of the features from version 1 are still not implemented in the GUI though, so there may come a time when you will have to code something manually in JSON files and upload them using PowerShell.
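As a rough sketch of how these two activities fit together, the snippet below builds a Lookup activity that reads one row from a control table and a Web activity that posts a value from that row to a web service. The property names (`firstRowOnly`, `url`, `method`, `body`) and the `@{activity(...).output.firstRow...}` expression follow the documented JSON as far as I know; the dataset, table, column and URL are placeholders, and the `SqlSource` type is an assumption that depends on your data store.

```python
import json

# Lookup: fetch a single configuration row (all names are placeholders).
lookup = {
    "name": "LookupConfig",
    "type": "Lookup",
    "typeProperties": {
        # Source type is an assumption; it varies with the connector used.
        "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "SELECT TOP 1 TableName FROM etl.LoadConfig",
        },
        "dataset": {"referenceName": "ControlDb", "type": "DatasetReference"},
        "firstRowOnly": True,
    },
}

# Web activity: call a (hypothetical) service once the lookup has succeeded,
# passing the looked-up value through an expression.
notify = {
    "name": "NotifyService",
    "type": "WebActivity",
    "dependsOn": [
        {"activity": lookup["name"], "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "url": "https://example.com/api/notify",
        "method": "POST",
        "body": {
            "table": "@{activity('LookupConfig').output.firstRow.TableName}"
        },
    },
}

print(json.dumps({"activities": [lookup, notify]}, indent=2))
```

With `firstRowOnly` set to false, the Lookup output should be an array under `output.value` instead of a single `output.firstRow`, which is the form you would feed into a loop over downstream activities.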

Variables and scheduling

As I mentioned earlier, scheduling and its modification weren't handled in the best way in version 1. ADFv2 offers much more in that area. First, we have System Variables, which return pipeline and trigger-based values such as the pipeline name, the trigger type, name and ID, the scheduled and start times, etc. Expressions and Parameters provide the means to create dynamic variables. Parameters are key-value pairs of read-only configuration settings defined in pipelines, through which you can pass custom values.
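For illustration, here is a small sketch of a pipeline that declares one parameter, followed by a handful of expressions that read system variables and that parameter. The expressions use the documented `@pipeline()` and `@trigger()` accessors; the pipeline name, the parameter name `sourceFolder` and the dictionary keys are my own placeholders.

```python
import json

# Hypothetical pipeline with a single read-only parameter.
pipeline = {
    "name": "LoadDailyExtract",
    "properties": {
        "parameters": {
            "sourceFolder": {"type": "String", "defaultValue": "incoming"}
        },
        "activities": [],  # omitted to keep the sketch short
    },
}

# Expression strings that could be placed in activity or dataset properties.
expressions = {
    # System variables scoped to the pipeline run.
    "pipeline_name": "@pipeline().Pipeline",
    "run_id": "@pipeline().RunId",
    "trigger_type": "@pipeline().TriggerType",
    "trigger_name": "@pipeline().TriggerName",
    # System variables scoped to a schedule trigger.
    "scheduled_time": "@trigger().scheduledTime",
    "start_time": "@trigger().startTime",
    # A parameter, and a dynamic value combining it with a system variable.
    "source_folder": "@pipeline().parameters.sourceFolder",
    "dated_path": (
        "@concat(pipeline().parameters.sourceFolder, '/', "
        "formatDateTime(trigger().scheduledTime, 'yyyy-MM-dd'))"
    ),
}

print(json.dumps(pipeline, indent=2))
for label, expression in expressions.items():
    print(f"{label}: {expression}")
```

The parameter value is supplied when the pipeline is started, manually or by a trigger, and cannot be changed while the run is in progress.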

Scheduling is also one of the much-improved features. Whereas earlier you had to define it via pipeline and dataset availability (which isn't practical), it is now stored in separate entities called Triggers. Triggers can be manual, event-based (for example, on file arrival), scheduled (standard ETL orchestration where you define a time and recurrence) or tumbling window (which is similar to the time slices from version 1).
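To show how a schedule now lives apart from the pipeline, here is a sketch of two trigger definitions: a daily schedule trigger and an hourly tumbling window trigger. The overall shape (`ScheduleTrigger` with a `recurrence` block and a `pipelines` list, `TumblingWindowTrigger` with a single `pipeline`) follows the trigger JSON as I know it from the documentation; the trigger names, the pipeline names, the parameters and the start time are invented for the example.

```python
import json

# Schedule trigger: run a (hypothetical) pipeline every day from the start time on.
schedule_trigger = {
    "name": "DailyAt0600",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2018-06-01T06:00:00Z",
                "timeZone": "UTC",
            }
        },
        # One schedule trigger can start several pipelines.
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "LoadDailyExtract",
                    "type": "PipelineReference",
                },
                "parameters": {"sourceFolder": "incoming"},
            }
        ],
    },
}

# Tumbling window trigger: fixed, non-overlapping one-hour windows,
# with the window boundaries handed to the pipeline as parameters.
tumbling_trigger = {
    "name": "HourlyWindow",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2018-06-01T00:00:00Z",
            "maxConcurrency": 4,
        },
        # A tumbling window trigger is bound to exactly one pipeline.
        "pipeline": {
            "pipelineReference": {
                "referenceName": "LoadHourlySlice",
                "type": "PipelineReference",
            },
            "parameters": {
                "windowStart": "@trigger().outputs.windowStartTime",
                "windowEnd": "@trigger().outputs.windowEndTime",
            },
        },
    },
}

print(json.dumps([schedule_trigger, tumbling_trigger], indent=2))
```

Changing the schedule then means editing or replacing the trigger only; the pipeline definition itself stays untouched.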

Linked services and datasets

Linked services (database connections) and datasets (data models) are still the same at their core. Linked services now have a connectVia property, so you can specify which Integration Runtime compute environment they use. You don't need to define dataset availability schedules anymore; that part has been moved to triggers.
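As a sketch of the new property, the snippet below defines a linked service that is routed through a named Integration Runtime via `connectVia`, and a dataset that refers to it and no longer carries an availability section. The reference types are the documented ones as far as I know; the linked service, dataset, table and runtime names are placeholders.

```python
import json

# Linked service bound to a specific Integration Runtime (placeholder names).
linked_service = {
    "name": "OnPremSqlServer",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "<connection string>",
            }
        },
        # The runtime that performs the connection and data movement.
        "connectVia": {
            "referenceName": "UkSouthRuntime",
            "type": "IntegrationRuntimeReference",
        },
    },
}

# Dataset pointing at the linked service; note there is no "availability" block.
dataset = {
    "name": "SalesTable",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": {
            "referenceName": linked_service["name"],
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"tableName": "dbo.Sales"},
    },
}

print(json.dumps([linked_service, dataset], indent=2))
```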

Monitoring and security

The new visual monitoring capabilities offer an experience that is similar to v1 but easier to understand. You can easily see the status of recent pipeline runs, rerun them, check their parameters, drill deeper into activities and see their logs. Pipeline runs can be filtered so that you see only those that interest you. Pipelines, activities and triggers now have instances (runs) that can be tracked via unique IDs.

Operations Management Suite and Azure Monitor can now be integrated with ADFv2, which offers richer logs and upgraded metrics.

ADF credentials can be securely stored in Azure Key Vault via Azure Active Directory, offering a solution closer to enterprise architecture.
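A minimal sketch of what that looks like in practice: one linked service points at the Key Vault itself, and another linked service refers to a secret in it instead of carrying the password inline. The `AzureKeyVault` and `AzureKeyVaultSecret` types are the documented ones to my knowledge; the vault URL, the secret name and the linked service names are placeholders.

```python
import json

# Linked service for the vault itself (placeholder URL).
key_vault = {
    "name": "CompanyKeyVault",
    "properties": {
        "type": "AzureKeyVault",
        "typeProperties": {"baseUrl": "https://example-vault.vault.azure.net"},
    },
}

# Linked service whose connection string is resolved from the vault at run time.
sql_linked_service = {
    "name": "WarehouseDb",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": key_vault["name"],
                    "type": "LinkedServiceReference",
                },
                "secretName": "warehouse-connection-string",
            }
        },
    },
}

print(json.dumps([key_vault, sql_linked_service], indent=2))
```

The Data Factory identity needs permission to read secrets from the vault for this reference to resolve.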

PowerShell commands

As in the earlier version, the Azure RM PowerShell module offers a variety of commands that allow you to bypass the web UI and make changes directly. The module now contains even more commands. This can be helpful for automated deployment, scheduling with scripts, versioning of code, etc. Here you can find the whole list of PowerShell commands.

Conclusion

While Azure Data Factory still doesn't offer all the capabilities of traditional ETL tools, it offers a great deal as a different kind of hybrid solution. It may lack rich transformation options, source repository integration and a central metadata store, and it has other smaller weaknesses, but it offers many unique capabilities and integration options. ETL-as-a-Service is becoming more popular in situations where agile teams need to deliver solutions quickly and at a low initial cost. As the move of data to the cloud gathers pace and pay-for-what-you-use pricing becomes more popular, such tools will be more and more in demand, and ADFv2 is certainly one of the top contenders.