Azure Data Factory as ELT Tool in Complex Pipeline Build

Since everything in today’s market is data-driven, loading a data warehouse can be extremely time-consuming. The process of extracting, loading and transforming (ELT) data streamlines the tasks of modern data warehousing and big data management so that businesses can focus on data mining and actionable insights. Extract/Load/Transform (ELT) is the process of extracting data from one or more sources and loading it into a target data warehouse. The transformation step is pushed to the target database, which improves performance. This is especially helpful for processing the massive data sets required for business intelligence (BI) and big data analytics. ELT is an alternative, improved approach to Extract/Transform/Load (ETL): it reduces the time data spends in transit and boosts efficiency.

Here is a closer look at what happens in each of the ELT stages (a minimal code sketch follows the list):

  1. Extract: Raw streams of data from virtual infrastructure, software and applications are ingested, either in their entirety or according to predefined rules.
  2. Load: The mass of raw data is delivered directly to the target storage location. No transformation happens in this step, which shortens the cycle between extraction and delivery.
  3. Transform: The database or data warehouse sorts and normalizes the data, keeping part or all of it and making it accessible for customized reporting. The data stored here can be used for relevant business intelligence in near real time.
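
As a rough illustration of these three stages, here is a minimal sketch that extracts rows from a CSV file, loads them untouched into a staging table, and then pushes the transformation to the target database itself. The file name, table names and column layout are hypothetical, and SQLite stands in for a cloud data warehouse.

```python
# Minimal ELT sketch using only the Python standard library.
# Assumes a hypothetical "sales_raw.csv" with "region" and "amount" columns;
# a real pipeline would target a cloud warehouse rather than SQLite.
import csv
import sqlite3

# Extract: read the raw rows from the source file as-is, with no cleanup yet.
with open("sales_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

conn = sqlite3.connect("warehouse.db")

# Load: deliver the raw data straight into a staging table.
conn.execute("CREATE TABLE IF NOT EXISTS staging_sales (region TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO staging_sales (region, amount) VALUES (:region, :amount)", rows
)

# Transform: push the work to the target database itself (the defining ELT step).
conn.execute("""
    CREATE TABLE IF NOT EXISTS sales_by_region AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total_amount
    FROM staging_sales
    GROUP BY region
""")
conn.commit()
conn.close()
```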

An ELT system can be considered a subset of the broader concept of a ‘data pipeline’. A data pipeline governs the movement of data from one system to another. Transformation may or may not happen, but the data is processed either in real time or in batches. Pipelines are primarily required when real-time or highly sophisticated data analysis is performed and when the data must be stored in the cloud.

Investing in a cost-effective and robust data pipeline is critical for an organization, for the following reasons:

  1. It enables real-time, secure analysis of data, even from multiple sources simultaneously, by storing the data in a cloud data warehouse.
  2. Built-in error handling means data is not lost even if loading fails.
  3. It provides an opportunity to cleanse and enrich data.
  4. It provides an immediate solution and thus saves the lead time involved in building an in-house solution.
  5. Schema changes and new data sources can be easily incorporated.

Now that we understand what data pipelines are used for and why they matter, let us look at Azure Data Factory. Azure Data Factory is a cloud-based data integration service that lets you create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. The data itself is not stored in Azure Data Factory; rather, the service lets you create data-driven workflows that move data between supported data stores and process it using compute services in other regions or in an on-premises environment.

The approach adopted by Azure Data Factory involves three steps (a code sketch follows the list):

  1. Connect and Collect: Connections are established to all the required data sources and processing services, such as SaaS services, file shares, FTP and web services. The data is then moved to a centralized data store in the cloud using the Copy Activity in a data pipeline, which can pull from both on-premises and cloud data stores.
  2. Transform and Enrich: In the centralized cloud data store, the data is transformed using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics and Machine Learning.
  3. Publish: The transformed data is delivered from the cloud to on-premises sources such as SQL Server, or it is retained in the cloud store for consumption by BI and other analytics tools.
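
As a sketch of the “connect and collect” step, the snippet below assembles a Copy Activity inside a pipeline using the azure-mgmt-datafactory Python models. The dataset, activity and pipeline names are hypothetical, and constructor signatures can vary slightly between SDK versions.

```python
# Sketch of a Copy Activity that moves data between two datasets in a pipeline.
# The referenced datasets are assumed to exist already; names are placeholders.
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    PipelineResource,
)

# References to the input and output datasets of the activity.
input_ref = DatasetReference(type="DatasetReference", reference_name="InputBlobDataset")
output_ref = DatasetReference(type="DatasetReference", reference_name="OutputBlobDataset")

# Copy Activity: read from Blob Storage and write to Blob Storage.
copy_activity = CopyActivity(
    name="CopyRawFiles",
    inputs=[input_ref],
    outputs=[output_ref],
    source=BlobSource(),
    sink=BlobSink(),
)

# A pipeline is simply a named group of activities.
pipeline = PipelineResource(activities=[copy_activity])
```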

 

The following four components work together to define input and output data, processing events, and the schedule and resources required to execute the desired data flow (a code sketch follows the list).

  1. Datasets represent data structures within the data stores: Input and output datasets represent the input and output, respectively, of an activity in a pipeline. For instance, an Azure Blob dataset specifies the container and folder in Azure Blob Storage, while an Azure SQL Table dataset specifies the table to which the activity writes its output.
  2. A pipeline is a group of activities: Activities are grouped into one unit to perform a task together. A data factory can have more than one pipeline.
  3. Activities define the actions to perform on the data: Data Factory supports two types of activities, data movement and data transformation.
  4. Linked services define the information Data Factory needs to connect to external resources: For instance, an Azure Storage linked service specifies a connection string for connecting to an Azure Storage account.
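
Here is a hedged sketch of the linked service and dataset components, again using the azure-mgmt-datafactory Python models. The connection string and blob paths are placeholders, and the wrapper types (LinkedServiceResource, DatasetResource) and required arguments may differ between SDK versions.

```python
# Sketch of a linked service and a dataset; all names and values are placeholders.
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    AzureStorageLinkedService,
    DatasetResource,
    LinkedServiceReference,
    LinkedServiceResource,
    SecureString,
)

# Linked service: the connection information Data Factory needs to reach the
# external Azure Storage account.
storage_linked_service = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)

# Dataset: points at a specific container/folder in that storage account and is
# used as the input or output of an activity.
blob_dataset = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureStorageLinkedService"
        ),
        folder_path="raw-data/input",
        file_name="input.csv",
    )
)
```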

 

Tools such as the Azure Portal, Visual Studio, PowerShell, the .NET API, the REST API and Azure Resource Manager templates can be used to create data pipelines in Azure Data Factory. One can get started by creating a data factory on Azure and then creating the four key components with any of the tools mentioned.
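
As an example of getting started programmatically, the following sketch creates a data factory with the Python SDK and triggers a pipeline run. It assumes the azure-identity and azure-mgmt-datafactory packages; the subscription, resource group, region and pipeline name are placeholders, and the pipeline itself would be published with create_or_update calls like the ones sketched earlier.

```python
# Sketch: create a data factory, then trigger and check a pipeline run.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"   # placeholder
resource_group = "<resource-group>"     # placeholder
factory_name = "<data-factory-name>"    # placeholder

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the data factory itself.
adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)

# Assuming linked services, datasets and a pipeline named "CopyPipeline" have
# already been published, trigger a run and read back its status.
run = adf_client.pipelines.create_run(
    resource_group, factory_name, "CopyPipeline", parameters={}
)
status = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
print(status.status)
```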

Mapping Data Flows is one of the newer features in Azure Data Factory that makes it a complete ELT solution. It combines control flows and data flows to move information into and out of data warehouses, and it lets customers build data transformations with an easy-to-use visual interface, without writing any code. The data flows are then executed as activities within Azure Data Factory pipelines.
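
To illustrate that last point, here is a tentative sketch of wiring a mapping data flow into a pipeline as an activity, assuming a data flow named TransformSales has already been authored in the visual designer. The names are hypothetical, and the ExecuteDataFlowActivity and DataFlowReference models are only present in recent versions of the azure-mgmt-datafactory package.

```python
# Sketch: execute an already-authored mapping data flow as a pipeline activity.
from azure.mgmt.datafactory.models import (
    DataFlowReference,
    ExecuteDataFlowActivity,
    PipelineResource,
)

# Reference the data flow built in the visual designer (name is a placeholder).
data_flow_activity = ExecuteDataFlowActivity(
    name="RunTransformSales",
    data_flow=DataFlowReference(
        type="DataFlowReference", reference_name="TransformSales"
    ),
)

# The data flow runs like any other activity when the pipeline is triggered.
pipeline = PipelineResource(activities=[data_flow_activity])
```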