SNOWFLAKE
A data warehouse is a critical part of any business organization. Many cloud-based data warehouses are available in the market today; this article focuses on one of them, Snowflake.
Snowflake is an analytical data warehouse provided as Software-as-a-Service (SaaS). Built on a new SQL database engine, it offers a unique architecture designed for the cloud and stands out among enterprise data warehouses through the breadth of features it provides. As a SaaS offering, it is considerably more flexible than traditional data warehouse products.
The distinctive features of Snowflake as a cloud service are as follows:
- Snowflake is built specifically for the cloud, so there is nothing physical to run: no hardware to select, install, configure, or manage. That makes it ideal for organizations that do not want to dedicate resources to support and maintenance.
- Ongoing maintenance, management, and tuning are handled by Snowflake.
- Snowflake's architecture provides the flexibility needed for big data workloads.
- Storage and compute are decoupled, so organizations can scale each up or down independently as needed and pay only for the resources they actually use (a warehouse-scaling sketch follows this list).
- Snowflake enables seamless sharing of data with data consumers, whether or not they are Snowflake customers themselves. Reader accounts, which can be created directly from the user interface, let a provider create and manage a Snowflake account on behalf of a consumer (a sharing sketch follows this list).
- Snowflake resolves concurrency issues through its multi-cluster architecture: queries running on one virtual warehouse never affect queries running on another, and each virtual warehouse can scale up or down as required.
- A combination of structured and semi-structured data can be loaded into the cloud database and analyzed without first transforming it into a fixed relational schema (a semi-structured example follows this list).
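To make the decoupled, elastic compute concrete, here is a minimal sketch in Snowflake SQL; the warehouse name, sizes, and cluster counts are illustrative assumptions, not taken from the original text:

```sql
-- Create an elastic, multi-cluster virtual warehouse (name and sizing are illustrative).
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3       -- scales out automatically under heavy concurrency
  AUTO_SUSPEND      = 300     -- suspend after 5 idle minutes, so you pay only for use
  AUTO_RESUME       = TRUE;

-- Resize on demand when more (or less) compute is needed; storage is unaffected.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
```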
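Data sharing and reader accounts can likewise be driven from SQL. The following sketch uses hypothetical object names and a placeholder password, and assumes the consumer has no Snowflake account of their own:

```sql
-- Share selected objects with a consumer (object names are hypothetical).
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;

-- For consumers without their own Snowflake account, the provider creates
-- and manages a reader account on their behalf.
CREATE MANAGED ACCOUNT partner_reader
  ADMIN_NAME = 'partner_admin', ADMIN_PASSWORD = '<strong-password>', TYPE = READER;
```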
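For semi-structured data, a small sketch (table, column, and attribute names assumed for illustration) shows JSON landing in a VARIANT column and being queried directly, with no fixed relational schema defined up front:

```sql
-- Land raw JSON next to structured columns; no upfront relational modeling needed.
CREATE TABLE IF NOT EXISTS events_raw (
  event_id NUMBER,
  payload  VARIANT            -- semi-structured JSON stored as-is
);

INSERT INTO events_raw
  SELECT 1, PARSE_JSON('{"user": {"id": 42, "country": "DE"}, "action": "login"}');

-- Query JSON attributes directly using path notation.
SELECT payload:user.id::NUMBER AS user_id,
       payload:action::STRING  AS action
FROM events_raw;
```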
Snowflake's architecture consists of three layers:
- Database Storage:
Snowflake reorganizes incoming data into its internal optimized, compressed, columnar format and stores the optimized data in cloud storage. This layer handles all aspects of data storage: organization, file size, structure, compression, metadata, and statistics. The stored data objects are accessible only through SQL query operations run using Snowflake.
- Query Processing:
Query processing in Snowflake is done using virtual warehouses. A virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider. Each virtual warehouse operates independently and shares no compute resources with other virtual warehouses, so one warehouse has no impact on the performance of the others (a brief warehouse-isolation example follows this list).
- Cloud Services:
This layer is a collection of services that coordinate activities across Snowflake, tying together all its components to process user requests, from login to query dispatch. The services provided in this layer include authentication, infrastructure management, metadata management, query parsing and optimization, and access control.
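As a minimal illustration of that isolation, separate warehouses can be dedicated to separate workloads over the same shared storage; the warehouse and table names below are assumptions for the sketch:

```sql
-- One warehouse per workload, so loading never contends with BI queries.
CREATE WAREHOUSE IF NOT EXISTS load_wh      WAREHOUSE_SIZE = 'SMALL';
CREATE WAREHOUSE IF NOT EXISTS reporting_wh WAREHOUSE_SIZE = 'LARGE';

-- Each session picks its own warehouse; the two clusters share storage
-- but no compute resources.
USE WAREHOUSE reporting_wh;
SELECT COUNT(*) FROM sales_db.public.orders;  -- runs on reporting_wh only
```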
Snowflake Data Accelerators
Quick analysis is only possible when the data is loaded first, without dependencies or delays. Once the data is loaded, machine learning and predictive analytics can rapidly produce useful insights. The loading process, however, typically requires dedicated resources for ELT development and operations, a workflow that increases cost, introduces redundant effort, and creates inconsistency.
The data ingestion cycle usually comes with challenges: high ingestion cost, long wait times before analytics can begin, varying ingestion standards, quality assurance and business analysis of the data that are not sustained, and changes that carry heavy cost and execute slowly.
To overcome these challenges, one can create a sustainable data ingestion framework that brings reusability across the enterprise and provides a consistent standard, eliminating the need to maintain several separate projects or programs.
The Data Load Accelerator provides exactly this: an intelligent framework that reduces or eliminates ELT coding effort, consolidates management, shortens development cycles, and supports complicated data load requirements.
The key features of the Data Load Accelerator include:
- Minimal and reusable coding: the model is configuration-based, and all data load requirements are managed with one code base.
- Snowflake allows loading of both structured and semi-structured datasets from cloud storage (a loading sketch follows this list).
- Snowflake, together with Talend and Azure/AWS capabilities, provides a dynamic, profile-driven architecture.
- There is no new language to learn; the accelerator can be managed with SQL alone.
- Workflows are designed and controlled effectively by orchestrating activities, multi-threading them, and managing dependencies between the threads.
- Data loading, metadata management, CI/CD, and DevOps follow consistent standards across the enterprise.
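The cloud-storage loading path can be sketched in Snowflake SQL as follows; the stage URL, credentials, and table names are placeholders, and the target tables are assumed to exist already:

```sql
-- External stage pointing at cloud storage (URL and credentials are placeholders).
CREATE STAGE IF NOT EXISTS landing_stage
  URL = 's3://my-bucket/landing/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

-- Structured data: load CSV files into a relational table.
COPY INTO sales_db.public.orders
  FROM @landing_stage/orders/
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Semi-structured data: load JSON files into a single VARIANT column.
CREATE TABLE IF NOT EXISTS sales_db.public.events_json (payload VARIANT);
COPY INTO sales_db.public.events_json
  FROM @landing_stage/events/
  FILE_FORMAT = (TYPE = JSON);
```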
The Data Load Accelerator works with the cloud storage layer to ingest data into Snowflake. It provides two executable components that can run with a dependency between them or be fully decoupled:
- The extract-and-load component copies the data into Snowflake.
- The transform component manages data preparation and transformations for complex business requirements (a minimal two-step sketch follows).
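A minimal sketch of how the two components could map to SQL, assuming a raw landing table and a curated target and reusing the landing_stage defined above (all names and business rules are hypothetical; in the accelerator these steps are orchestrated through Talend):

```sql
-- Step 1: extract and load -- copy files from cloud storage into a raw landing table.
COPY INTO raw.orders_landing
  FROM @landing_stage/orders/
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Step 2: transform -- apply business rules and populate the curated table.
-- This step can run immediately after step 1 or independently, on its own schedule.
INSERT INTO curated.orders
SELECT order_id,
       customer_id,
       TO_DATE(order_date) AS order_date,
       ROUND(amount, 2)    AS amount
FROM raw.orders_landing
WHERE amount IS NOT NULL;
```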
For seamless maintenance and support, automated reports collect Talend and Snowflake logs to provide information on data load statistics, errors, and audit findings. The accelerator is also designed to auto-detect past failure instances and customize the workflow in real time.
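On the Snowflake side, load statistics and errors of the kind those reports surface can be pulled with the COPY_HISTORY table function; the table name and time window below are illustrative:

```sql
-- Load statistics and errors for one table over the last 24 hours.
SELECT file_name,
       status,
       row_count,
       row_parsed,
       first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'ORDERS_LANDING',
       START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP())));
```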
RandomTrees has come up with an accelerator named D-Fast that helps fast-track data migration during a Snowflake implementation, with an approach focused on data quality, cost-effectiveness, and business value.
The key features include:
- Rapid migration of data from SAP BW and HANA.
- End-to-end data migration approach with pre-configured data pipelines.
- Automated data cleansing and predefined data validation.
- Reduced costs and risks, accelerated schedules, increased visibility and enhanced ROI.
- D-Fast drives efficiencies across the prepare, extract, design, map, transform, and load stages, resulting in total cost savings of roughly 30–40%.