DBT with Snowflake

Snowflake is an analytical data warehouse provided as Software-as-a-Service (SaaS). Built on a new SQL database engine, it offers a unique architecture designed for the cloud and stands out among enterprise data warehouses for the breadth of features it provides. Because it is a SaaS offering, it is also far more flexible than traditional data warehouse products.

The distinctive features of Snowflake as a cloud service are as follows:

  1. Snowflake is built specifically for the cloud, so there is no hardware to select, install, configure, or manage. That makes it ideal for organizations that do not want to dedicate resources to support and maintenance.
  2. Ongoing maintenance, management, and tuning are handled by Snowflake.
  3. Snowflake's architecture is flexible enough to handle big-data workloads.
  4. Storage and compute are decoupled, so organizations can conveniently scale up or down as needed and pay only for the resources they use.
  5. Snowflake enables seamless sharing of data with data consumers, whether or not they are Snowflake customers. Reader accounts, which can be created directly from the user interface, let a provider create and manage a Snowflake account on behalf of a consumer.
  6. Snowflake resolves concurrency issues with its multi-cluster architecture, in which queries running on one warehouse never affect queries on another; each virtual warehouse can also scale up and down as needed.
  7. A combination of structured and semi-structured data can be loaded into the cloud database and analyzed without first transforming it into a fixed relational schema (see the query sketch after this list).
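For example, Snowflake can query JSON held in a VARIANT column directly, without defining a relational schema up front. The sketch below is illustrative: the `raw_events` table and its `payload` column are hypothetical names, while the path notation and `::` casts are standard Snowflake SQL.

```sql
-- Load a JSON document into a VARIANT column and query it directly.
-- Table and column names are hypothetical.
CREATE TABLE raw_events (payload VARIANT);

INSERT INTO raw_events
SELECT PARSE_JSON('{"customer": {"id": 42, "name": "Acme"}, "amount": 19.99}');

-- Navigate the semi-structured payload with path notation, casting as needed.
SELECT
    payload:customer.id::NUMBER   AS customer_id,
    payload:customer.name::STRING AS customer_name,
    payload:amount::FLOAT         AS amount
FROM raw_events;
```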

Here is a closer look at Snowflake's architecture.

The architecture consists of three layers:

  1. Database Storage:

When data is loaded, Snowflake reorganizes it into its internal optimized, compressed, columnar format and stores that optimized data in cloud storage. This layer handles all aspects of data storage: organization, file size, structure, compression, metadata, and statistics. The data objects it holds are accessible only through SQL query operations run using Snowflake.

  2. Query Processing:

Query processing in Snowflake is done using virtual warehouses. A virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider. Each virtual warehouse operates independently and does not share compute resources with the others, so one warehouse has no impact on the performance of another.
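As a rough illustration, a virtual warehouse is provisioned and resized with plain SQL; the warehouse name below is hypothetical, and the size and auto-suspend values are just example settings.

```sql
-- Create an extra-small virtual warehouse that suspends itself when idle.
-- Name and parameter values are illustrative.
CREATE WAREHOUSE IF NOT EXISTS transform_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND   = 60     -- seconds of inactivity before suspending
    AUTO_RESUME    = TRUE;  -- resume automatically when a query arrives

-- Scale it up or down independently of every other warehouse.
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```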

  3. Cloud Services:

This layer is a collection of services that coordinate activities across Snowflake, tying all of its components together to process user requests from login to query dispatch. The services provided in this layer include authentication, infrastructure management, metadata management, query parsing and optimization, and access control.
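Access control, for instance, is handled through roles and grants. The sketch below uses hypothetical role, database, schema, and user names.

```sql
-- Create a role and grant it read access to a reporting schema.
-- All object names here are placeholders.
CREATE ROLE IF NOT EXISTS analyst_role;

GRANT USAGE  ON DATABASE analytics                        TO ROLE analyst_role;
GRANT USAGE  ON SCHEMA   analytics.reporting              TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.reporting  TO ROLE analyst_role;

-- Assign the role to a user.
GRANT ROLE analyst_role TO USER jane_doe;
```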

DBT (Data Build Tool) is a command-line tool that enables analysts to transform data in a data warehouse more effectively. A lot of transformation has to happen between loading data into the warehouse and analyzing it, and DBT opens that process up to anyone with even a little knowledge of SQL.
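To give a sense of what that looks like, a DBT transformation is just a SELECT statement saved as a model file; DBT compiles it and builds the corresponding object in the warehouse. This is only a sketch: the file path, source, and column names are hypothetical, and the `raw.orders` source is assumed to be declared in the project's sources YAML.

```sql
-- models/staging/stg_orders.sql
-- A minimal DBT model: DBT wraps this SELECT in the DDL needed to
-- materialize it in Snowflake. Names are illustrative.
{{ config(materialized='view') }}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('raw', 'orders') }}
where order_date is not null
```

Running `dbt run` then builds the model in the target Snowflake schema.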

When DBT is used with Snowflake, data transformation becomes considerably easier.

Use cases for DBT with Snowflake include:

  1. Suppose 100 tables need to be changed into views. Doing this with Snowflake alone, table by table in SQL, is very time-consuming, but with DBT it can be achieved by changing the materialization setting in a single file (see the sketch after this list).
  2. DBT supports version control through Git, so conflicts between developers working on the same object are minimized.
  3. DBT comes with many pre-built packages of macros for common tasks, such as getting the current timestamp, truncating dates, fetching the last day of a period, splitting text by a delimiter, getting the unique values of a column, generating surrogate keys, and pivoting/unpivoting values between rows and columns.
  4. Development time is reduced because DBT offers version control, data validation, quality assurance, auto-generated documentation, and packages that support fast development.
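To make the first and third points concrete, here is a sketch of a DBT model for Snowflake that sets its materialization with a one-line config and uses a couple of package macros. The model, column, and upstream names are hypothetical, the dbt_utils package is assumed to be installed, and a reasonably recent dbt-core version is assumed for the `dbt.current_timestamp()` macro.

```sql
-- models/marts/dim_customers.sql
-- Flipping 'view' to 'table' here (or setting materialization once for a whole
-- folder of models in dbt_project.yml) changes how the model is built,
-- without rewriting any SQL. Names are illustrative.
{{ config(materialized='view') }}

select
    -- dbt_utils macro that hashes the listed columns into a surrogate key
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'region']) }} as customer_key,
    customer_id,
    region,
    -- cross-database macro that compiles to Snowflake's current_timestamp()
    {{ dbt.current_timestamp() }} as loaded_at
from {{ ref('stg_customers') }}
```

Setting `materialized` at the folder level in `dbt_project.yml` is what makes the single-file change in the first use case possible: one setting flips every model under that folder from table to view.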

Although plenty of data transformation tools, such as Azure Data Factory, Apache Kafka, and SAP Data Services, are available in the market, DBT stands out because it is open source and far more approachable than the rest.