Google BigQuery


An enterprise data warehouse is a database, or a collection of databases, that centralizes business information from multiple sources and makes it available for analytics. Google BigQuery is one such enterprise data warehouse. Built on Google's internal infrastructure and offered through Google Cloud Platform, it is completely serverless, and complex queries execute in just a few seconds. Google has used the underlying technology internally for analytics for years. BigQuery scales to exabytes of storage and supports petabyte-scale SQL queries.

As an organization grows its business, there is a colossal amount of data to handle from an ever-growing number of applications. To get effective insights from that data, an efficient analytics tool is required: data from the various sources is consolidated into a centralized data store, and analytics are performed on it. BigQuery allows users to focus mainly on analytical activities, such as querying business-critical data. In addition, its REST API enables businesses to build App Engine-based dashboards and mobile front-ends.
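
For instance, a dashboard back-end can run a query in a few lines of Python through the google-cloud-bigquery client library, which wraps that REST API. The sketch below is illustrative: the project name is a placeholder, while the table queried is a real BigQuery public dataset.

```python
from google.cloud import bigquery

# Hypothetical project; credentials are picked up from the environment
# (e.g. GOOGLE_APPLICATION_CREDENTIALS).
client = bigquery.Client(project="my-project")

# Run a standard SQL query against a public dataset and stream rows back.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.name, row.total)
```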


Below are a few reasons to use BigQuery:

  1. BigQuery completely replaces the hardware setup of a typical data warehouse, acting as a collective home for all analytical data in the entire organization.
  2. Tables and views in BigQuery function the same way as they do in a traditional data warehouse, which makes it easy to adopt.
  3. BigQuery comes with Identity and Access Management (IAM), which allows users to grant permissions for specific actions.
  4. BigQuery can define a schema and issue queries directly on external data as federated data sources.
  5. BigQuery supports a wide variety of data formats for ingestion, such as CSV, JSON, Avro and Parquet. The preferred format, however, is Avro: a binary row-based format that BigQuery can split and read in parallel with multiple workers (the sketch after this list shows a batch load).
  6. Efficient storage is another key feature of BigQuery. Tables are stored in an optimized columnar format, and each table is compressed and encrypted. All tables accept streaming ingestion. Each table is replicated across multiple data centers, making it durable and consistent; downtime in one region or zone is not a worry because the data remains available across regions.
  7. BigQuery comes with strong built-in ML/AI capabilities and supports many analytical use cases through AutoML Tables and BigQuery ML. AutoML Tables provides an automated modeling solution behind a code-less graphical UI, while BigQuery ML offers models such as Logistic Regression, K-means and Naïve Bayes through a SQL interface, and also supports AutoML Tables (the same sketch after this list trains a model with BigQuery ML).
  8. BigQuery requires an ultra-fast network that can deliver terabytes of data in seconds directly from storage into compute for running Dremel jobs. This is taken care of by the Jupiter network, which provides 1 Petabit/sec of total bisection bandwidth.
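
To make items 5 and 7 concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The project, bucket, dataset, table and column names are hypothetical placeholders, and the model assumes the loaded table has a boolean churned column to use as a label; treat this as an illustration of the flow, not a drop-in script.

```python
from google.cloud import bigquery

# Hypothetical project; credentials come from the environment.
client = bigquery.Client(project="my-project")

# Item 5: batch-load an Avro file from Cloud Storage. Avro is self-describing,
# so BigQuery infers the schema and can split the file across workers.
load_job = client.load_table_from_uri(
    "gs://my-bucket/events.avro",
    "my-project.analytics.events",
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
)
load_job.result()  # Block until the load job finishes.

# Item 7: train a logistic regression model with BigQuery ML, entirely in SQL.
client.query(
    """
    CREATE OR REPLACE MODEL `analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT * FROM `analytics.events`
    """
).result()

# Batch prediction over the same table with ML.PREDICT.
rows = client.query(
    "SELECT * FROM ML.PREDICT(MODEL `analytics.churn_model`, TABLE `analytics.events`)"
).result()
```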


Architecture of BigQuery

The architecture of BigQuery is built on Dremel technology. Dremel is Google's interactive ad-hoc query system for the analysis of read-only nested data, and it acts as the execution engine for BigQuery. BigQuery adopts Dremel's columnar storage and tree architecture, and utilizes Google technologies like Borg, Colossus, Capacitor and Jupiter. A BigQuery client interacts with the Dremel engine via a client interface. Borg, Google's large-scale cluster management system, allocates the compute capacity for Dremel jobs. Dremel jobs read data from Google's Colossus file system over the Jupiter network, perform SQL operations and return the results to the client. Dremel implements a multi-level serving tree to execute queries.
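
The multi-level serving tree is easiest to picture with a toy model: leaf workers each scan one shard of the data and produce a partial aggregate, and every level above merges the partials on the way back to the root. The Python sketch below is purely illustrative, not Google's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def leaf_scan(shard):
    # A leaf server scans only its own shard and returns a partial aggregate.
    return sum(shard)

def mixer(partials):
    # An intermediate (or root) server merges partials from its children.
    return sum(partials)

def serving_tree_sum(shards, fan_in=2):
    # Toy serving tree computing SELECT SUM(x) over sharded data.
    with ThreadPoolExecutor() as pool:
        level = list(pool.map(leaf_scan, shards))  # leaf level
        while len(level) > 1:                      # mixer levels up to the root
            groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
            level = list(pool.map(mixer, groups))
    return level[0]

shards = [[1, 2, 3], [4, 5], [6, 7, 8], [9]]  # data split across leaf servers
print(serving_tree_sum(shards))               # -> 45
```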


BigQuery is designed to separate storage (Colossus) from compute (Borg). Both can be scaled independently, which makes BigQuery an elastic, economical and more scalable data warehouse.

Storage

BigQuery stores data in a columnar format called Capacitor. Each field in a BigQuery table (a column) is stored in a separate Capacitor file, which allows BigQuery to achieve a high compression ratio and high scan throughput. Data can be imported into BigQuery through either batch loads or streaming. While importing, each column is encoded into the Capacitor format; after encoding is complete, the data is written to Colossus. Encoding also collects various statistics about the data that are later used for query planning.
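
Capacitor's actual encodings are proprietary, but a toy run-length encoding shows why a columnar layout compresses so well: values within a single column are similar and often repeated, so storing each column separately lets even simple encodings do a lot of work. A minimal, purely illustrative sketch:

```python
from itertools import groupby

def rle_encode(column):
    # Collapse each run of repeated values into a (value, run_length) pair.
    # Run lengths are also the kind of statistic a query planner can exploit.
    return [(value, len(list(run))) for value, run in groupby(column)]

country = ["US", "US", "US", "IN", "IN", "US", "US"]
print(rle_encode(country))  # -> [('US', 3), ('IN', 2), ('US', 2)]
```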

Capacitor files are stored in Colossus, Google's next-generation distributed file system and the successor to GFS (Google File System). Colossus handles cluster-wide replication, recovery and distributed management, and provides client-driven replication and encoding. While writing data to Colossus, BigQuery decides on an initial sharding strategy, which then evolves based on query and access patterns. Once the write is complete, BigQuery initiates geo-replication of the data across different data centers to give the highest availability. This reduces the risk of data being unavailable even when a region experiences downtime.

Capacitor and Colossus together form the backbone of BigQuery's performance. Colossus allows the data to be split into multiple partitions for blazing-fast parallel reads, while Capacitor's compression reduces the scan throughput required. Together, these two characteristics help BigQuery process a terabyte of data per second.

BigQuery supports both standard SQL and legacy SQL. The Dremel engine uses a semi-flattening tree data structure along with a standard SQL evaluation algorithm to support standard SQL query computation.
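
Standard SQL is also comfortable with the nested, repeated data that Dremel was originally designed for. The short sketch below is illustrative: an inline array of STRUCTs stands in for a real table with a repeated tags field, and the project name is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Standard SQL over nested, repeated data: UNNEST flattens the repeated
# `tags` field so each (post, tag) pair becomes one output row.
query = """
    SELECT post.name, tag
    FROM UNNEST([
        STRUCT('post1' AS name, ['gcp', 'bigquery'] AS tags),
        STRUCT('post2' AS name, ['sql'] AS tags)
    ]) AS post, UNNEST(post.tags) AS tag
"""
for row in client.query(query).result():
    print(row.name, row.tag)
```

Infrastructure technologies like Borg, Colossus and Jupiter are the key factors that allow BigQuery to stand out among its counterparts.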

#RandomTrees #BigQuery #GoogleBigQuery