Redshift as a Modern Data Warehouse


Business data analysis, data processing, data mining, and predictive analytics are a few of the many functions a business enterprise performs, and a dedicated system is typically used to support them collectively. A data warehouse is one such system: it stores data from one or more sources, and keeping current and historical data in one place makes data analysis and reporting much easier.

Data warehouses are generally classified into three types:

  1. Enterprise Data Warehouse (EDW): Acts as a centralized data warehouse and provides a unified data solution across the organization.
  2. Operational Data Store (ODS): Used when data needs to be refreshed in near real time and for routine activities, such as storing employee records.
  3. Data Mart: A subset of a data warehouse. Data can be collected from different sources, and the mart typically serves a particular business stream such as sales or finance.

Apart from traditional data warehouses, there are also cloud-based data warehouses, which are cost-effective, quicker to provision, easily scalable, and able to run complex analytical queries much faster because they use massively parallel processing (MPP).


The right data warehouse system is chosen based on the organization’s needs. The top cloud-based data warehouses currently on the market include Amazon Redshift, Azure Synapse Analytics, Google BigQuery, Oracle Autonomous Data Warehouse, and IBM Db2 Warehouse on Cloud.

Let us take a deeper look into what Amazon Redshift offers as a data warehouse.


Redshift currently powers analytical workloads for Fortune 500 companies, start-ups, and many companies in between, which indicates that it delivers the critical features expected of a data warehouse. Amazon Redshift is a fully managed, petabyte-scale, simple, and cost-effective data warehouse service. AWS claims up to ten times the performance of other data warehouses, achieved through machine learning, massively parallel query execution, and columnar storage on high-performance disks, and a data warehouse can be set up and deployed very quickly and easily. Redshift was also one of the first MPP data warehouse solutions architected for the cloud.
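As a minimal sketch of how quick that setup can be, the snippet below provisions a small cluster with the AWS SDK for Python (boto3). The cluster identifier, node type, region, and credentials are placeholder assumptions, not values from this article:

```python
import boto3

# Create a Redshift API client (credentials come from the usual AWS config chain).
redshift = boto3.client("redshift", region_name="us-east-1")

# Provision a small two-node cluster; all names and sizes here are
# illustrative placeholders, not recommendations.
redshift.create_cluster(
    ClusterIdentifier="demo-cluster",
    ClusterType="multi-node",
    NodeType="ra3.xlplus",
    NumberOfNodes=2,
    DBName="dev",
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe1234",  # placeholder; use AWS Secrets Manager in practice
)

# Block until the cluster is ready to accept queries.
waiter = redshift.get_waiter("cluster_available")
waiter.wait(ClusterIdentifier="demo-cluster")
print("Cluster is available")
```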


As one of the fastest-growing AWS services, Redshift brings a wide range of efficiencies and capabilities. Top enterprises such as Johnson & Johnson, Nasdaq, and Amgen have migrated to Amazon Redshift.

Redshift lays the foundation for a modern analytics pipeline by letting the organization automate many administrative tasks. As mentioned, setup and deployment time is minimal. It delivers fast query performance, improves I/O efficiency, and can be scaled up and down as performance and capacity needs change: queries over petabytes of data can still return results in seconds, and even at petabyte scale it costs less than traditional data warehouse solutions.
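To illustrate that elasticity, here is a hedged sketch of resizing a hypothetical cluster with boto3; the identifier, node type, and node counts are placeholder assumptions:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Elastic resize: grow the (hypothetical) cluster from 2 to 4 nodes.
# Classic=False requests an elastic resize, which typically completes
# in minutes rather than hours.
redshift.resize_cluster(
    ClusterIdentifier="demo-cluster",
    ClusterType="multi-node",
    NodeType="ra3.xlplus",
    NumberOfNodes=4,
    Classic=False,
)
```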

Redshift provides the flexibility to query seamlessly across the data warehouse and a data lake. A data lake is essentially a vast pool of raw data stored in its native format; Azure Data Lake, for example, is Microsoft’s scalable data storage and analytics service, while on AWS this role is typically played by Amazon S3. With a data warehouse on AWS, beyond the data stored on local disks in the warehouse, vast amounts of unstructured data can be queried in an Amazon S3 data lake without first transforming it.

Amazon Redshift enables this through a feature called Spectrum. Spectrum gives you the freedom to store data wherever desired, in any format you want, and have it readily available for processing whenever required. Querying data through Spectrum works just like querying data on local disks in Amazon Redshift, except that there is no need to ingest the data into the Redshift cluster first. Redshift Spectrum leverages the powerful Redshift query optimizer and automatically scales to thousands of nodes. Supported data formats include CSV, TSV, Parquet, SequenceFile, and RCFile.
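As a minimal sketch of how Spectrum is used, the snippet below registers an external schema and table over S3 data and queries it in place, using redshift_connector (AWS’s Python driver for Redshift). The host, credentials, Glue catalog database, S3 bucket, table definition, and IAM role ARN are all hypothetical placeholders:

```python
import redshift_connector  # AWS's Python driver for Amazon Redshift

# Connect to a hypothetical cluster; host and credentials are placeholders.
conn = redshift_connector.connect(
    host="demo-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="ChangeMe1234",
)
conn.autocommit = True  # external-table DDL cannot run inside a transaction
cur = conn.cursor()

# Register an external schema backed by the AWS Glue Data Catalog.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG
    DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

# Define an external table over Parquet files sitting in S3;
# no data is ingested into the cluster.
cur.execute("""
    CREATE EXTERNAL TABLE spectrum.sales (
        sale_id  INT,
        amount   DECIMAL(10,2),
        sold_at  TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/sales/'
""")

# Query the S3 data exactly as if it were a local Redshift table.
cur.execute("SELECT COUNT(*) FROM spectrum.sales WHERE amount > 100")
print(cur.fetchone())
```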


Migrating to Redshift from an on-premises data warehouse can be done through different strategies, chosen based on the following factors:

  1. Network bandwidth between the source server and AWS.
  2. Whether the migration and switchover to AWS will be done in one step or as a sequence of steps over time.
  3. The rate of data change in the source system.
  4. Transformations needed during migration.
  5. The partner tool that you plan to use for migration and ETL.


Further, the migration itself can be done in one step or two. One-step migration is feasible for small databases that do not require continuous operation: the existing database is extracted as CSV files and, with the help of services such as AWS Snowball (formerly AWS Import/Export Snowball), the dataset is delivered to Amazon S3 for loading into Amazon Redshift. Two-step migration suits larger databases. The first step, the initial data migration, extracts the data from the source database and migrates it to Amazon Redshift following the one-step approach. The second step, changed-data migration, propagates the data that changed in the source database after the initial migration to the destination before switchover, synchronizing the source and destination databases.
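As a sketch of the load step, assuming the extracted CSV files have already landed in a hypothetical S3 bucket and a suitable IAM role exists, a COPY command performs the parallel bulk load into Redshift; the table name, bucket path, and role ARN below are placeholders:

```python
import redshift_connector

conn = redshift_connector.connect(
    host="demo-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="ChangeMe1234",
)
cur = conn.cursor()

# Bulk-load gzipped CSV files from S3 into an existing table.
# COPY runs in parallel across the cluster's slices, which is why it is
# the preferred way to ingest large extracts.
cur.execute("""
    COPY public.orders
    FROM 's3://my-migration-bucket/export/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV
    GZIP
    IGNOREHEADER 1
""")
conn.commit()
```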

Data migration can also be self-serviced with the help of several tools and technologies. Some organizations may have built custom schemas over time, may have data spread across a variety of online and offline sources, or may have mission-critical applications that depend on the existing data warehouse. In such scenarios, an AWS Partner Network systems integration and consulting partner can provide specialized resources to ensure the migration goes smoothly.

#RandomTrees  #Redshift #Datawarehouse