Designing a Scalable ETL Pipeline on GCP with Dataproc and BigQuery

In this blog post, we walk through the design and implementation of a scalable ETL pipeline using Google Cloud Platform (GCP), PySpark on Dataproc, and BigQuery. The case study is based on a real-world scenario involving employee travel records that were originally stored in CSV files on a server.

Introduction

In enterprise data workflows, it is common for teams to work directly with raw data files like CSVs for reporting or visualization. However, this practice often leads to inconsistent formatting, a lack of centralized data governance, and challenges in maintaining data quality. To address these issues, scalable ETL pipelines are introduced, which extract raw data, apply the necessary transformations, and load the cleaned dataset into a robust data warehouse.

This blog presents a hands-on case study where such a pipeline is implemented using GCP services: Cloud Storage, Dataproc (PySpark), and BigQuery. The dataset contains employee travel records originally stored in CSV format.

Case Study Overview

We are working on a case study where employee travel records are maintained in CSV files on a server. In the current setup, the visualization team accesses these CSV files directly, which bypasses proper data cleaning and processing.

To address this issue, the client has requested a proper ETL process to clean, transform, and load the data into BigQuery. The visualization team will then use BigQuery as the data source for their dashboards.

[Diagram: Current Architecture vs. Future Architecture]

Required Transformations

  1. All date column values should follow a uniform date format: dd-mm-yyyy.
  2. Pax_Name field values should be cleaned. Prefixes such as "Mr.", "MR", or "MRS" must be removed. For example, MR. ABG001 should become ABG001.
  3. The Airline field has data repetition. For instance, some values appear as Indigo/Indigo, which should be simplified to just Indigo.
  4. All values in the Airline field should have only the first letter capitalized.
  5. Inconsistent naming in the Airline field should be standardized. For instance, both Indgo and Indigo should be transformed to Indigo.

Dataset and Environment Setup

The dataset is a sample of airline travel data containing details of trips made by employees. It is stored locally as a CSV file and contains 51 records. The task is to build an ETL script in Python that extracts this data, applies the required transformations, and loads the cleaned result into BigQuery.

Step 1: Upload CSV to Google Cloud Storage (GCS)

We begin by uploading the local CSV file to a GCS bucket using the GCP Cloud SDK.

Bucket gsutil URI: gs://casestudy_etl

gsutil cp D:\GCP\casestudyETL.csv gs://casestudy_etl/

Once uploaded, we verify the file in the GCS bucket. Extract operation Step 1 completed!

Step 2: Initialize Spark and Load CSV into DataFrame

We create a Dataproc cluster and open JupyterLab to write our PySpark script.

Import Required Modules

If you need any other modules at any point, you can import them there.

Create the Spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CaseStudy").getOrCreate()

Define Variables

We need a dataset in BigQuery. Here, my dataset is casestudy, and EmpTravelRecords is the name of the table I want created when the data is loaded into BQ.
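As a minimal sketch, the variables might look like this (the dataset and table names come from the case study; the bucket name and its role as the connector's staging bucket follow the earlier upload step):

```python
# Names from the case study -- replace with your own project's values.
bucket = "casestudy_etl"          # GCS bucket, also used as the BigQuery connector's temporary bucket
dataset = "casestudy"             # BigQuery dataset (must already exist)
table = "EmpTravelRecords"        # table to be created when the data is loaded
bq_table = f"{dataset}.{table}"   # fully qualified table reference for the connector
```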

READ: Finally, create a DataFrame out of the CSV file.

Extract operation completed!

Step 3: Transform the Data Using PySpark

1. Standardize Date Format to dd-mm-yyyy

Formatting the date columns: two columns contain dates, Booking_date and Travel_date. In Booking_date, we can see several different date formats, so the column needs to be organised and cleaned.

Initially, we attempt this using substr to extract the actual date, removing the trailing 00:00 from the column records. We then refine it further by handling the exceptions.
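The fallback idea can be sketched in plain Python as below; the candidate input formats are assumptions for illustration, since the exact source formats vary. In the PySpark script, the same effect comes from combining substr/to_date attempts and formatting the result with date_format:

```python
from datetime import datetime

# Candidate input formats -- assumptions for illustration; extend as needed.
INPUT_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y"]

def normalize_date(value):
    """Return the date as dd-mm-yyyy, trying each known input format in turn."""
    value = value.strip()
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%d-%m-%Y")
        except ValueError:
            continue
    return None  # unparseable -- surfaces as a null to be dropped later
```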

2. Clean Pax_Name Field

Some records carry prefixes such as Mr., MR, or MRS, and all of these must be removed. For example, MR. ABG001 should become ABG001.

We first try simply splitting the string — but this gives null for records that contain no space! Therefore, we use when-otherwise: if a space is present, split the record; otherwise keep it as it is.
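The same when/otherwise logic, shown row-wise in plain Python for illustration (in the PySpark script it is expressed column-wise with when(...).otherwise(...) and split):

```python
def clean_pax_name(name):
    # If a space is present (e.g. "MR. ABG001"), keep only the part after it;
    # otherwise the record has no prefix, so keep it as-is.
    if " " in name:
        return name.split(" ", 1)[1]
    return name
```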

3. Remove Redundant Airline Values

The Airline field has data repetition: in some places the value appears as Indigo/Indigo, whereas it should be just Indigo. Here, we split the record on ‘/’ and keep the first part.

4. Properly Capitalize Airline Names

For all values in the Airline column, the first letter should be capitalized and the rest lowercase, e.g. Spicejet. We use the initcap function for this.

5. Fix Misspelled Airline Names

For example, if Indigo is written in one place and Indgo in another, both should be standardized to the proper, correct name: Indigo. We handle this with a dictionary-based replacement.
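Steps 3–5 can be sketched together in plain Python; in the PySpark script the same effect comes from split, initcap, and a dictionary-based DataFrame.replace. The spelling map below contains only the one correction mentioned in the case study — extend it as more variants turn up:

```python
# Known misspellings -> canonical names (only the documented case; extend as needed).
SPELLING_FIXES = {"Indgo": "Indigo"}

def clean_airline(value):
    value = value.split("/")[0]   # "Indigo/Indigo" -> "Indigo"
    value = value.title()         # "SPICEJET" -> "Spicejet" (plain-Python stand-in for initcap)
    return SPELLING_FIXES.get(value, value)
```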

6. Remove Duplicates and Null Values

Finally, we drop duplicate rows and rows containing null values, so only clean records move forward. Transformations are now completed!

Step 4: Load Cleaned Data into BigQuery

After running the load step, the table EmpTravelRecords is successfully created in BigQuery.
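The load step can be sketched as follows using the spark-bigquery connector (the table and bucket names follow the case study; it is assumed the connector is available on the Dataproc cluster, where it stages data through a temporary GCS bucket):

```python
def load_to_bigquery(df, table="casestudy.EmpTravelRecords",
                     temp_bucket="casestudy_etl"):
    # Write the cleaned DataFrame to BigQuery; the connector stages the data
    # in the temporary GCS bucket before loading it into the target table.
    (df.write.format("bigquery")
        .option("table", table)
        .option("temporaryGcsBucket", temp_bucket)
        .mode("overwrite")
        .save())
```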

Check in BigQuery: the table has been created!

Key Learnings and Challenges Faced

  • Data Format Inconsistency: Date columns had inconsistent formats, requiring fallback strategies using both to_date and substr methods.
  • Unstructured Text Cleaning: Fields like Pax Name and Airline contained titles, abbreviations, or repeated values that needed conditionally applied logic with split and when functions.
  • Handling Nulls and Duplicates: Dropping nulls and duplicates ensured only clean records entered BigQuery but required thoughtful ordering to avoid premature data loss.
  • Standardization through Mapping: Common spelling errors or naming variations in airline names were resolved using a dictionary-based replacement approach.
  • PySpark and BigQuery Integration: Ensuring the correct temporary bucket and table configurations for seamless loading into BigQuery demanded precise setup.

These challenges highlighted the importance of flexible scripting and iterative testing in building a reliable ETL pipeline.

Conclusion

This ETL pipeline addresses real-world data quality issues by applying the necessary cleaning and standardization steps to employee travel records. Using PySpark on Dataproc and leveraging BigQuery as a data warehouse ensures scalability, performance, and ease of access for visualization and analytics.
