

In this blog post, we walk through the design and implementation of a scalable ETL pipeline using Google Cloud Platform (GCP), PySpark on Dataproc, and BigQuery. The case study is based on a real-world scenario involving employee travel records that were originally stored in CSV files on a server.
In enterprise data workflows, it is common for teams to work directly with raw data files like CSVs for reporting or visualization. However, this practice often leads to inconsistent formatting, a lack of centralized data governance, and challenges in maintaining data quality. To address these issues, scalable ETL pipelines are introduced, which extract raw data, apply the necessary transformations, and load the cleaned dataset into a robust data warehouse. This blog presents a hands-on case study where such a pipeline is implemented using GCP services: Cloud Storage, Dataproc (PySpark), and BigQuery. The dataset contains employee travel records originally stored in CSV format.
We are working on a case study where employee travel records are maintained in CSV files on the server. In the current setup, the visualization team directly accesses these CSV files, which compromises data cleaning and processing. To address this issue, the client has requested a proper ETL process to clean, transform, and load the data into BigQuery. The visualization team will then use BigQuery as the data source for their dashboards.

Current Architecture vs. Future Architecture

Required Transformations
Below is the airline's sample data, which contains details of travel made by employees. The file is also attached on the drive: Airline's Sample Data. Consider the above data, assume it is placed on your local machine, and build an ETL script in Python to extract the data, apply transformations, and then load it into any open-source database for this assignment. The dataset used contains travel details of employees. It is stored locally in a CSV file and contains 51 records. The goal is to extract this data, apply the necessary transformations, and load it into BigQuery. The data looks like this:
We begin by uploading the local CSV file to a GCS bucket using the GCP Cloud SDK.

Bucket gsutil URI: gs://casestudy_etl

gsutil cp D:\GCP\casestudyETL.csv gs://casestudy_etl/

Once uploaded, we verify the file in the GCS bucket. Extract operation, Step 1, completed!
We create a Dataproc cluster and open JupyterLab to write our PySpark script.
If you need any other modules at any point, you can import them there as well. Create the Spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CaseStudy").getOrCreate()
READ: finally, create a DataFrame from the CSV file. Extract operation completed!
Formatting the date columns: two columns contain dates, Booking_date and Travel_date. In Booking_date, we can see several different date formats, so the column needs to be organised and cleaned. Initially, we attempt this using substr to extract the actual date, removing the trailing 00:00 from the records, and then refine it further by handling the exceptions:
The employee IDs carry prefixes in some places: Mr., MR, or MRS. Remove all prefixes; for example, MR. ABG001 should become ABG001. We first try splitting the string. Ohh ohhhh!! It gives null for the records that do not contain a space. Therefore, we need to use when-otherwise: if a space is present, split the record; otherwise keep it as it is.
The Airline field has data repetition: in some places the value is Indigo/Indigo, whereas it should be just Indigo. Here, we split the record on '/'.
For all values in the Airline column, the first letter should be capitalized and the rest lowercase, e.g. Spicejet. We will use the initcap function.
For example, if Indigo is written in one place and Indgo in another, it should be corrected to the proper name, Indigo.
Transformations are now completed.
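The load step can be sketched with the spark-bigquery connector, which ships with Dataproc; the dataset, table, and staging-bucket names below are placeholders, and the snippet needs GCP credentials, so it runs on the cluster rather than locally:

```python
# Sketch only: assumes df is the cleaned DataFrame from the steps
# above, and that the spark-bigquery connector is on the classpath.
# "my_dataset.travel_records" and the staging bucket are placeholders.
df.write.format("bigquery") \
    .option("table", "my_dataset.travel_records") \
    .option("temporaryGcsBucket", "casestudy_etl") \
    .mode("overwrite") \
    .save()
```

The connector stages the data in the named GCS bucket before issuing a BigQuery load job, which is why a temporary bucket must be supplied.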
Check in BigQuery: the table has been created!
These challenges highlighted the importance of flexible scripting and iterative testing in building a reliable ETL pipeline.
This ETL pipeline addresses real-world data quality issues by applying the necessary cleaning and standardization steps to employee travel records. Using PySpark on Dataproc and leveraging BigQuery as the data warehouse ensures scalability, performance, and ease of access for visualization and analytics.