Blog Part 2: Orchestrating SQL-based Transformations with Airflow in GCP
Introduction
In Part 1, we covered how to set up the GCP environment, create datasets, and prepare the schema for our social media project. Now, in Part 2, we’ll focus on building an Apache Airflow DAG that automatically reads SQL files from Cloud Storage and executes them in BigQuery. This approach keeps the transformation logic in plain SQL files and brings automation into the data pipeline.
Upload and Configure Airflow DAG
1. Open Airflow and Upload DAGs
- Click on your Composer Environment → Open Airflow UI
- Note the GCS path (like gs://your-bucket/dags/)
- Create the DAG file (a minimal sketch is provided after this list)
- Save the file as social_media_transform_dag.py
- Upload it to the Composer DAG folder, gs://your-bucket/dags/ (in this project: gs://us-central1-social-media-en-91eb1f46-bucket/dags)
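Since the project’s exact DAG isn’t reproduced here, the snippet below is a minimal sketch of what social_media_transform_dag.py could look like. The schedule, the SQL file names, and the assumption that the sql/ folder is synced to the workers (for example by placing it under the bucket’s dags/ folder, which Composer mirrors at /home/airflow/gcs/dags/) are placeholders to adapt to your environment.

```python
# social_media_transform_dag.py -- a minimal sketch, not the project's exact DAG.
# Assumption: the SQL scripts live in a folder that Composer syncs to the workers,
# e.g. gs://<composer-bucket>/dags/sql/ -> /home/airflow/gcs/dags/sql/.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

SQL_FILES = [  # one BigQuery job per transformation script
    "top_hashtags.sql",
    "inactive_users.sql",
    "comments_per_user.sql",
]

with DAG(
    dag_id="social_media_transform_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # Let Jinja resolve {% include %} against the synced sql/ folder.
    template_searchpath=["/home/airflow/gcs/dags/sql"],
) as dag:
    previous = None
    for sql_file in SQL_FILES:
        task = BigQueryInsertJobOperator(
            task_id=f"run_{sql_file.replace('.sql', '')}",
            configuration={
                "query": {
                    # The file's contents are templated into the job at runtime.
                    "query": f"{{% include '{sql_file}' %}}",
                    "useLegacySql": False,
                }
            },
        )
        if previous:
            previous >> task  # run the scripts one after another
        previous = task
```

If you keep the sql/ folder somewhere else in the bucket, adjust template_searchpath accordingly; Composer only syncs the dags/, plugins/, and data/ folders to the workers, so files outside those folders would need to be read with the GCS hook instead.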
2. Upload SQL Files
- Use a sql/ folder in the Composer-managed bucket (gs://your-composer-bucket/sql/) to store all of your SQL transformation scripts.
- Upload all SQL files used in your project (e.g., top_hashtags.sql, inactive_users.sql, comments_per_user.sql, etc.)
- If the sql/ folder does not exist, create it manually in the bucket.
- These files are referenced by the BigQueryInsertJobOperator tasks inside the DAG (see the sketch above).
3. Trigger and Monitor Your DAG
- Once your DAG is uploaded, go to the Airflow UI.
- Toggle the DAG to ‘on’ and click ‘Trigger DAG’.
- Watch the graph view and logs as each task (SQL file) runs.
(Screenshots: DAG run details, the graph view of the execution, and task logs showing successful SQL execution.)
- Tip: You can set do_xcom_push=True on BigQueryInsertJobOperator so that its return value (the BigQuery job ID) is pushed to XCom; a downstream task can then pull that ID and use it to read query results such as row counts or dynamic values.
- Example (a sketch is shown below):
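The following is a hedged sketch of that pattern; the DAG ID, task names, project, dataset, and table are hypothetical. The upstream task pushes its return value (the BigQuery job ID) to XCom, and a downstream PythonOperator pulls it and reads the finished job’s result through BigQueryHook.

```python
# A sketch of the XCom tip above -- project, dataset, and table names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

def report_row_count(ti, **_):
    """Pull the BigQuery job ID pushed by the upstream task and read its result."""
    job_id = ti.xcom_pull(task_ids="count_comments")
    client = BigQueryHook().get_client()
    rows = list(client.get_job(job_id).result())  # re-fetch the finished query job
    print(f"Total comments: {rows[0]['total']}")

with DAG(
    dag_id="xcom_example_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    count_comments = BigQueryInsertJobOperator(
        task_id="count_comments",
        do_xcom_push=True,  # push the operator's return value (the job ID) to XCom
        configuration={
            "query": {
                "query": "SELECT COUNT(*) AS total FROM `your_project.your_dataset.comments`",
                "useLegacySql": False,
            }
        },
    )

    report = PythonOperator(task_id="report_count", python_callable=report_row_count)

    count_comments >> report
```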
Stay tuned for Part 3, where we perform analytical queries and visualize the results using Looker Studio!