Introduction
Databricks has rebranded Jobs as Lakeflow Jobs to support modern, end-to-end workflow orchestration with DAG-based task flows, scheduling, retries, and monitoring — all in a unified interface.
Why Has the Name Changed?
Databricks renamed Databricks Jobs to Lakeflow Jobs to better reflect its modern Data Orchestration features like DAG-based task flows, scheduling, and monitoring. The new name aligns with Databricks’ Lakehouse vision, highlighting seamless pipeline Workflow Automation and data flow.
Previously, Job orchestration was handled separately from Lakeflow, with limited integration with pipeline-level features.
Now, Lakeflow Jobs offers:
- Enhanced Data Orchestration
- Metadata tracking
- Tighter integration with declarative pipelines
- Unified scheduling, monitoring, and Workflow Automation in one scalable, consistent experience
Introduction to Databricks: Unified Platform for Data & AI
Databricks is a cloud platform for Data Engineering, analytics, and AI, built on Apache Spark. It lets teams build pipelines, process big data, and run ML models — all in one place, with easy cloud integration.
A Databricks Job (now called Lakeflow Job) automates these workflows by running notebooks, SQL, or scripts on a schedule, with support for task dependencies, retries, and alerts.
What Are Lakeflow and Lakeflow Jobs, and How Do They Work?
Lakeflow is Databricks’ built-in orchestration tool that helps you automate data workflows visually or through code. You can connect notebooks, SQL, or scripts into a sequence of tasks, with built-in features like scheduling, retries, and alerts.
A Lakeflow Job is a complete pipeline built using this Databricks Lakeflow engine. It defines what runs, when it runs, and in what order — ideal for automating ETL processes, ML models, or reports. Once set up, Lakeflow Jobs run everything automatically, giving you full visibility into each step — no manual triggers or extra orchestration tools needed.
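For readers who prefer to see this in code, here is a minimal sketch of what such a job definition could look like using the Databricks SDK for Python. The notebook paths, cluster ID, email address, and task layout are placeholders for illustration, not the exact configuration used later in this post; the same job can also be built entirely in the UI, which is what the walkthrough below does.

```python
# Minimal sketch: a two-task job with a DAG dependency, a 15-minute schedule,
# retries, and a failure alert, defined via the Databricks SDK for Python.
# All paths, IDs, and addresses are placeholders; adapt them to your workspace.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads credentials from the environment or a config profile

created = w.jobs.create(
    name="election-etl",
    tasks=[
        jobs.Task(
            task_key="bronze_ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Users/you@example.com/bronze_ingest"),
            existing_cluster_id="<cluster-id>",
            max_retries=2,  # retry a failed task before failing the run
        ),
        jobs.Task(
            task_key="silver_transform",
            depends_on=[jobs.TaskDependency(task_key="bronze_ingest")],  # DAG edge: runs after bronze_ingest
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Users/you@example.com/silver_transform"),
            existing_cluster_id="<cluster-id>",
            max_retries=2,
        ),
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0/15 * * * ?",  # every 15 minutes
        timezone_id="UTC",
    ),
    email_notifications=jobs.JobEmailNotifications(on_failure=["you@example.com"]),
)
print(f"Created job {created.job_id}")
```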
End-to-End ETL Workflow Using GCP and Databricks Lakeflow Job Orchestration
I built a Lakeflow-orchestrated ETL pipeline on Databricks (GCP free trial) using modular notebooks, GCS access via a service account, Delta tables, 15-minute scheduling, retries, and email alerts.
Technologies Used:
- Google Cloud Platform (GCP): Free trial environment for cloud infrastructure
- Databricks: Compute and orchestration layer
- Apache Spark: Distributed processing engine (via Databricks)
- Google Cloud Storage (GCS): Source for input Parquet files
- Unity Catalog: Secure table and credential management
- Lakeflow Jobs (formerly Databricks Jobs): DAG-style workflow scheduling
- PySpark Notebooks: Logic for data ingestion, transformation, and output
- Delta Tables: Storage format for durable, scalable data management
While Databricks Community Edition is free, it has key limitations: no job scheduling, Unity Catalog, or Lakeflow support. For this project, I needed access to Lakeflow Jobs, email alerts, Unity Catalog, and a realistic ETL setup, all of which are included in Databricks on GCP. That’s why I chose Databricks on GCP: it’s well suited to building and automating production-grade pipelines.
Workflow Architecture:
The pipeline reads Parquet data from GCS using a service account, then processes it through modular PySpark notebooks in Databricks. Tasks are automated with a Lakeflow Job using DAG-style flow, scheduled every 15 minutes with retries and email alerts. Final outputs are stored as Delta tables in Unity Catalog.
Automating an Election Data Pipeline:
This blog covers the creation of an automated Data Pipeline in Databricks using a Lakeflow Job with DAG-style orchestration for Election Data Analytics. It reads Parquet files from Google Cloud Storage (GCS), processes them using PySpark notebooks, and stores the results as Delta tables managed through Unity Catalog.
Each task runs in sequence with DAG-based dependencies, scheduled every 15 minutes, and monitored via email alerts.
Datasets Used in This Project:
This project uses three Parquet datasets stored in Google Cloud Storage: voter demographics, voting records, and election results.
- Voter Demographics: age, gender, income, education, region.
- Voting Records: whether each voter voted, how, and when.
- Election Results: candidate, party, total votes, winner, region.
With Lakeflow Jobs in Databricks, we organize them into bronze, silver, and gold layers to uncover insights like how income or education impacts voter turnout.
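As a rough sketch of what that layering can look like in PySpark, the snippet below writes each stage as a Delta table. The catalog and schema names follow the Unity Catalog path used later in this post, while the bucket, table, and column names (for example voter_id) are illustrative assumptions rather than the project’s exact schema.

```python
# Illustrative medallion (bronze/silver/gold) layering with PySpark and Delta tables.
# Bucket, table, and column names are placeholders; adjust them to your own datasets.

# Bronze: raw voter demographics as ingested from GCS
bronze_df = spark.read.parquet("gs://your-bucket-name/voter_demographics/")
bronze_df.write.format("delta").mode("overwrite").saveAsTable(
    "db_gcp_trail.election_data.bronze_voter_demographics"
)

# Silver: de-duplicated, cleaned records
silver_df = (
    spark.table("db_gcp_trail.election_data.bronze_voter_demographics")
    .dropDuplicates(["voter_id"])          # assumed unique key
    .filter("age IS NOT NULL")
)
silver_df.write.format("delta").mode("overwrite").saveAsTable(
    "db_gcp_trail.election_data.silver_voter_demographics"
)

# Gold: an aggregate ready for analysis, e.g. voter counts by education level
gold_df = (
    spark.table("db_gcp_trail.election_data.silver_voter_demographics")
    .groupBy("education")
    .count()
)
gold_df.write.format("delta").mode("overwrite").saveAsTable(
    "db_gcp_trail.election_data.gold_voters_by_education"
)
```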
Subscribe to Databricks in GCP
GCP Console → Select project → Search “Databricks” in top bar.
Google Cloud Marketplace > GCP Databricks > Subscribe → Enter workspace name, region, and project.
After subscribing → Go to GCP Databricks sign-up → Use same Gmail as GCP project.
Login & verify → Refresh → Click Manage on Provider → Select plan → Continue.
Hit Continue to move to the GCP Databricks workspace setup page.
Step 1: Set Up Databricks Workspace
1. Create Your Databricks Workspace in the UI
Create Workspace → Name (e.g., db-gcp-trial) → Region: us-central1 → Project ID → Network: Databricks Managed VPC → IP: default/10.0.0.0/16 → Click Save.
2. Create a Folder in Workspace
Workspace tab → Users > your Gmail → Right-click → Create > Folder → Name it (e.g., GCS_Access).
3. Create a Notebook
Right-click folder → Create > Notebook → Name: gcs_auth_test → Language: Python → Select/create cluster → Click Create.
Step 2: Connect GCS to Databricks Using a Service Account
1. Set Up a GCP Service Account for Secure GCS Access
Create a GCP service account with the Storage Object Viewer role, download the JSON key, and grant it permissions to read from your GCS bucket. This enables secure read access from Databricks.
2. Upload Service Account JSON to Databricks
Data > Catalogs > [Your Catalog] → + Create Schema → Name: election_data → Use default/custom storage.
In the election_data schema → Click Upload > Browse Files → Upload the service account JSON key.
When prompted, create a managed volume named gcp_credentials to store the key file.
The file will be uploaded to /Volumes/db_gcp_trail/election_data/gcp_credentials/project-xxxx.json, which can now be referenced in your Databricks notebooks for secure GCS access.
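You can quickly confirm the upload from any notebook by listing the volume path (adjust the catalog, schema, and volume names if yours differ):

```python
# Sanity check: list the Unity Catalog volume to confirm the key file is there
display(dbutils.fs.ls("/Volumes/db_gcp_trail/election_data/gcp_credentials/"))
```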
Unity Catalog is Databricks’ unified governance layer for managing data, permissions, and metadata. It securely organizes access to tables, files, and credentials across all of your workspaces.
3. Validate GCS Access from Databricks
To enable secure access to your GCS bucket, start by configuring Spark to use your service account credentials stored in Unity Catalog. This ensures that all file operations in your notebook are properly authenticated.
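One way to do this from a notebook is sketched below: read the key file from the Unity Catalog volume and hand its fields to the GCS connector. The Hadoop keys shown are the ones documented for key-based GCS access on Databricks, but treat the notebook-level approach as an assumption; depending on your cluster setup, you may need to set the same values once in the cluster’s Spark config (as spark.hadoop.* properties) instead.

```python
import json

# Service-account key uploaded to the Unity Catalog volume in the previous step
# (file name is a placeholder)
key_path = "/Volumes/db_gcp_trail/election_data/gcp_credentials/project-xxxx.json"

with open(key_path) as f:
    key = json.load(f)

# GCS connector settings for key-based authentication
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("google.cloud.auth.service.account.enable", "true")
hconf.set("fs.gs.project.id", key["project_id"])
hconf.set("fs.gs.auth.service.account.email", key["client_email"])
hconf.set("fs.gs.auth.service.account.private.key.id", key["private_key_id"])
hconf.set("fs.gs.auth.service.account.private.key", key["private_key"])
```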
Run dbutils.fs.ls("gs://your-bucket-name/") → Lists files if GCS access is successful.
4. Load All Parquet Files from GCS: Load the Parquet files from GCS into DataFrames using spark.read.parquet() and preview them with .show().
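For example (bucket and folder names are placeholders for wherever your Parquet files live):

```python
# Load the three Parquet datasets from GCS into DataFrames (paths are placeholders)
voters_df  = spark.read.parquet("gs://your-bucket-name/voter_demographics/")
records_df = spark.read.parquet("gs://your-bucket-name/voting_records/")
results_df = spark.read.parquet("gs://your-bucket-name/election_results/")

# Preview a few rows of each dataset
voters_df.show(5)
records_df.show(5)
results_df.show(5)
```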