Building an ETL Pipeline with BigQuery and Dataflow

A comprehensive guide to the data ETL process using GCP services


Overview

Following our previous post about BigQuery, let’s explore how to build an ETL (Extract, Transform, Load) pipeline using BigQuery and Dataflow. Our workflow will follow this path:

MongoDB → BigQuery → Google Sheets


GCP Resources Used

The pipeline uses BigQuery, Cloud Dataflow, Cloud Functions, Cloud Scheduler, and Cloud Storage, with Google Sheets as the final destination.

What is Google Cloud Dataflow?

Google Cloud Dataflow is a fully managed streaming and batch data processing service provided by GCP.

It is built on Apache Beam and provides a managed environment for building, deploying, and running data processing pipelines for both real-time (streaming) and batch use cases.
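
Pipelines are expressed with the Beam SDK. As a minimal illustration (not code from this post's repositories), a batch pipeline in Python that reads text files from Cloud Storage and writes rows to BigQuery might look like this; the project, bucket, and table names are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: replace with your own project, bucket, and table.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="asia-northeast3",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.json")
        | "Parse" >> beam.Map(lambda line: {"raw": line})
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.raw_events",
            schema="raw:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )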

Key Features

- Unified programming model for batch and streaming pipelines through the Apache Beam SDK
- Fully managed execution with automatic worker provisioning, horizontal autoscaling, and dynamic work rebalancing
- Native integration with GCP services such as BigQuery, Cloud Storage, and Pub/Sub

What is Google Cloud Functions?

Google Cloud Functions (GCF) is a serverless computing service that allows developers to run code triggered by cloud events without managing servers.
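
Each step in this post's workflow is packaged as a function with a main.py entry point. Stripped down to its skeleton, and assuming an HTTP trigger (which is how Cloud Scheduler can invoke it), such a function using the functions-framework package looks roughly like this; the entry point name and response are illustrative:

import functions_framework


@functions_framework.http
def main(request):
    # Entry point that Cloud Scheduler (or any HTTP client) can invoke.
    # The real functions perform the MongoDB -> BigQuery or BigQuery -> Sheets step here.
    return "ok", 200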

Key Features

- Event-driven execution triggered by HTTP requests, Pub/Sub messages, Cloud Storage events, and more
- No servers to provision or manage, with automatic scaling based on load
- Pay only for invocations and the compute time actually consumed

What is Google Cloud Scheduler?

Cloud Scheduler is a fully managed cron job scheduling service: it works like Linux cron, but instead of running local commands it triggers cloud targets such as HTTP endpoints (including Cloud Functions), Pub/Sub topics, and App Engine services.
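
In this pipeline the jobs are created through IaC (see below), but for illustration, creating an equivalent job with the google-cloud-scheduler client library looks roughly like this; the project, region, schedule, and target URL are placeholders:

from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = "projects/my-project/locations/asia-northeast3"

job = scheduler_v1.Job(
    name=f"{parent}/jobs/mongodb-to-bigquery-daily",
    schedule="0 1 * * *",  # every day at 01:00
    time_zone="Asia/Seoul",
    http_target=scheduler_v1.HttpTarget(
        uri="https://REGION-PROJECT.cloudfunctions.net/mongodb-to-bigquery",
        http_method=scheduler_v1.HttpMethod.POST,
    ),
)

client.create_job(parent=parent, job=job)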

Key Features

- Standard Unix cron syntax with time zone support
- Configurable retry policies and deadlines for failed executions
- Execution history and logging through Cloud Logging

ETL Workflow Implementation

The implementation follows this sequence:

flowchart TD
    Z["Enable API Service"] --> A["Create BigQuery Service Account and Grant Permission"]
    A --> B["Create Google Sheet and Grant Permission"]
    B --> C["Create BigQuery Dataset"]
    C --> D["Create Storage for Cloud-function"]
    D --> E1["mongodb → bigquery table workflow"]
    E1 --> F1["Compress Zip file (source code)"]
    E1 --> G1["Upload Zip file (source code) to bucket"]
    E1 --> H1["Create cloud function"]
    E1 --> I1["Create cloud scheduler"]
    D --> E2["bigquery table → googlesheet workflow"]
    E2 --> F2["Compress Zip file (source code)"]
    E2 --> G2["Upload Zip file (source code) to bucket"]
    E2 --> H2["Create cloud function"]
    E2 --> I2["Create cloud scheduler"]
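
The heart of the first branch (mongodb → bigquery table workflow) is a Cloud Function that extracts documents from MongoDB and loads them into a BigQuery table. A simplified sketch of what such a function could look like, assuming pymongo and google-cloud-bigquery; the environment variables, database, collection, and table names are placeholders rather than the repository's actual code:

import os

import functions_framework
from google.cloud import bigquery
from pymongo import MongoClient


@functions_framework.http
def main(request):
    """Extract documents from MongoDB and load them into BigQuery."""
    # Connection string and destination table come from environment variables (placeholders).
    mongo = MongoClient(os.environ["MONGODB_URI"])
    docs = list(mongo["game_db"]["users"].find({}, {"_id": 0}))  # drop non-serializable ObjectId

    bq = bigquery.Client()
    table_id = os.environ["BQ_TABLE"]  # e.g. "my-project.my_dataset.users"
    load_job = bq.load_table_from_json(
        docs,
        table_id,
        job_config=bigquery.LoadJobConfig(
            autodetect=True,
            write_disposition="WRITE_TRUNCATE",
        ),
    )
    load_job.result()  # wait for the load to complete

    return f"Loaded {len(docs)} rows into {table_id}", 200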


Infrastructure as Code (IaC) Implementation

The pipeline can be built using two IaC tools:

Terraform Implementation

Source code available at: terraform-infra-gcp

Directory structure and implementation details are available in the repository's README.
somaz-bigquery-project/
├── README.md
├── bigquery-to-google-sheet
│   ├── bigquery.json
│   ├── main.py
│   ├── main.py.all_data
│   ├── main.py.date
│   ├── main.py.single_db
│   ├── main.py.time_UTC
│   └── requirements.txt
├── bigquery.tfvars
├── cloud-storage.tf
├── locals.tf
├── mongodb-bigquery-googlesheet-workflow.tf
├── mongodb-to-bigquery
│   ├── MongoDB_to_BigQuery
│   ├── main.py
│   ├── main.py.local
│   ├── main.py.single_db
│   └── requirement.txt
├── provider.tf
├── terraform-backend.tf
└── variables.tf
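
The bigquery-to-google-sheet directory ships several main.py variants (all data, by date, single DB, UTC time) for different export scopes. As a rough sketch of the export step itself, assuming the gspread library and a service-account key like the bigquery.json listed above; the query, spreadsheet key, and worksheet name are placeholders:

import gspread
from google.cloud import bigquery

# Query BigQuery (project, dataset, and table are placeholders).
bq = bigquery.Client()
rows = bq.query("SELECT * FROM `my-project.my_dataset.users`").result()
values = [list(row.values()) for row in rows]

# Write the result into a Google Sheet shared with the service account.
gc = gspread.service_account(filename="bigquery.json")
worksheet = gc.open_by_key("SPREADSHEET_ID").worksheet("Sheet1")
worksheet.clear()
worksheet.append_rows(values, value_input_option="USER_ENTERED")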

Pulumi Implementation

Source code available at: pulumi-study

Both implementations follow the same resource creation order and can be customized using different versions of main.py for different use cases.
bigdata-flow-functions/
├── Pulumi.yaml
├── README.md
├── __main__.py
├── bigquery-to-google-sheet
│   ├── bigquery.json
│   ├── main.py
│   ├── main.py.all_data
│   ├── main.py.date
│   ├── main.py.single_db
│   ├── main.py.time_UTC
│   └── requirements.txt
├── bq_dataset.py
├── bq_sheet_archive.py
├── bq_sheet_function.py
├── config.py
├── mdb_bq_archive.py
├── mdb_bq_function.py
├── mongodb-to-bigquery
│   ├── main.py
│   ├── main.py.local
│   ├── main.py.single_db
│   └── requirements.txt
├── requirements.txt
├── scheduler_manager.py
├── storage.py
└── utils.py
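
The function modules mirror the workflow shown earlier: compress the source directory, upload the archive to the bucket, and create the Cloud Function. A condensed, illustrative sketch of that chain with pulumi_gcp; the resource names, region, and runtime are placeholders, not the repository's exact code:

import pulumi
import pulumi_gcp as gcp

# Bucket that stores the zipped Cloud Function source.
bucket = gcp.storage.Bucket("cloud-function-source", location="ASIA-NORTHEAST3")

# Zip and upload the mongodb-to-bigquery source directory.
archive = gcp.storage.BucketObject(
    "mongodb-to-bigquery-archive",
    bucket=bucket.name,
    source=pulumi.FileArchive("./mongodb-to-bigquery"),
)

# HTTP-triggered Cloud Function that runs the extract/load step.
function = gcp.cloudfunctions.Function(
    "mongodb-to-bigquery",
    runtime="python310",
    entry_point="main",
    source_archive_bucket=bucket.name,
    source_archive_object=archive.name,
    trigger_http=True,
    available_memory_mb=256,
)

pulumi.export("function_url", function.https_trigger_url)

A Cloud Scheduler job, managed in scheduler_manager.py, can then target the exported function URL on a cron schedule.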


