Airflow Data Ingestion Pipeline

Architecture

Tasks

  • Design overall data pipeline architecture
  • Define and configure Airflow DAG for orchestration
    • Extract and load raw taxi data to Amazon S3
    • Transform raw data into structured format
    • Convert transformed data to Delta format
    • Persist transformed data to PostgreSQL
  • Configure Trino to connect to Delta Lake on S3
  • Manage infrastructure with Terraform modules
    • Provision Amazon S3 bucket
    • Provision EC2 instance
  • Set up CI for Pull Requests (e.g., GitHub Actions)
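The four steps nested under the Airflow DAG item read as a chain, each consuming the output of the one before. A minimal sketch of that ordering follows, assuming the chain is strictly linear (the list does not say so explicitly). It uses the standard-library `graphlib` module rather than Airflow itself, and the task names are hypothetical, not taken from the repository.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
# The strictly linear chain is an assumption based on the order of the task list.
DEPENDENCIES = {
    "extract_load_raw_to_s3": set(),
    "transform_raw": {"extract_load_raw_to_s3"},
    "convert_to_delta": {"transform_raw"},
    "persist_to_postgres": {"convert_to_delta"},
}


def execution_order(deps):
    """Return the task names in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(DEPENDENCIES), start=1):
        print(step, task)
```

In an actual DAG file the same ordering would typically be declared with Airflow's own dependency syntax between operators; the sketch only makes the intended sequence explicit.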

Pipeline

Prerequisites

Set Up Infrastructure

1. Set Up Python Environment

make install

Troubleshooting

    - ./.env:/opt/airflow/.env
    ~> dotenv_path = Path(__file__).resolve().parent.parent.parent / ".env"
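The two lines above appear to pair a container volume mount with the Python expression that locates the mounted file. A minimal sketch of how that expression resolves, assuming the calling module sits two directories below `/opt/airflow` (the path `/opt/airflow/dags/pipeline/config.py` is hypothetical). It uses `PurePosixPath` so it runs anywhere, whereas the original calls `.resolve()` on a real file.

```python
from pathlib import PurePosixPath


def dotenv_path_for(module_file: str) -> PurePosixPath:
    """Return the .env path three directory levels above module_file."""
    return PurePosixPath(module_file).parent.parent.parent / ".env"


if __name__ == "__main__":
    # Hypothetical module location inside the container (an assumption).
    print(dotenv_path_for("/opt/airflow/dags/pipeline/config.py"))
```

If the module lives at a different depth, the number of `.parent` hops no longer lands on `/opt/airflow`, and the mounted `.env` is not found.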

s3fs module not found

pip uninstall aiobotocore
pip install --upgrade botocore boto3 s3fs
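After reinstalling, it can help to confirm that the interpreter actually sees the packages. A small sketch using only the standard library follows; the helper name `missing_modules` is hypothetical, and the check only confirms that each module can be located, not that the installed versions are compatible with each other.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["botocore", "boto3", "s3fs"])
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("botocore, boto3 and s3fs are importable")
```

Run it with the same Python interpreter that Airflow uses, since a package installed into a different environment will still be reported as missing.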

References