This project demonstrates an end-to-end data engineering pipeline combining:
- Batch processing
- Real-time streaming (Kafka)
- ELT transformations (dbt)
- Orchestration (Airflow)
- Python – ETL scripts & Airflow DAGs
- PostgreSQL – OLTP and raw data storage
- Kafka – Real-time event streaming
- dbt – Transformations, incremental models, and tests
- Airflow – Orchestration of batch & streaming pipelines
- Docker – Containerized infrastructure
┌─────────────┐
│ Source DB │ ← optional initial batch ingestion
└─────┬───────┘
│
▼
┌─────────────┐
│ Batch ETL │ ← Airflow DAG task: batch_ingestion
│ raw_user_activity table
└─────┬───────┘
│
▼
┌─────────────┐
│ Kafka │ ← Producer generates events
│ Topic: │
│ user_activity
└─────┬───────┘
│
▼
┌─────────────┐
│ Kafka │ ← Consumer reads from topic
│ Consumer │
│ Writes to │
│ raw_user_activity table
└─────┬───────┘
│
▼
┌─────────────┐
│ dbt Models │
│ staging → │
│ intermediate → marts
└─────┬───────┘
│
▼
┌─────────────┐
│ dbt Tests │
│ Data quality│
└─────────────┘
DAG Name: telecom_pipeline
Execution Order:
batch_ingestion → kafka_producer → kafka_consumer → dbt_run → dbt_test
-
Batch Ingestion (
batch_ingestion)- Pulls initial data from source or CSV
- Inserts into
raw_user_activitytable in PostgreSQL
-
Kafka Producer (
kafka_producer)- Generates synthetic user events (calls, SMS, internet usage)
- Publishes messages to Kafka topic
user_activity
-
Kafka Consumer (
kafka_consumer)- Reads messages from Kafka topic
user_activity - Inserts events into
raw_user_activitytable in PostgreSQL
- Reads messages from Kafka topic
-
DBT Transformations (
dbt_run)- Staging:
stg_user_activity - Intermediate:
int_user_activity - Marts:
mrt_user_metrics
- Staging:
-
DBT Tests (
dbt_test)- Validates not-null and unique constraints
- Ensures transformed data quality
- PostgreSQL schemas:
raw,staging,intermediate,marts - Airflow metadata DB uses a separate Postgres instance:
airflow_postgres
- Topic:
user_activity - Zookeeper & Kafka run in Docker alongside Airflow
- Configured in
.envfile - Loaded in DAG and ETL scripts via
python-dotenv
- PostgreSQL volumes:
pgdata,airflow_pgdataensure schemas survive Docker restarts - Do not run
docker-compose down -vif you want to preserve data
