My Data Engineering Roadmap as a Data Science Student

19 October 2025 · Karhomatul Faqih Al Amin · 5 min read
Image credit: Photo by Wes Hicks on Unsplash

After sharing my motivation for pursuing Data Engineering, I wanted to turn that passion into a structured plan, one that connects what I learn in university, online courses, and real-world practice.


Overview

I designed this roadmap to run alongside the eight semesters of my Data Science degree, combining theory from coursework with technical depth from self-learning. Each phase focuses on a specific layer of data engineering, from local experimentation to production-grade cloud systems.

This roadmap isn’t just a checklist; it’s a direction map, flexible enough to adapt as I grow but clear enough to keep me focused.

As a Data Science student, my goal is to build solid foundations first, automate second, and scale last.


Roadmap Table

Foundation (Phase 1-2) ✅

| Phase | Phase 1 - Foundation Basics (2 months) | Phase 2 - Local ETL & Modeling (2 months) |
| --- | --- | --- |
| Status | ✅ Completed | ✅ Completed |
| Main Focus | Python & SQL Fundamentals | Databases & Local Data Handling |
| Learning Resources | Cisco (Python Essentials 1, Linux Unhatched), Codecademy (Learn SQL), Codedex (Learn Git & GitHub) | Databricks Academy, pandas docs, PostgreSQL tutorial, The Data Warehouse Toolkit (3rd Edition) by Ralph Kimball |
| Key Skills & Topics | Basic statistics, Python basics, SQL fundamentals, Intro to Git & Linux | CSV/JSON ingestion, Pandas ETL, Bash automation, Data modeling, Normalization, Basic ETL patterns, Makefile-based orchestration |
| Tools / Platform | Python, Jupyter Notebook, GitHub, Linux Shell, PostgreSQL/MySQL | PostgreSQL/MySQL, GitHub, Pandas & NumPy, Bash CLI |
| Project / Portfolio Output | Project 1: Movie Data ETL Pipeline | Project 2: E-commerce Data Pipeline |
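
To make the Phase 2 topics a bit more concrete, here is a minimal sketch of a local extract-transform-load flow. It is not the code from Project 1 or Project 2: the sample data, the `movies` table, and the function names are all hypothetical, and it uses only the standard library, with SQLite standing in for PostgreSQL/MySQL and plain `csv` parsing standing in for pandas.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a raw movie CSV export.
RAW_CSV = """title,year,rating
Alpha,2001,7.5
Beta,,6.0
Gamma,2010,not_rated
Delta,2015,8.2
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Keep only rows whose year parses as int and rating parses as float."""
    clean = []
    for row in rows:
        try:
            clean.append(
                (row["title"].strip(), int(row["year"]), float(row["rating"]))
            )
        except ValueError:
            continue  # skip rows with missing or malformed values
    return clean


def load(conn, records):
    """Create the target table if needed and upsert records by title."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies "
        "(title TEXT PRIMARY KEY, year INTEGER, rating REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO movies (title, year, rating) VALUES (?, ?, ?)",
        records,
    )
    conn.commit()


def run_pipeline(conn, text):
    """Run extract -> transform -> load and return the row count in the target."""
    load(conn, transform(extract(text)))
    return conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(connection, RAW_CSV))
```

Because the load step upserts on a primary key, running the pipeline twice on the same file leaves the table unchanged, which is the property that makes a local pipeline safe to re-run.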

Cloud (Phase 3-4) ⏳

| Phase | Phase 3 - Cloud Infrastructure & Serverless Pipelines (4 months) | Phase 4 - Big Data & Advanced Orchestration (5 months) |
| --- | --- | --- |
| Status | ⏳ In Progress | 🗓️ Planned |
| Main Focus | Cloud infra, Containerized workloads, Event-driven pipelines | Big data, Cloud ETL & Basic streaming |
| Learning Resources | | |
| Key Skills & Topics | IAM (least privilege), S3 (raw/staging zones), VPC & networking basics, Dockerized Python apps, Terraform IaC, Lambda execution model, SQS-based decoupling, Idempotent ETL, CI-ready infra | Spark DataFrames, Distributed processing, Glue ETL workflows, Redshift integration, dbt SQL modeling, Airflow DAG orchestration, Data quality frameworks |
| Tools / Platform | AWS S3, IAM, VPC, Docker, Python, Terraform, AWS Lambda, SQS, GitHub Actions (basic) | Apache Spark (PySpark), AWS Glue, Redshift, Airflow/Step Functions, Kafka/Kinesis, dbt |
| Project / Portfolio Output | Project 3: End-to-end cloud data pipeline (Docker + Lambda) deployed fully via Terraform | Project 4: Cloud ETL, raw → Glue → Redshift/S3 + basic streaming ingestion + data quality checks |
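
Two of the Phase 3 topics, SQS-based decoupling and idempotent ETL, go together: standard SQS queues deliver messages at least once, so a consumer may see the same message twice. The sketch below shows one way a Lambda-style handler might guard against that. It is an illustration only, not the Project 3 implementation. The event shape (`Records`, `messageId`, `body`) follows the SQS event format that Lambda receives, but the `order_id` payload field and the in-memory `PROCESSED_IDS` and `STAGING` stores are assumptions; a real deployment would need durable stores, since in-memory state does not reliably persist across invocations.

```python
import json

# In-memory stand-ins for real services (assumptions for illustration):
# PROCESSED_IDS would be a durable store such as a database table,
# and STAGING would be an S3 staging zone.
PROCESSED_IDS = set()
STAGING = {}


def handler(event, context=None):
    """Process an SQS-style batch, skipping messages that were already handled."""
    written = 0
    for record in event.get("Records", []):
        message_id = record["messageId"]
        if message_id in PROCESSED_IDS:
            continue  # duplicate delivery: nothing to do
        payload = json.loads(record["body"])
        # Keyed write: the same order_id overwrites its entry instead of
        # appending a second copy.
        STAGING[payload["order_id"]] = payload
        PROCESSED_IDS.add(message_id)
        written += 1
    return {"written": written}


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"messageId": "m-1", "body": json.dumps({"order_id": 1, "amount": 10})},
            {"messageId": "m-2", "body": json.dumps({"order_id": 2, "amount": 25})},
        ]
    }
    print(handler(sample_event))  # first delivery
    print(handler(sample_event))  # simulated redelivery of the same batch
```

The handler relies on two layers of protection: the message-ID check skips redelivered messages, and the keyed write means that even a repeated write would not create duplicates.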

Advanced (Phase 5-6) 🗓️

| Phase | Phase 5 - DevOps & Production (4 months) | Phase 6 - Lakehouse & Capstone (8 months) |
| --- | --- | --- |
| Status | 🗓️ Planned | 🗓️ Planned |
| Main Focus | CI/CD, Observability & Container orchestration | Lakehouse, Advanced streaming & ML integration |
| Learning Resources | | |
| Key Skills & Topics | Infrastructure automation, Secret management, GitHub Actions pipelines, Test workflows, Monitoring & logging, Container orchestration (K8s) | Delta Lake/Hudi, ACID ingestion, Spark optimization, Incremental loads, Real-time streaming (Flink/Spark Streaming), Data governance, Feature stores, ML pipeline integration |
| Tools / Platform | Terraform, GitHub Actions, CloudWatch, Docker, Kubernetes | Databricks, Delta Lake, Flink/Spark Streaming, Great Expectations, MLflow, Terraform, Airflow, Power BI/Metabase |
| Project / Portfolio Output | Project 5: Production-grade ETL with CI/CD, tests, monitoring, containerized deployment | Capstone: Full data platform (ingestion → processing → catalog → quality → dashboard + ML pipeline) |
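
Of the Phase 6 topics, incremental loading is the easiest to sketch without any cluster. The idea is to remember a high-water mark (the latest `updated_at` seen so far) and copy only rows newer than it on each run. The example below is a plain-Python illustration under assumed names: `incremental_load`, the `id` and `updated_at` fields, and the dict used as a target table are all hypothetical, and a lakehouse table format would typically handle the upsert step with a merge operation instead.

```python
from datetime import datetime


def incremental_load(source_rows, target, watermark):
    """Upsert rows updated after `watermark` into `target`.

    Returns the new watermark and the number of rows loaded.
    """
    loaded = 0
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = row  # upsert by primary key
            loaded += 1
            if row["updated_at"] > new_watermark:
                new_watermark = row["updated_at"]
    return new_watermark, loaded


if __name__ == "__main__":
    source = [
        {"id": 1, "updated_at": datetime(2026, 1, 1), "status": "new"},
        {"id": 2, "updated_at": datetime(2026, 1, 2), "status": "new"},
    ]
    target = {}

    # First run: everything is newer than the initial watermark.
    mark, count = incremental_load(source, target, datetime.min)
    print(count)

    # A later change to row 1 arrives; only that row is picked up.
    source.append({"id": 1, "updated_at": datetime(2026, 1, 3), "status": "shipped"})
    mark, count = incremental_load(source, target, mark)
    print(count, target[1]["status"])
```

One trade-off worth noting: the strict `>` comparison avoids reloading rows, but it can miss a late-arriving row whose timestamp equals the stored watermark, so real pipelines often add an overlap window and rely on the upsert to absorb the repeats.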

How I’ll Use This Roadmap

  • As a compass, to keep me aligned with my long-term goal: becoming a professional Data Engineer.

  • As a progress tracker, to document what I’ve learned and what still needs improvement.

  • As content inspiration, for future blog posts and portfolio updates.

This roadmap isn’t final; it’s something I’ll refine every semester as I gain experience through university projects, certifications, and my ongoing work in the data industry.


What’s Next

Phase 1 and Phase 2 are complete. Together they cover foundational skills and local ETL workflows, from basic Python and SQL to structured data modeling and reproducible pipelines.

I am currently working through Phase 3, which focuses on Cloud Infrastructure & Serverless Pipelines. This phase marks the transition from local environments to cloud-native systems, with an emphasis on infrastructure-as-code, containerized workloads, and event-driven data pipelines.

At the end of Phase 3, I plan to write up what I learned about cloud infrastructure design, deployment trade-offs, and moving ETL workloads into a serverless environment.


Roadmap Update Log

v2.2 Phase Progress & Vertical Layout Update (February 2026)

  • Roadmap structure adjusted: Phase 3 expanded to include production-ready orchestration foundations (total phases reduced to 6).
  • Updated roadmap status: Phase 2 marked as Completed, Phase 3 marked as In Progress.
  • Refactored roadmap tables into a vertical layout for better readability in blog and document formats.
  • Updated What’s Next section to reflect current execution focus on cloud infrastructure.

v2.1 Phase 2 Learning Resources Update (December 2025)

  • Changed and refined Phase 1 status and documentation.
  • Added Phase 2 learning resources.
  • Updated What’s Next section.

v2.0 Major Revision (December 2025)

  • Roadmap structure changed: semester-based → phase-based (7 phases).
  • Cloud section updated: combined Cloud Foundations & Serverless Pipelines.
  • Big Data phase revised: PySpark, dbt, and Airflow integrated into a single project.
  • Added Workload Management & Burnout Prevention guidance.
  • Capstone scope clarified: full data platform + ML integration.

v1.1 Tooling Update (November 2025)

  • Introduced Terraform basics for early IaC exposure.
  • Added Docker containerization in Phase 3.

v1.0 Initial Structure (October 2025)

  • First version of the roadmap published.
  • Added foundation milestones and basic Python/SQL projects.

Author: Karhomatul Faqih Al Amin, Data Engineer Learner

I am a Data Engineer learner with a strong interest in data pipelines, ETL processes, and scalable data systems. I am currently pursuing an undergraduate degree in Data Science and focus on building practical projects using Python, SQL, and modern data engineering tools. My learning journey emphasizes hands-on implementation, reproducibility, and aligning academic foundations with real-world data engineering needs.