My Data Engineering Roadmap as a Data Science Student


19 October 2025 · 4 min read
Karhomatul Faqih Al Amin
Image credit: Photo by Wes Hicks on Unsplash

After sharing my motivation for pursuing Data Engineering, I wanted to put that passion into a structured plan: something that connects what I learn in university, online courses, and real-world practice.


Overview

I designed this roadmap to run alongside the eight semesters of my Data Science degree, organized into seven phases that combine theory from coursework with technical depth from self-learning. Each phase focuses on a specific layer of data engineering, from local experimentation to production-grade cloud systems.

This roadmap isn’t just a checklist; it’s a direction map: flexible enough to adapt as I grow, but clear enough to keep me focused.

As a Data Science student, my goal is to build solid foundations first, automate second, and scale last.


Roadmap

Foundation (Phase 1-2)

Phase 1: Foundation Basics (2 months)

  • Status: ✅ Completed
  • Main Focus: Python & SQL Fundamentals
  • Learning Resources: Cisco (Python Essentials 1, Linux Unhatched), Codecademy (Learn SQL), Codedex (Learn Git & GitHub)
  • Key Skills & Topics: basic statistics, Python basics, SQL fundamentals, intro to Git & Linux
  • Tools / Platform: Python, Jupyter Notebook, GitHub, Linux shell, PostgreSQL/MySQL
  • Project / Portfolio Output: Project 1 (basic Python scripting + a simple SQL database & exploratory analysis)
Phase 2: Local ETL & Modeling (2 months)

  • Status: ⏳ In Progress
  • Main Focus: Databases & Local Data Handling
  • Learning Resources: Databricks Academy (Data Modeling Strategies; Intro to Python for DS & DE; SQL Programming & Procedural Logic; Get Started with SQL Analytics & BI; Databricks SQL Performance Best Practices), pandas docs, PostgreSQL tutorial, MySQL docs, IBM (Linux Commands & Shell Scripting)
  • Key Skills & Topics: CSV/JSON ingestion, pandas ETL, Bash automation, ERD design, normalization, basic ETL patterns
  • Tools / Platform: PostgreSQL/MySQL, GitHub, pandas & NumPy, Bash CLI
  • Project / Portfolio Output: Project 2 (local ETL: ingest → clean → store in a SQL database → export to CSV, with proper data modeling)
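To make Project 2 concrete, here is a minimal sketch of the ingest → clean → store → export flow using pandas and SQLAlchemy. The paths, table name, column names, and connection string are all placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Ingest: read the raw CSV (path is a placeholder)
raw = pd.read_csv("data/raw/sales.csv")

# Clean: normalize column names, drop duplicates, parse dates
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
clean = raw.drop_duplicates()
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")

# Store: load into a SQL database (connection string is an assumption;
# swap in your own PostgreSQL/MySQL credentials)
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/etl_demo")
clean.to_sql("sales", engine, if_exists="replace", index=False)

# Export: write the modeled table back out to CSV
pd.read_sql("SELECT * FROM sales", engine).to_csv("data/processed/sales.csv", index=False)
```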

Cloud/Pipeline (Phase 3-5)

Phase 3: Cloud Foundations (4 months)

  • Main Focus: Cloud Basics, Infrastructure Setup & Containerization
  • Key Skills & Topics: IAM fundamentals, object storage (S3/GCS), networking basics (VPC/subnets), advanced Bash, Dockerizing environments, IaC fundamentals
  • Tools / Platform: AWS S3, IAM, Boto3, Bash, Docker, Terraform
  • Project / Portfolio Output: Project 3 (automated data backup to the cloud using Python/Bash, deployed via a Docker container)
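The heart of Project 3 is a short Boto3 upload script. A minimal sketch, assuming credentials come from environment variables or an IAM role; the bucket name and directory are hypothetical:

```python
import datetime
from pathlib import Path

import boto3

s3 = boto3.client("s3")  # credentials resolved from env vars or an IAM role
BUCKET = "my-backup-bucket"  # hypothetical bucket name

def backup_directory(local_dir: str) -> None:
    """Upload every file in local_dir to S3 under a date-stamped prefix."""
    stamp = datetime.date.today().isoformat()
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            key = f"backups/{stamp}/{path.relative_to(local_dir).as_posix()}"
            s3.upload_file(str(path), BUCKET, key)

backup_directory("data/processed")
```

Packaged in a Docker image and run on a schedule, this covers the automated-backup part of the project.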
Phase 4: Serverless & IaC Pipelines (4 months, in parallel with Phase 3)

  • Main Focus: Serverless Infrastructure & Basic Pipelines
  • Key Skills & Topics: event-driven processing, Lambda functions, queue-based workflows, micro-ETL orchestration, IaC deployment pipelines
  • Tools / Platform: AWS Lambda, SQS, Terraform, GitHub
  • Project / Portfolio Output: Project 4 (serverless data pipeline deployed via Terraform)
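The core of Project 4 would be a small Lambda handler draining an SQS queue. A minimal sketch, assuming JSON message bodies and that Terraform provisions the queue and the event source mapping (the payload fields are hypothetical):

```python
import json

def handler(event, context):
    """Process a batch of SQS messages delivered to this Lambda."""
    for record in event["Records"]:
        payload = json.loads(record["body"])  # assumes JSON message bodies
        # Micro-ETL step: transform and forward the record (placeholder logic)
        print(f"processing item {payload.get('id')}")
    return {"processed": len(event["Records"])}
```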
Phase 5: Big Data & Orchestration (5 months)

  • Main Focus: Big Data, Cloud ETL & Basic Streaming
  • Key Skills & Topics: Spark DataFrames, distributed processing, Glue ETL workflows, Redshift integration, dbt SQL modeling, Airflow DAG orchestration, data quality frameworks
  • Tools / Platform: Apache Spark (PySpark), AWS Glue, Redshift, Airflow/Step Functions, Kafka/Kinesis, dbt
  • Project / Portfolio Output: Project 5 (cloud ETL: raw → Glue → Redshift/S3, plus basic streaming ingestion and data quality checks)
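On the orchestration side of Project 5, an Airflow DAG ties extract, transform, and load together. A minimal sketch; the task bodies are placeholders, and the schedule argument assumes Airflow 2.4+ (older releases use schedule_interval):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():  # placeholder: pull raw data into S3
    ...

def transform():  # placeholder: trigger the Glue / Spark job
    ...

def load():  # placeholder: load results into Redshift
    ...

with DAG(
    dag_id="cloud_etl",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # requires Airflow 2.4+
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
```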

Advanced/Production (Phase 6-7)

Phase 6: DevOps & Production (4 months)

  • Main Focus: CI/CD, Observability & Container Orchestration
  • Key Skills & Topics: infrastructure automation, secret management, GitHub Actions pipelines, test workflows, monitoring & logging, container orchestration (Kubernetes)
  • Tools / Platform: Terraform, GitHub Actions, CloudWatch, Docker, Kubernetes
  • Project / Portfolio Output: Project 6 (production-grade ETL with CI/CD, tests, monitoring, and containerized deployment)
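For the test-workflow part of Phase 6, even two or three pytest checks wired into GitHub Actions catch a lot. A sketch, assuming a pandas-based transform like the one in Project 2 (clean_orders and its columns are hypothetical names):

```python
import pandas as pd

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the pipeline's transform step."""
    out = df.drop_duplicates()
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    return out

def test_no_duplicate_rows():
    df = pd.DataFrame({"order_date": ["2025-01-01", "2025-01-01"], "id": [1, 1]})
    assert clean_orders(df).shape[0] == 1

def test_dates_are_parsed():
    df = pd.DataFrame({"order_date": ["2025-01-01"], "id": [1]})
    assert pd.api.types.is_datetime64_any_dtype(clean_orders(df)["order_date"])
```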
Phase 7: Lakehouse & Capstone (8 months)

  • Main Focus: Lakehouse, Advanced Streaming & ML Integration
  • Key Skills & Topics: Delta Lake/Hudi, ACID ingestion, Spark optimization, incremental loads, real-time streaming (Flink/Spark Streaming), data governance, feature stores, ML pipeline integration
  • Tools / Platform: Databricks, Delta Lake, Flink/Spark Streaming, Great Expectations, MLflow, Terraform, Airflow, Power BI/Metabase
  • Project / Portfolio Output: Capstone (full data platform: ingestion → processing → catalog → quality → dashboard, plus an ML pipeline)
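For the lakehouse layer of the capstone, the incremental-load pattern looks roughly like this in PySpark with Delta Lake. A sketch under assumptions: the Spark session must be launched with the delta-spark package configured, and the bucket paths and event_id column are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes Spark was launched with delta-spark and its SQL extensions enabled
spark = SparkSession.builder.appName("lakehouse-ingest").getOrCreate()

# Incremental load: read today's raw batch (path is a placeholder)
batch = spark.read.json("s3://my-lake/raw/events/2025-10-19/")

# ACID append into the Delta table backing the silver layer
(batch
    .dropDuplicates(["event_id"])
    .write
    .format("delta")
    .mode("append")
    .save("s3://my-lake/silver/events/"))
```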

How I’ll Use This Roadmap

  • As a compass, to keep me aligned with my long-term goal: becoming a professional Data Engineer.

  • As a progress tracker, to document what I’ve learned and what still needs improvement.

  • As content inspiration, for future blog posts and portfolio updates.

This roadmap isn’t final; it’s something I’ll refine every semester as I gain experience through university projects, certifications, and my ongoing work in the data industry.


What’s Next

Phase 1 is complete.
Phase 2 is now underway, focusing on local ETL & data modeling: ingesting, cleaning, modeling, and storing data locally in a structured, reproducible way.

At the end of Phase 2, I’ll write a reflection post on what worked, what broke, and what I learned while moving from a simple pipeline to more structured local ETL workflows.


Roadmap Update Log

v2.1 — Phase 2 Learning Resources Update (December 2025)

  • Changed and refined Phase 1 status and documentation.

  • Added Phase 2 learning resources.

  • Updated the What’s Next section.

v2.0 — Major Revision (December 2025)

  • Roadmap structure changed: semester-based → phase-based (7 phases).

  • Cloud section updated: combined Cloud Foundations & Serverless Pipelines.

  • Big Data phase revised: PySpark, dbt, and Airflow integrated into a single project.

  • Added Workload Management & Burnout Prevention guidance.

  • Capstone scope clarified: full data platform + ML integration.

v1.2 — Tooling Update (November 2025)

  • Introduced Terraform basics for early IaC exposure.

  • Added Docker containerization in Phase 3.

v1.1 — Initial Structure (October 2025)

  • First version of the roadmap published.

  • Added foundation milestones and basic Python/SQL projects.

Karhomatul Faqih Al Amin
Data Engineering Learner
Data engineering learner with a strong interest in data pipelines, ETL processes, and scalable data systems. Currently pursuing an undergraduate degree in Data Science, I focus on building practical projects using Python, SQL, and modern data engineering tools. My learning journey emphasizes hands-on implementation, reproducibility, and aligning academic foundations with real-world data engineering needs.