This repository serves as a comprehensive guide for individuals aspiring to become Data Engineers in 2024. It provides a detailed roadmap, recommended learning resources, and a collection of open-source projects to help you develop the necessary skills and gain hands-on experience.
Roadmap to becoming a data engineer in 2024
Beginners shouldn't feel overwhelmed by the number of tools and frameworks listed here. A typical data engineer masters only a subset of them over several years, depending on their company and career choices.
This section provides an overview of, and reference material for, the frameworks and technologies used in data engineering, why each one matters, and the role it plays in becoming a great Data Engineer.
Note: Some of the resource links below are broken; I am working on updating them.
- Programming Languages
- Data Processing Frameworks
- Data Storage and Querying
- Data Streaming and Messaging
- Data Orchestration and Workflow Management
- Cloud Computing
- Data Modeling and ETL/ELT
- Data Visualization and Reporting
- Soft Skills
- Projects and Certifications
- Interview Preparation
- Contributing
- Python for Data Analysis - Free book from O'Reilly covering Python for data analysis.
- Python Data Science Handbook - Free book that covers the essential knowledge for working with data in Python.
- Python for Data Analysis Video Series - Video tutorials from Corey Schafer.
- Scala Programming Language - Official documentation and learning resources for Scala.
- Java Programming Language - Official Java tutorials from Oracle.
- Apache Spark Official Documentation
- Learning Spark: Lightning-Fast Data Analytics - Free book from Databricks (requires email signup).
- Spark Programming Guide - Book from O'Reilly.
- Apache Hadoop Official Documentation
- Hadoop: The Definitive Guide - Book from O'Reilly.
- PostgreSQL Tutorial - Free comprehensive PostgreSQL tutorial.
- MySQL Tutorial - Free MySQL tutorial for beginners.
- MongoDB University - Free online courses and certifications for MongoDB.
- Apache Cassandra Documentation - Official documentation for Apache Cassandra.
- HBase Reference Guide - Official reference guide for Apache HBase.
- Apache Hive Tutorial - Official Apache Hive tutorial.
- Presto Documentation - Official documentation for Presto.
- Apache Impala Documentation - Official documentation for Apache Impala.
- Apache Kafka Documentation - Official Apache Kafka documentation.
- Kafka: The Definitive Guide - Book from O'Reilly.
- Kafka Streams Documentation - Official documentation for Kafka Streams.
- Apache Flink Documentation - Official Apache Flink documentation.
- Apache Storm Documentation - Official Apache Storm documentation.
- Apache Airflow Documentation - Official Apache Airflow documentation.
- Airflow Tutorial - Official Airflow tutorial.
- Mastering Apache Airflow - Book from O'Reilly.
A detailed mapping of the product offerings from the major cloud providers is available in cloud-product-mapping.
- AWS Big Data Services - Overview of AWS big data services.
- Amazon EMR Documentation - Official documentation for Amazon EMR.
- Amazon S3 Documentation - Official documentation for Amazon S3.
- Amazon Athena Documentation - Official documentation for Amazon Athena.
- Amazon Redshift Documentation - Official documentation for Amazon Redshift.
- Azure Data Services - Overview of Azure data and analytics services.
- Azure HDInsight Documentation - Official documentation for Azure HDInsight.
- Azure Data Lake Storage Documentation - Official documentation for Azure Data Lake Storage.
- Azure Synapse Analytics Documentation - Official documentation for Azure Synapse Analytics.
- Google Cloud Data Services - Overview of Google Cloud data and analytics services.
- Google Cloud Dataproc Documentation - Official documentation for Google Cloud Dataproc.
- Google Cloud Dataflow Documentation - Official documentation for Google Cloud Dataflow.
- Google BigQuery Documentation - Official documentation for Google BigQuery.
- Data Modeling for Data Warehouses - Book by Len Silverston and Paul Agnew.
- Data Vault Modeling Guide - Book from O'Reilly.
- ETL/ELT with Python - Book from O'Reilly.
- Apache NiFi Documentation - Official documentation for Apache NiFi.
- Talend Open Studio Documentation - Official documentation for Talend Open Studio.
- Tableau Desktop Resources - Free training resources for Tableau Desktop.
- Power BI Documentation - Official documentation for Microsoft Power BI.
- Apache Superset Documentation - Official documentation for Apache Superset.
- Problem-Solving Techniques - Resources for developing problem-solving skills.
- Effective Communication Skills - Resources for improving communication skills.
- Collaboration and Teamwork - Resources for enhancing collaboration and teamwork.
- AWS Certified Big Data Specialty - AWS Certified Big Data Specialty certification.
- Azure Data Engineer Associate - Azure Data Engineer Associate certification.
- Google Cloud Professional Data Engineer - Google Cloud Professional Data Engineer certification.
- Data Engineering Interview Questions - Collection of data engineering interview questions.
- System Design Interview Questions - System design interview questions and resources.
My main motive for setting up this repository is to consolidate different knowledge sources in one place. I will continue to improve it, but I'm only human and the technologies keep evolving. If you have suggestions, improvements, or additional resources to share, please feel free to contribute to this repository.



