# Project Planning: Phan Le Nhu Huyen
Date: 11/02/2024
## 1. Dataset Considered
The Million Song Dataset: http://millionsongdataset.com/
I chose the Million Song Dataset because it offers a good balance between manageable size and real-world complexity. It lets me build practical experience for my data engineering career path, with a particular focus on data warehouse design using tools such as AWS S3, Redshift, and Python.
The dataset is also rich enough to let me expand the project's scope later on by incorporating a data lake or a more elaborate data pipeline design.
## 2. Big Questions to Solve
#### 2.1 About S3 Storage and Staging Area:
* How will data be transferred from the original source to AWS S3 (direct upload, a data transfer tool)? See the upload sketch after this list.
* What is the data partitioning strategy for efficient retrieval from S3 (by date, by source, etc.)?
* Will there be any data transformation or cleaning steps performed in S3 (schema validation, compression)?
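As a concrete starting point for the transfer and partitioning questions above, here is a minimal sketch of a direct upload with boto3 under a Hive-style `year=` prefix. The bucket name, local directory layout, and year-based partition key are illustrative assumptions, not final decisions.

```python
# Sketch: upload extracted song files to S3 under a partitioned key prefix.
# The bucket name and the year-based partitioning scheme are assumptions.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "msd-raw-data"  # hypothetical bucket name

def upload_partitioned(local_dir: str, year: int) -> None:
    """Upload every .h5 file in local_dir under a year=<year> prefix."""
    for name in os.listdir(local_dir):
        if not name.endswith(".h5"):
            continue
        key = f"song_data/year={year}/{name}"  # Hive-style partition key
        s3.upload_file(os.path.join(local_dir, name), BUCKET, key)
```

A one-off script like this is enough for the Million Song subset; the full dataset would more likely call for a bulk transfer tool such as the AWS CLI's `aws s3 sync`.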
#### 2.2 About Redshift Staging and Transformation:
* What schema will be used for the staging tables in Redshift (mirroring source tables or normalized format)?
* How will data be loaded from S3 to Redshift staging tables (ETL tool, custom script)? A COPY-based sketch follows this list.
* What data transformation logic will be applied in the staging tables (data cleaning, deduplication, date formatting)?
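For the loading question above, here is a minimal sketch of a COPY-based load driven from Python, assuming the raw HDF5 files have already been flattened to JSON during the pre-processing step discussed in 2.1. The cluster endpoint, IAM role ARN, and staging table name are placeholders.

```python
# Sketch: load JSON files from S3 into a Redshift staging table with COPY.
# Connection details, IAM role, and table name are placeholders.
import psycopg2

STAGING_COPY = """
    COPY staging_songs
    FROM 's3://msd-raw-data/song_data/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
"""

def load_staging() -> None:
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder endpoint
        port=5439, dbname="dev", user="etl_user", password="***",
    )
    with conn, conn.cursor() as cur:
        cur.execute(STAGING_COPY)  # Redshift reads the partitioned prefix in parallel
```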
#### 2.3 About Data Warehouse Design:
* What dimensional model (star schema, snowflake schema) will be used for the data warehouse?
* What fact tables and dimension tables will be required for the desired analytics? An illustrative star-schema DDL sketch follows this list.
* How will the relationships between dimension and fact tables be defined (foreign keys)?
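To make the modelling questions above concrete, here is one possible star schema expressed as DDL strings: a song-level fact table with an artist dimension. The table and column choices are illustrative assumptions; the final model depends on the analytics the warehouse has to serve.

```python
# Sketch: one candidate star schema for the Million Song data.
# Redshift does not enforce PRIMARY KEY / REFERENCES constraints, but declaring
# them documents the intended relationships and informs the query planner.
STAR_SCHEMA_DDL = [
    """
    CREATE TABLE IF NOT EXISTS dim_artist (
        artist_id  VARCHAR PRIMARY KEY,
        name       VARCHAR,
        location   VARCHAR,
        latitude   FLOAT,
        longitude  FLOAT
    );
    """,
    """
    CREATE TABLE IF NOT EXISTS fact_song (
        song_id    VARCHAR PRIMARY KEY,
        artist_id  VARCHAR REFERENCES dim_artist(artist_id),
        year       SMALLINT,
        title      VARCHAR,
        duration   FLOAT,
        tempo      FLOAT,
        loudness   FLOAT
    );
    """,
]
```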
#### 2.4 About ETL Pipeline Implementation:
* What scheduling tool will be used to automate the ETL process? A scheduler sketch follows this list.
* How will error handling and data quality checks be integrated into the ETL pipeline?
* How will the success or failure of each ETL job be monitored and logged?
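One candidate answer to the scheduling question above is Apache Airflow; the sketch below wires three hypothetical tasks into a daily DAG. The task bodies, DAG id, and schedule are placeholders for the real pipeline steps.

```python
# Sketch: a daily Airflow DAG chaining extract -> load -> quality checks.
# The task callables are hypothetical stubs for the steps in sections 2.1-2.2.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_to_s3():       ...  # upload raw files to S3 (section 2.1)
def load_staging():        ...  # COPY into Redshift staging (section 2.2)
def run_quality_checks():  ...  # row counts, null checks, duplicate checks

with DAG(
    dag_id="msd_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    load = PythonOperator(task_id="load_staging", python_callable=load_staging)
    checks = PythonOperator(task_id="quality_checks", python_callable=run_quality_checks)
    extract >> load >> checks  # a failure stops downstream tasks and shows up in the UI
```

Airflow's task logs and web UI cover much of the monitoring question; per-task alerting (for example via `on_failure_callback`) can be layered on top.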
#### 2.5 About Security and Access Control:
* What access controls will be implemented to restrict access to sensitive data in Redshift? A read-only grants sketch follows this list.
* How will data encryption be used to protect data at rest and in transit?
* What user authentication mechanisms will be used for accessing the data warehouse?
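For the access-control question above, here is a minimal sketch of a read-only analyst group in Redshift. The group, user, and schema names are placeholders; encryption at rest and in transit would be configured on the cluster itself (KMS encryption, `require_ssl`) rather than in SQL.

```python
# Sketch: read-only access for analysts on the analytics schema.
# Group, user, and schema names are placeholders.
READ_ONLY_GRANTS = [
    "CREATE GROUP analysts;",
    "CREATE USER analyst_1 PASSWORD 'Replace_me_1' IN GROUP analysts;",
    "GRANT USAGE ON SCHEMA analytics TO GROUP analysts;",
    "GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO GROUP analysts;",
    # Keep future tables consistent with the grant above.
    "ALTER DEFAULT PRIVILEGES IN SCHEMA analytics GRANT SELECT TO GROUP analysts;",
]
```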
#### 2.6 About Testing and Deployment:
* What unit tests will be conducted to ensure the functionality of the ETL pipeline? A pytest sketch follows this list.
* How will the data warehouse be tested for data accuracy and completeness?
* What is the deployment strategy for the data warehouse (manual deployment, infrastructure as code)?
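The unit-testing question above can be answered at the level of individual transformation functions. The sketch below tests a hypothetical `clean_year()` helper with pytest; both the helper and its behaviour are illustrative assumptions.

```python
# Sketch: pytest unit tests for one transformation step.
# clean_year() is a hypothetical helper; the Million Song Dataset uses year 0
# to mean "unknown", which the warehouse should store as NULL.
def clean_year(raw_year: int):
    """Map the dataset's 0 placeholder year to None (NULL in the warehouse)."""
    return None if raw_year == 0 else raw_year

def test_clean_year_keeps_valid_years():
    assert clean_year(1999) == 1999

def test_clean_year_maps_zero_to_none():
    assert clean_year(0) is None
```

Data accuracy and completeness checks against the loaded warehouse (row counts versus source, orphaned foreign keys) would run as post-load queries rather than unit tests.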
## 3. High-Level Implementation Plan
#### Week 1: S3 Storage and Staging Area
- [x] Choose a data transfer method
- [x] Define a partitioning strategy
- [x] Pre-process data in S3 if needed
#### Week 2: Redshift Staging and Transformation
- [x] Define Staging Table Schema
- [x] Load Data from S3 to Redshift
- [x] Apply Data Transformation Logic
#### Week 3: Data Warehouse Design
- [x] Analyze Business Requirements and Data Sources
- [x] Choose Dimensional Model
- [x] Design Fact Tables
- [x] Design Dimension Tables
#### Week 4: ETL Pipeline Implementation
- [ ] Automate the ETL Process
- [ ] Add Error Handling and Data Quality Checks
- [ ] Monitor and Log Job Success/Failure