# PDE05/2021.05.20: Data Engineering Certification Guide (Url = https://is.gd/ohemaj)

## Pre-Work

- https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
- https://towardsdatascience.com/data-engineer-vs-data-scientist-bc8dab5ac124
- https://www.oreilly.com/radar/data-engineers-vs-data-scientists/
- https://cloud.google.com/certification/guides/data-engineer
- Online community: https://groups.google.com/g/certjourney-pde05
- Google Drive: https://drive.google.com/drive/folders/104a4DGWNhfSRUs1HDGeSh5gMlSvW6zPX?usp=sharing
- If you cannot access the Google Drive or the online community, please fill in this form: https://docs.google.com/forms/d/e/1FAIpQLSd3HDAclU5HtGUKOhGBdo-NBzBk3Sh25RfiwLFOcIQq2RTRVw/viewform?usp=sf_link
- https://www.evernote.com/shard/s295/sh/5c9b8689-5635-4dc2-945f-fe0b40ba7139/6cac4bd4987b9be1c08355934d9f651e

## Week 1: Modernizing Data Lakes & Data Warehouses with GCP

- Day 1
    - Key Points
        - Data Engineer’s Role
        - Data Lake via Cloud Storage
        - Data Warehouse via BigQuery
    - Expanded Notes
        - Data Engineer’s Role
            1. Design and Build Data Pipelines (ETL)
            2. Data Warehouse
            3. Data Lake
            4. Business Intelligence
        - Google Cloud Storage
            - A data lake is a store optimized for holding large volumes of data of any kind. It offers cheap storage.
            - The analytics capabilities of a data lake are limited
            - Google Cloud Storage is the best fit for use as a data lake
            - You create a project in Google Cloud. You then create buckets in the project where you can save the data
            - Cloud Storage is PaaS. In other words, it automatically scales storage capacity. If a lot of people are accessing the data, it also automatically scales to meet the demand.
            - Ideal product for a data lake
                - That doesn’t mean it’s the only product for a data lake
            - How: the command-line utility is gsutil
                - gsutil cp file gs://bucket_name
                - gsutil mv
            - Advantages:
                - Highly scalable (autoscales). Effectively unlimited number of files without any manual maintenance overhead
                - Completely managed
                - Easy to use
            - Limitation: not suitable for very high-frequency I/O
                - Storage is accessed over the network
                - If you are doing heavy I/O on an SSD mounted on the device, it is going to be faster
        - BigQuery
            - Ideal product for a data warehouse
            - Uses SQL to query big data
                - Even though the data looks like a SQL table, it’s actually big data
            - Optimized for large volumes of data that are queried frequently, where most queries are reads (OLAP, not OLTP)
            - Accessed via the UI, the bq command-line tool, or a programming language
            - Advantages:
                - Serverless
                - Autoscales
                - Fully managed
                - Very easy to use (SQL for queries)
                - Machine learning capabilities
                - Cost effective, with alternative ways of managing and optimizing cost
            - Limitations:
                - Not best suited for updates.
                    - Updates can be done, and there is no upper limit on the number of updates. But it’s not optimised for them
                - Not low latency. Possible cost problems
            - Details
                - BigQuery is a data warehouse, not a database
                    - Even though it uses SQL for querying
                    - Even though the data is tabular
                    - Even though the data looks relational
                    - It is not a relational database
                - Differences between BigQuery and SQL databases
                    - BigQuery is for big data analytics; it’s not transactional like a SQL database
                    - BigQuery is not designed around row-level update and delete. (It does support update and delete through its DML statements. Check its documentation, and also check how good its performance is, because I am not sure)
                    - BigQuery doesn’t have indexing
                    - BigQuery doesn’t have a primary key or foreign key concept either
                    - BigQuery is a data warehouse used for analytics. It connects well with data visualization tools
                - BigQuery Pricing
                    - You pay for storage
                    - You pay for query execution
                        - You pay either by data processed or a flat rate
                            - Slots
- Data Lake
- Data Warehouse: BigQuery

## Week 2: Building Batch Pipelines on GCP

Storage

1. Cloud Storage

Databases on Google Cloud

1. Bigtable
2. Datastore
3. Firestore
4. Memorystore
    1. In-memory database. Used for caching
5. Spanner
6. Cloud SQL
    1. Relational database (MySQL, PostgreSQL & SQL Server)

Types of Databases

1. Relational Database
    1. Doesn't store big data
    2. Stores structured data (tabular format)
    3. Large number of reads & writes to the data. 80% are writes generally & 20% are reads
    4. Doesn't support a really high number of concurrent queries
    5. It is not a distributed system
2. The next evolution called for different kinds of databases for different requirements
    1. When the data is big, not structured, and distributed
        1. Bigtable
            1. Data is structured but not relational
            2. Low latency (on the order of milliseconds) & high number of concurrent queries
            3. Performance increases linearly with increased compute power
        2. Datastore
            1. Data structure is key-value pairs
        3. Firestore
            1. Optimized for realtime & mobile application backends
    2. In cases where the data is relational but is big data & needs high concurrent queries and high performance
        1. We need a relational database which is also a distributed database (Spanner)
    3. When the data is historical data vs when the data is live data

Data Processing

- Dataproc
    - First attempt at data processing of big data
- Dataflow
    - Dataflow is the latest data processing system
    - It has features which Dataproc doesn't
        - It does both streaming & batch data processing
        - It doesn't need manual intervention
        - It scales automatically
    - Dataflow Advanced
        - https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub
        - https://cloud.google.com/dataflow/docs/guides/flexrs
- Key Points
    - Data Processing for Batch Data using
        - Hadoop
        - Dataflow
        - Data Fusion
        - Dataprep
    - Data Pipeline Orchestration via Cloud Composer
- Expanded Notes
    - Data Processing for Batch Data
        - Dataproc - Managed Hadoop
            1. Migration of data & migration of clusters from on-prem to the cloud
            2. What does it do?
                1. Batch Data Processing
                    1. Hadoop, Spark, etc. to do data processing
                    2. Word count is the simplest and most common problem in Hadoop
                2. Streaming Data Processing
                    - Have to install the corresponding libraries & components
                3. Machine Learning
                    1. Spark MLlib
                4. Data Analytics
                5. Data Warehousing
                6. NoSQL Database
        - Dataflow
            1. All the data processing Dataproc can do, Dataflow can do too.
                1. Dataflow can do streaming data processing natively, unlike Dataproc
                2. Dataflow can't do anything other than data processing, unlike Dataproc
            2. Only does data processing
            3. Autoscales the compute for the code
            4. Streaming features are present too
            5. Dataflow requires Apache Beam to execute
                1. But with the help of BigQuery, you can run simple Dataflow jobs by writing a SQL query
        - Data Fusion
            1. Data processing pipelines with a UI tool (CDAP tool)
                1. Unlike Dataflow, which needs code
            2. Can build both batch and streaming pipelines
                1. More expensive product than Dataflow
                2. But easier than Dataflow
        - Dataprep
            1. UI tool (Trifacta) for data preparation, aimed at an ML audience
    - Data Pipeline Orchestration via Cloud Composer

Dataflow

- Dataflow can do only ETL
- It is fully managed
    - Cluster creation, cluster deletion, cluster autoscaling, cluster optimization and everything else is done automatically
    - We just write code and everything else is taken care of.
- It autoscales natively
    - You just write code
- Supports Java, Python and Go. With the new runner it supports even more
    - https://cloud.google.com/blog/products/data-analytics/multi-language-sdks-for-building-cloud-pipelines
    - https://beam.apache.org/documentation/runners/capability-matrix/
    - https://engineering.atspotify.com/2020/02/18/spotify-unwrapped-how-we-brought-you-a-decade-of-data/
- Lab Side Input Code: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/data_analysis/lab2/python/JavaProjectsThatNeedHelp.py

Dataproc

- Dataproc is a managed Hadoop product on Google Cloud, not infrastructure as a service
- Can be initialised with any open source libraries required, by simply giving the path of pre-written initialisation scripts
    - https://github.com/GoogleCloudDataproc/initialization-actions
- https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-overview
- Dataproc
    - Is managed Hadoop
    - We can do data processing/ETL on Dataproc
    - We can do data warehousing on Dataproc as well
    - We can run a NoSQL database on Dataproc as well
    - We can do machine learning on Dataproc as well
- But
    - It is a managed product, not a fully managed product
        - It means we still have to do partial management of the Dataproc cluster
    - Autoscaling is not a native feature but an added one. It’s not perfect.
        - Autoscaling doesn’t happen automatically. You have to create an autoscaling policy.
- gcloud beta interactive

## Week 3: Building Resilient Streaming Analytics Systems on GCP

## Week 4: Smart Analytics, Machine Learning and AI on GCP

## Week 5: Hands-on Lab Practice

## Week 6: Preparing for the Google Cloud Professional Data Engineer Exam & Check Readiness
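
As a quick readiness self-check on the Week 2 note that word count is the simplest and most common Hadoop problem, here is a minimal sketch of the map, shuffle and reduce phases in plain Python. It is not Hadoop, Spark or Apache Beam code. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in re.findall(r"[a-z0-9']+", line.lower()):
            yield word, 1


def shuffle_phase(pairs):
    """Group the emitted counts by word (the step between map and reduce)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in sequence in a single process."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, not taken from any course lab.
    sample = ["Data lake and data warehouse", "Data pipelines feed the warehouse"]
    for word, total in sorted(word_count(sample).items()):
        print(word, total)
```

On Dataproc or Dataflow the framework distributes these same three steps across workers. Here they run in one process, which is enough to check that the logic is understood.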