# Data Engineering - 19 April 2021

Extra Resources: https://is.gd/sagotu
Slides & Labs: www.googlecloud.qwiklabs.com

### Course Objectives

1. Data Lake & Data Warehouse (Day 1)
2. Data Processing with Dataproc, Dataflow, Data Fusion, Dataprep (Day 2)
3. Streaming Data Processing with Dataflow, BigQuery, Pub/Sub (Day 3)
4. Data Science with AI Platform (Day 4)

## Introduction

General GCP Resources

* Learning Path: https://cloud.google.com/training/data-ml#data-engineer-learning-path
* Labs URL: https://googlecloud.qwiklabs.com/
* Technical Guide: https://cloud.google.com/docs/tutorials
* Solution Diagram Guide: www.gcp.solutions
* General Blogs for GCP Implementation: gcpweekly.com

Instructions to Qwiklabs (https://imgur.com/a/WvY0f1K)

1. Sign in to googlecloud.qwiklabs.com
2. Start the lab
3. Open a new incognito window (or a different browser) and go to console.cloud.google.com in that window
   - Sign in using the free username & password generated in step 2
   - Click on "Select a Project" in the blue menu bar and select the project ID created in step 2
   - Now you can execute the lab in the project

In a hypothetical team of 10 people working on a Data Science project

- 1 Manager
- 3 Data Engineers
- 1 or 2 Data Scientists
- 2 Data Analysts
- 1 Data Science Researcher
- 1 Infrastructure Person

How your approach to learning Google Cloud changes with your background and current role

- Learning Google Cloud for the first time
  - Learning things from scratch
- Shifting to Google Cloud in the current job
  - ETL Pipeline / Data Engineer
  - Data Analyst (BigQuery)
  - Business Analyst
  - Data Architecting
    - Data Lake
    - Data Warehouse
    - Database
  - Database admin
  - Manages security
  - Manages network
  - Develops applications
- All of these roles need to reinvent themselves because of cloud
- Infrastructure Modernization (Infra = place where we run the code)
  - Imagine shopping
    - Small shops
    - Malls
    - Online shopping
  - As software developers, we need to reinvent ourselves because of cloud
  - Cloud is as transformational as WhatsApp has been for communication
- Data Modernization

Options for Data Storage

1. Database
   - SQL
   - NoSQL
   - NewSQL
1. Data Warehouse: analytical purposes
1. Data Lake: storage purposes

Data Lake: Google Cloud Storage (for detailed notes, check https://www.evernote.com/shard/s295/sh/4eae1b7f-4b9a-4ea8-98ed-c573bbce2a7a/082acabe8536aba84cb82999152e3f04)

- Cloud Storage = distributed object store (used like a distributed file system)
- 4 types of storage classes
  - Standard
  - Nearline
  - Coldline
  - Archive

Data Warehouse: BigQuery Advanced Features

- Data Manipulation Language (DML) is supported, but BigQuery is not optimized for it
- Metadata queries to inspect the details of the data warehouse
- Nested & repeated fields

## Summary Day 1

- Key Points
  - What is the Data Engineer's role?
  - Building a Data Lake with Cloud Storage
  - Building a Data Warehouse with BigQuery
- Expanded Notes
  - Data Engineer's Role (4 Roles)
    1. Build and Design Data Pipelines (ETL)
    2. Data Warehouse
    3. Data Lake
    4. Business Intelligence Dashboard
  - Google Cloud Storage
    - Ideal product for a Data Lake
      - That doesn't mean it's the only product for a Data Lake
    - How: the command line utility is gsutil
      - gsutil cp file gs://bucket_name
      - gsutil mv
    - Advantages:
      - Highly scalable (autoscaling): effectively unlimited files without manual maintenance overhead
      - Completely managed
      - Easy to use
    - Limitation: not suitable for very high frequency IO
      - Storage is accessed over the network
      - High IO against an SSD mounted on the machine itself is going to be faster
  - BigQuery
    - Ideal product for a Data Warehouse
    - Uses SQL to query big data
      - Even though the interface looks like a SQL database, the data underneath is big data
    - Optimized for large volumes of data that are queried frequently, mostly with read queries (OLAP, not OLTP)
    - Access via the UI, the bq command line tool, or a programming language
    - Advantages:
      - Serverless
      - Autoscaling
      - Fully managed
      - Very easy to use (SQL for queries)
      - Machine Learning capabilities
      - Cost effective, with several options for cost management and optimization
    - Limitation:
      - Not well suited to updates
        - Updates can be done, and there is no upper limit on them, but BigQuery is not optimized for them
      - Not low latency; costs can become a problem

## Day 2

- Managed Product vs Product vs Fully Managed Product
- AutoScale vs Scale

O'Reilly Details

- Get access to the O'Reilly library via ACM membership.
- O'Reilly is $60 per month, whereas ACM professional membership for developing countries is just $1 per month
- Step 1: Get ACM professional membership from here: https://services.acm.org/public/qj/proflevel/countryListing.cfm?promo=PWEBTOP&form_type=Professional
- Step 2: Log in to O'Reilly using ACM credentials here: https://go.oreilly.com/acm
- Step 3: Leverage curated expert playlists for topics of your interest

Dataflow Coding:

- https://beam.apache.org/blog/beam-kata-release/ (one of the best learning guides)
- https://docs.google.com/presentation/d/17eq17-4KYvF1-2sCOo0sSUdm6gj4h6sWLhLDUYOe1cU/edit#slide=id.g119cd57211_0_16

Detailed notes on all the products: https://www.evernote.com/shard/s295/sh/5c9b8689-5635-4dc2-945f-fe0b40ba7139/6cac4bd4987b9be1c08355934d9f651e

Summary of the whole course: https://www.evernote.com/shard/s295/sh/548afd23-9813-7837-f19c-adb6a00dc5f5/9dd7a5247c8da3dd5867a1fa0e592372
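
The four Cloud Storage classes from the introduction (Standard, Nearline, Coldline, and the archive class) differ mainly in how often the data is expected to be read. As a rough sketch, the Python below encodes the minimum storage durations from Google's public documentation (30, 90, and 365 days) and picks the coldest class that fits. The helper name `suggest_storage_class` and the idea of using "days between reads" as the only input are assumptions made for illustration, not part of any Google API.

```python
# Minimum storage duration in days per Cloud Storage class,
# listed from hottest to coldest.
MIN_STORAGE_DAYS = {
    "STANDARD": 0,
    "NEARLINE": 30,
    "COLDLINE": 90,
    "ARCHIVE": 365,
}


def suggest_storage_class(days_between_reads: float) -> str:
    """Hypothetical helper: return the coldest class whose minimum
    storage duration is not longer than the typical gap between reads."""
    if days_between_reads < 0:
        raise ValueError("days_between_reads must be non-negative")
    choice = "STANDARD"
    # Dicts keep insertion order, so this walks from hot to cold.
    for name, min_days in MIN_STORAGE_DAYS.items():
        if days_between_reads >= min_days:
            choice = name
    return choice


if __name__ == "__main__":
    for days in (1, 45, 120, 400):
        print(days, suggest_storage_class(days))
```

A real decision would also weigh retrieval fees and early-deletion charges, which this sketch ignores.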
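
The "nested & repeated fields" feature listed under BigQuery means that a single row can hold a list of records (an ARRAY of STRUCTs), which queries usually flatten with `UNNEST`. The snippet below is a plain-Python stand-in for that flattening, not the BigQuery client library; the sample `orders` data and the `unnest` helper are made up for illustration.

```python
from typing import Any, Dict, Iterator, List

# Each row has a repeated field `items` holding nested records.
orders: List[Dict[str, Any]] = [
    {"order_id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "items": [{"sku": "C", "qty": 5}]},
    {"order_id": 3, "items": []},  # empty array: produces no output rows
]


def unnest(rows: List[Dict[str, Any]], repeated_field: str) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per element of the repeated field,
    mimicking a cross join between each row and its own array."""
    for row in rows:
        parent = {k: v for k, v in row.items() if k != repeated_field}
        for element in row[repeated_field]:
            yield {**parent, **element}


flat_rows = list(unnest(orders, "items"))

if __name__ == "__main__":
    for flat in flat_rows:
        print(flat)
```

This is roughly the shape of result that `SELECT order_id, item.sku, item.qty FROM orders, UNNEST(items) AS item` would return in BigQuery, assuming a table with the same hypothetical schema.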