# Data Engineering 3 May Notes: is.gd/amusec
Git repo: if cloning is slow, use https://gitlab.com/ajinkyakolhe112/training-data-analyst
Lab: **googlecloud.qwiklabs.com** (log in and check that you can see the classroom on the home page under the "In Progress" section)
Slides: https://googlecloud.qwiklabs.com/classrooms/9748/notes
Objectives for the 4-Day Training
1. Day 1: Learn Data Lake (Google Cloud Storage) & Data Warehouse (BigQuery)
1. Day 2: Learn Data Processing via (DataProc, DataFlow). Learn Cloud Composer if Possible
1. Day 3: Learn Data Processing for Stream Data via (PubSub, DataFlow, DataFusion)
1. Day 4: Learn Analytics AI Products
---
AIM: For certification. (https://www.evernote.com/shard/s295/sh/ab8acf7b-98b0-46b3-afbd-3756b46a825e/ffb53c4f70d0fe7fb85f56a9a80bad2f)
Or
Your background: how your approach to learning Google Cloud changes according to your current role
- Either learning google cloud for the first time
- Learning things from scratch
- Shifting to use Google cloud in the current Job
- ETL Pipeline / Data Engineer
- Data Analyst
- (BigQuery)
- Business Analyst
- Data Architecting
- Data Lake
- Data Warehouse
- Database (SQL, NoSQL, NewSQL)
- Database admin
- Manages security
- Manages network
- Develops application
- All of these roles need to reinvent themselves because of Cloud
- Infrastructure Modernization (Infra = Place where we run the code)
- Imagine shopping
- Small shops
- Malls
- Online Shopping
- As software developers, we need to reinvent ourselves because of cloud
- Cloud is as transformational as WhatsApp has been for communication
- Data Modernization
In a hypothetical team of 10 people working on a Data Science project
- 1 Manager
- 3 Data Engineers
- 2 Data Analysts
- 1 or 2 Data Scientists
- 1 Data Science Researcher
- 1 Infrastructure Person
General GCP Resources
1. Introduction to Google Cloud - https://www.youtube.com/watch?v=UF2d0EDWGNA&list=PLY7sQ59Bufns3VafkhnHpbdbGBrTxSXwi
2. https://github.com/gregsramblings/google-cloud-4-words
3. https://gcp.solutions/
1. Problem
2. Design
1. Possible Solutions
2. Choosing one Solution Based on problem & requirements
3. Build
1. Developing the chosen solution
4. https://cloud.google.com/docs/tutorials : Very Important
5. Compare Cloud Providers
1. http://comparecloud.in/
2. https://cloud.google.com/docs/compare/azure
3. https://cloud.google.com/docs/compare/aws
6. https://www.gcpweekly.com or https://www.gcpweekly.com/gcp-resources/ : Very Important too
7. Details of certifications: https://www.evernote.com/shard/s295/sh/ab8acf7b-98b0-46b3-afbd-3756b46a825e/ffb53c4f70d0fe7fb85f56a9a80bad2f & practice exam link: https://cloud.google.com/certification/sample-questions/data-engineer & book to help with certification preparation: https://learning.oreilly.com/library/view/official-google-cloud/9781119618430/
8. Next 2020: https://cloud.google.com/blog/topics/google-cloud-next/complete-list-of-announcements-from-google-cloud-next20-onair
## Day 1: Data Lake & Data Warehouse
### Data Lake
1. https://cloud.google.com/architecture/build-a-data-lake-on-gcp
Data Lake: Google Cloud Storage (for detailed notes, see https://www.evernote.com/shard/s295/sh/4eae1b7f-4b9a-4ea8-98ed-c573bbce2a7a/082acabe8536aba84cb82999152e3f04)
- Cloud Storage = Distributed File System
- 4 Types of Storage classes
- Standard
- Nearline
- Coldline
- Archive
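
As a rough illustration of how the four storage classes above relate to access frequency, here is a minimal Python sketch. The function name `suggest_storage_class` is hypothetical, and the thresholds are an assumption: they simply reuse the minimum storage durations of Nearline (30 days), Coldline (90 days), and Archive (365 days) as a rule of thumb. It is not an official selection algorithm.

```python
def suggest_storage_class(days_between_reads: float) -> str:
    """Suggest a Cloud Storage class from how often the data is read.

    Rule of thumb only (an assumption of this sketch): the thresholds
    mirror the minimum storage durations of Nearline (30 days),
    Coldline (90 days) and Archive (365 days).
    """
    if days_between_reads < 30:
        return "STANDARD"
    if days_between_reads < 90:
        return "NEARLINE"
    if days_between_reads < 365:
        return "COLDLINE"
    return "ARCHIVE"


if __name__ == "__main__":
    # Print a suggestion for a few example access patterns.
    for days in (1, 45, 200, 1000):
        print(days, suggest_storage_class(days))
```

Real choices would also weigh retrieval cost and latency requirements, which this sketch ignores.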
Data Warehouse: BigQuery
Advanced Features
- Data Manipulation Language (DML) is supported, but BigQuery is not optimized for it
- Metadata queries to inspect the details of the data warehouse
- Nested & repeated fields
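
To make "nested & repeated fields" concrete, the sketch below models one row in plain Python. The `order` record and the helper `flatten_items` are hypothetical illustrations, not BigQuery APIs: the nested field plays the role of a STRUCT, the repeated field plays the role of an ARRAY, and flattening mimics what a query does when it joins a row with `UNNEST` of its repeated field.

```python
from typing import Dict, Iterator, List

# One row of a hypothetical "orders" table.
# "customer" is a nested field (a STRUCT in BigQuery terms),
# "items" is a repeated field (an ARRAY of STRUCTs).
order = {
    "order_id": 1,
    "customer": {"name": "Asha", "city": "Pune"},
    "items": [
        {"sku": "pen", "qty": 2},
        {"sku": "book", "qty": 1},
    ],
}


def flatten_items(row: Dict) -> Iterator[Dict]:
    """Yield one flat record per repeated item in the row."""
    for item in row["items"]:
        yield {
            "order_id": row["order_id"],
            "customer_name": row["customer"]["name"],
            "sku": item["sku"],
            "qty": item["qty"],
        }


flat_rows: List[Dict] = list(flatten_items(order))
for flat_row in flat_rows:
    print(flat_row)
```

Keeping the items inside the order row avoids a separate join table, which is the usual motivation for this layout in a warehouse.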
## Summary of Entire Training
- Day 1
- Key Points
- What is the Data Engineer's role?
- Doing Data Lake via Cloud Storage
- Doing Data Warehouse via BigQuery
- Expanded Notes
- Data Engineer's Role (4 Roles)
1. Build and Design Data Pipelines (ETL)
2. Data Warehouse
3. Data Lake
4. Business Intelligence Dashboard
- Google Cloud Storage
- Ideal product for Data Lake
- Doesn’t mean it’s the only product for Data Lake
- How: command line utility is gsutil
- gsutil cp file gs://bucket_name
- gsutil mv
- Advantages:
- Highly scalable (autoscales). Practically unlimited files without any manual maintenance overhead
- Completely Managed
- Easy to Use
- Limitation: Not suitable for very high-frequency I/O
- Storage is accessed over the network
- If you are doing heavy I/O on an SSD attached to the machine, it’s going to be faster
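
The `gsutil cp file gs://bucket_name` command in the Cloud Storage notes above can be scripted. The following is a minimal sketch that only validates a `gs://` URI and builds the command as an argument list; it does not run anything. The helper names `parse_gs_uri` and `gsutil_cp_command` and the file and object paths are hypothetical.

```python
import shlex
from typing import List, Tuple


def parse_gs_uri(uri: str) -> Tuple[str, str]:
    """Split gs://bucket/path/to/object into (bucket, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {uri!r}")
    return bucket, object_name


def gsutil_cp_command(local_path: str, destination_uri: str) -> List[str]:
    """Build (but do not execute) a `gsutil cp` command."""
    parse_gs_uri(destination_uri)  # raises if the destination is malformed
    return ["gsutil", "cp", local_path, destination_uri]


# Hypothetical file and bucket names, for illustration only.
command = gsutil_cp_command("sales.csv", "gs://bucket_name/raw/sales.csv")
print(shlex.join(command))
```

The list could be passed to `subprocess.run(command, check=True)` on a machine where gsutil is installed and authenticated.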
- BigQuery
- Ideal Product for Data Warehouse
- Uses SQL to query Big Data
- Even though the data looks like a SQL database, it’s actually big data
- Optimized for large volumes of data that are queried frequently, where most queries are reads (OLAP, not OLTP)
- Accessed via the UI, the bq command-line tool, or a programming language
- Advantages:
- Serverless
- Auto Scale
- Fully Managed
- Very Easy to Use (SQL for query)
- Machine Learning capabilities
- Cost effective, with alternative ways of managing and optimizing cost
- Limitation:
- Not well suited to updates.
- Updates can be done, and there is no upper limit on them, but BigQuery is not optimised for them
- Not low latency. Cost can become a problem
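
On the cost point above: with on-demand pricing, a BigQuery query is billed by the bytes it scans, and because storage is columnar, selecting fewer columns scans fewer bytes. The sketch below shows the arithmetic only. The function `estimate_scan_cost`, the price of 5.0 per TiB, and the table sizes are all hypothetical placeholders, so check the current price list before relying on any number.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def estimate_scan_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Estimate on-demand query cost as bytes scanned times price per TiB."""
    return bytes_scanned / TIB * price_per_tib


# Hypothetical numbers: a 5 TiB table and an assumed price of 5.0 per TiB.
full_scan = estimate_scan_cost(5 * TIB, price_per_tib=5.0)

# Selecting only the columns that hold one tenth of the table's bytes.
two_columns = estimate_scan_cost(5 * TIB // 10, price_per_tib=5.0)

print(full_scan, two_columns)
```

The same arithmetic explains why `SELECT *` on a wide table is the most common source of unexpected cost.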
- Day 2
- Key Points
- Data Processing for Batch Data using
- Dataproc (managed Hadoop)
- DataFlow
- Data Fusion
- Dataprep
- Data Pipeline Orchestration via Cloud Composer
- Expanded Notes
- Data Processing for Batch Data
- DataProc - Managed Hadoop
1. Migration of Data & Migration of Clusters from on prem to on cloud
2. What does it do?
1. Batch Data Processing
1. Hadoop, Spark, etc. to do data processing
2. Wordcount is the simplest and most common example problem in Hadoop
2. Streaming data processing: you have to install the corresponding libraries & components
3. Machine Learning
1. Spark ML Lib
4. Data Analytics
5. Data Warehousing
6. No SQL Database
- DataFlow
1. Dataflow can do all the data processing that Dataproc can.
1. Dataflow can do streaming data processing natively, unlike Dataproc
2. Dataflow can't do anything other than data processing, unlike Dataproc
2. Only do Data Processing
3. Autoscale of compute for the code
4. Streaming Features are present too
5. Dataflow runs pipelines written with Apache Beam
1. But with the help of BigQuery, you can run simple Dataflow jobs by writing a SQL query
- Data Fusion
1. Data Processing Pipelines with UI Tool (CDAP tool)
1. Unlike Dataflow which needs code
2. Can Build Batch and streaming pipelines both
1. More expensive product than dataflow
2. But easier than dataflow
- DataPrep
1. UI tool (Trifacta) for data preparation, aimed at an ML audience
- Data Pipeline Orchestration via Cloud Composer
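
The word-count problem mentioned in the Dataproc notes above can be sketched without a cluster. The code below imitates the map, shuffle, and reduce phases in plain Python on a single machine. The function names and the sample lines are hypothetical, and this is not Hadoop or Spark code, only an illustration of the pattern those engines distribute across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by their key (the word)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values to get one count per word."""
    return {word: sum(values) for word, values in grouped.items()}


# Hypothetical input; on a cluster each line could live on a different worker.
lines = ["big data on cloud", "data lake and data warehouse"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

On Dataproc the same three phases run in parallel over partitions of the input, with the shuffle moving data between workers.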
- Day 3
- Key Points
- Streaming Data Communication
- Streaming Data Processing using
- DataFlow
- BigQuery
- Data Visualization using Data Studio
- NoSQL Database: BigTable
- Expanded Summary
- Pub/Sub
1. Message communication system
2. Fully Managed Product
1. Auto Scale
2. Serverless
3. Only focus on coding
3. Alternative: Apache Kafka
4. Guarantees at-least-once delivery
1. The same message can be delivered multiple times
2. Messages can be delivered out of order
- DataFlow Streaming
1. Same as batch, plus a few streaming features such as
1. Watermark
2. Trigger
3. Windowing
2. Dataflow
1. Has a deduplication feature
2. Can order data by time
- BigQuery Streaming Inserts
- Data Studio / Looker or any other dashboarding tool
- BigTable
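
The Pub/Sub and Dataflow streaming notes above say that messages can arrive more than once and out of order, and that the streaming layer can deduplicate, order by time, and window them. The sketch below shows those three ideas in plain Python. The `Message` type, the helper names, and the sample data are hypothetical. It ignores watermarks and triggers, which a real streaming engine uses to decide when a window is complete.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Message(NamedTuple):
    message_id: str
    event_time: int  # seconds since the start of the stream
    value: int


def deduplicate(messages: Iterable[Message]) -> List[Message]:
    """Keep only the first copy of each message ID."""
    seen = set()
    unique = []
    for message in messages:
        if message.message_id not in seen:
            seen.add(message.message_id)
            unique.append(message)
    return unique


def fixed_windows(messages: Iterable[Message], window_size: int) -> Dict[int, List[Message]]:
    """Order messages by event time and group them into fixed windows."""
    windows: Dict[int, List[Message]] = defaultdict(list)
    for message in sorted(messages, key=lambda m: m.event_time):
        window_start = message.event_time - message.event_time % window_size
        windows[window_start].append(message)
    return dict(windows)


# Hypothetical delivery: "a" arrives twice and "b" arrives before older messages.
received = [
    Message("b", 65, 20),
    Message("a", 5, 10),
    Message("a", 5, 10),
    Message("c", 30, 5),
]
windows = fixed_windows(deduplicate(received), window_size=60)
totals = {start: sum(m.value for m in msgs) for start, msgs in windows.items()}
print(totals)
```

Grouping by event time rather than arrival time is what makes the per-window totals stable under redelivery and reordering.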
- Day 4
- Key Points
- Three ways of doing AI
- APIs
- Auto ML
- Custom Model
- AI Platform
- Hadoop
- BigQuery ML
- Doing AI Pipeline using Kubeflow
Get access to the O'Reilly library via ACM membership.
(O'Reilly is $60 per month, whereas ACM professional membership for developing countries is just $1 per month)
- Step 1: Get ACM professional membership from here: https://services.acm.org/public/qj/proflevel/countryListing.cfm?promo=PWEBTOP&form_type=Professional
- Step 2: Log in to O'Reilly using ACM credentials here: https://go.oreilly.com/acm
- Step 3: Leverage curated expert playlists for the topics that interest you
DataProc
1. DataProc: Managed Hadoop. Nothing but Hadoop on GCP. (Called EMR / Elastic MapReduce in AWS and HDInsight in Azure)
- Cluster: a group of machines that work together in parallel. The work is divided up and done in parallel. That is the basis of big data
- Migrate on-prem Hadoop to Dataproc on the cloud
- https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-jobs
- https://cloud.google.com/solutions/migration/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc
- https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-overview
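
The cluster idea described above (divide the work, do the parts in parallel, combine the results) can be imitated on one machine. In this minimal sketch a thread pool stands in for the worker machines. The helper names `split_into_chunks` and `parallel_sum` are hypothetical, and threads are only an analogy for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_chunks(values: List[int], num_chunks: int) -> List[List[int]]:
    """Divide the input into roughly equal chunks, one per worker."""
    chunk_size = -(-len(values) // num_chunks)  # ceiling division
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def parallel_sum(values: List[int], num_workers: int = 4) -> int:
    """Sum each chunk on its own worker, then combine the partial sums."""
    if not values:
        return 0
    chunks = split_into_chunks(values, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


total = parallel_sum(list(range(1, 101)))
print(total)
```

A real cluster adds what this sketch leaves out: moving data to the workers, and retrying work when a machine fails.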
BigQuery
- Database: read & write workloads, also called OLTP. Lots of reads & writes
- SQL
- Structured Data
- Not big data, not a distributed system
- NoSQL
- Structured or unstructured data
- Big Data
- Is a distributed System
- Wasn't easy to query & wasn't transactional
- NewSQL
- Structured but Big Data
- Combines the distributed computing of NoSQL with the easy querying of SQL
- Data Warehouse: OLAP. 80% of the time you are doing read queries on huge volumes of data
- Querying should be easy & should handle large volumes of data
- Data Lake: no database, no data analytics
BigQuery
- https://cloud.google.com/blog/products/bigquery/anatomy-of-a-bigquery-query
- SQL query
select language, sum(views) views
from bigquery-samples.wikipedia_benchmark.Wiki100B
where REGEXP_CONTAINS(title,".*o.*o.o.")
group by language
order by views desc
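
To show what the query above computes, here is a small Python equivalent run on a handful of made-up rows. The rows are hypothetical and stand in for the Wiki100B table. `re.search` is used because `REGEXP_CONTAINS` matches anywhere in the string; the filter, the group-by sum, and the descending sort mirror the SQL clauses one by one.

```python
import re
from collections import defaultdict

# Hypothetical sample rows standing in for the Wikipedia benchmark table.
rows = [
    {"language": "en", "title": "Monotony", "views": 100},
    {"language": "en", "title": "Monopoly", "views": 30},
    {"language": "de", "title": "Voodoo", "views": 40},
    {"language": "fr", "title": "Paris", "views": 900},
    {"language": "de", "title": "Berlin", "views": 300},
]

# WHERE REGEXP_CONTAINS(title, ".*o.*o.o.")
pattern = re.compile(r".*o.*o.o.")

# GROUP BY language, SUM(views)
views_by_language = defaultdict(int)
for row in rows:
    if pattern.search(row["title"]):
        views_by_language[row["language"]] += row["views"]

# ORDER BY views DESC
result = sorted(views_by_language.items(), key=lambda item: item[1], reverse=True)
print(result)
```

BigQuery performs the same filter and aggregation, but spread across many workers that each scan a slice of the table.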
- https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview
- BigQuery ML
- https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create
- BigQuery
- https://www.qwiklabs.com/quests/147?parent=catalog
- https://www.qwiklabs.com/focuses/3460?parent=catalog
- BigQuery ML
- https://www.qwiklabs.com/focuses/1797?parent=catalog
- https://www.qwiklabs.com/focuses/16547?parent=catalog
- https://www.qwiklabs.com/focuses/14294?parent=catalog
- Other competitions where BigQuery ML can be used
1. https://www.kaggle.com/c/zillow-prize-1 (1.2 million dollars prize)
- Predicting the price of a house
2. https://www.kaggle.com/c/two-sigma-financial-modeling (1 million dollars prize)
- Predicting the direction of share movement: up or down
3. https://www.kaggle.com/c/deloitte-western-australia-rental-prices (1 million dollars prize)
- Predicting rental price
4. https://www.kaggle.com/c/home-credit-default-risk (70k dollars prize)
- Predicting default or not
5. https://www.kaggle.com/c/santander-customer-transaction-prediction (65k dollars prize)
- Predicting whether a customer will buy or not
- Equivalent ML in DataProc: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/fe15f2de910ff4ccc4684b7a992619712d4d1f5a/CPB100/lab3b/sparkml/train_and_apply.py#L50