# Done: Big Data and Machine Learning - 12 March 2021
> This Doc: **https://tinyurl.com/gcp-bdml-12mar21**
> Slides: https://is.gd/ijotuy or https://1drv.ms/b/s!Aq6hYeVV5o6DhstxiE85WHGWV5bISQ?e=E0U9d8
Timing: 10.00AM to 2.45PM
Lunch Break: 1.00PM to 1.45PM
---
# Notes
## Module 1: Intro to GCP
General GCP Resources
- Check the learning paths: https://cloud.google.com/training#learning-paths
- https://www.gcpweekly.com/gcp-resources/
- https://github.com/gregsramblings/google-cloud-4-words
- Revisit today’s course online (4 different ways):
1. https://www.coursera.org/learn/gcp-big-data-ml-fundamentals (Audit the course to take it for free; see https://www.classcentral.com/report/coursera-signup-for-free/)
2. https://cloudonair.withgoogle.com/events/apac-gcp-fundamentals-series
3. https://cloudonair.withgoogle.com/events/cloud-onboard-data-fundamentals
4. https://www.youtube.com/playlist?list=PLY7sQ59Bufns3VafkhnHpbdbGBrTxSXwi
- Free Google Cloud Access: https://go.qwiklabs.com/qwiklabs-free
- Details of Certification: https://www.evernote.com/shard/s295/sh/ab8acf7b-98b0-46b3-afbd-3756b46a825e/ffb53c4f70d0fe7fb85f56a9a80bad2f
Types of Jobs on Cloud
1. Migration: we move where the codebase runs from on-prem to the cloud.
- Google Cloud has a partner ecosystem, and partners are heavily involved in migration.
- Many technical guides on migration can be found here: https://cloud.google.com/docs/tutorials
2. Cloud-native development or a new application: generally startups do this, or MNCs building a new application
3. Feature addition on a pre-existing cloud environment or product
4. Maintenance
Components of Big Data Systems
- Databases: SQL, NoSQL, NewSQL
- Data Lake & Data Warehouse
- Data Processing, ETL Pipeline
- Business Intelligence
- Artificial Intelligence: Machine Learning or Deep Learning
OnPrem vs Cloud vs Serverless Cloud
- OnPrem - User configured, user managed and user maintained
- Cloud - User configured, provider managed and provider maintained
- Different Ways of Using Cloud
- (on prem is user bought, user configured & user maintained)
- Infrastructure as a Service: User configured, user maintained & Provider provided
- Platform as a Service / Managed Product: User configured, provider managed & maintained, but some work is still needed from the user
- Fully Managed / Serverless: Everything is done by the provider. User just codes
- Example: a restaurant analogy
- IaaS: You cook in the restaurant.
- PaaS: Buffet self service
- Serverless: Waiter serves you prepared food
- https://www.episerver.com/articles/pizza-as-a-service & https://www.bmc.com/blogs/saas-vs-paas-vs-iaas-whats-the-difference-and-how-to-choose/
- Serverless Cloud - Fully automated and no configuration required
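
The split of responsibilities above can be written down as a small lookup table. This is only a sketch: the keys and the field names (`provides`, `configures`, `maintains`) are made up for illustration, and the values restate these notes rather than any official GCP definition.

```python
# Who does what under each model, as described in the notes above.
# Field names are illustrative, not official terminology.
RESPONSIBILITY = {
    "on_prem":    {"provides": "user",     "configures": "user",     "maintains": "user"},
    "iaas":       {"provides": "provider", "configures": "user",     "maintains": "user"},
    "paas":       {"provides": "provider", "configures": "user",     "maintains": "provider"},
    "serverless": {"provides": "provider", "configures": "provider", "maintains": "provider"},
}


def user_tasks(model: str) -> list[str]:
    """Return the tasks that remain with the user for a given model."""
    return [task for task, owner in RESPONSIBILITY[model].items() if owner == "user"]


for model in RESPONSIBILITY:
    print(f"{model:<10} user handles: {user_tasks(model) or 'just the code'}")
```

Reading the table top to bottom shows the user's share of the work shrinking until, with serverless, only the code is left.
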
## Module 2: Recommendations & Predictions with Hadoop
Storage: Apache HDFS/Amazon S3/Google Cloud Storage
Processing: Hadoop/MapReduce/Spark/YARN/HiveQL/Pig Latin
Recommendation is Part of AI
Any AI needs 3 things
1. Data
- Can be saved in SQL or NoSQL databases
- Can be saved in a data lake or data warehouse
- Data needs to be processed to bring it into the format the ML model needs
2. Model
- Many frameworks and languages can be used to write an ML model: TensorFlow, PyTorch, MLlib, scikit-learn, XGBoost, BigQuery ML, R, …
3. Infrastructure to run Model
Products
1. Dataproc: managed Hadoop, i.e. Hadoop on GCP. (The equivalents are EMR/Elastic MapReduce on AWS and HDInsight on Azure.)
- Cluster: a group of machines that work together in parallel. The work is divided up and done in parallel; that is the basis of big data.
- Migrate on-prem Hadoop to Dataproc on the cloud
- https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-jobs
- https://cloud.google.com/solutions/migration/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc
- https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-overview
2. Cloud SQL: managed RDBMS (MySQL, SQL Server and PostgreSQL)
- OLTP
- Migrate on-prem SQL to Cloud SQL
- Migrate Oracle to Cloud SQL: https://cloud.google.com/solutions/migrating-data-from-oracle-to-cloud-sql-for-mysql, or migrate MySQL to Cloud SQL: https://cloud.google.com/solutions/migrating-mysql-to-cloudsql-concept
- Migrate others
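
The "divide the work and do it in parallel" idea behind a cluster can be sketched on a single machine. In this minimal example, threads stand in for the machines of a cluster and a word count stands in for the job; the function names are hypothetical, and this is not how Hadoop or Dataproc is actually programmed.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, n_workers):
    """Divide the input into roughly equal chunks, one per worker."""
    return [lines[i::n_workers] for i in range(n_workers)]


def count_words(chunk):
    """'Map' step: each worker counts the words in its own chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, n_workers=3):
    chunks = split_into_chunks(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # 'Reduce' step: merge the partial results into one answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = ["big data on cloud", "big data on prem", "cloud is big"]
print(parallel_word_count(lines))
```

Each worker only ever sees its own chunk, so adding workers (or machines) lets the same job handle more data.
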
## Module 2.5
BigQuery
- Uses SQL to query big data. It is a fully managed / serverless product.
- It is a data warehouse, not a database. The data looks like an RDBMS, but a data warehouse is optimized for read queries, not updates or deletes.
- Reddit list of BigQuery datasets (references): https://www.reddit.com/r/bigquery/wiki/datasets
- BigQuery Syntax: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
- Advanced features
- Supports geospatial data
- Supports Machine learning
- BigQuery
- https://cloud.google.com/blog/products/gcp/anatomy-of-a-bigquery-query
- https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview
- BigQuery ML
- https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create
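
A read-heavy aggregate query is the kind of workload BigQuery is built for. The sketch below builds such a query and shows one way to send it from Python. It assumes the `google-cloud-bigquery` package is installed and GCP credentials are configured; the public table and its `name` and `number` columns are assumptions for illustration, so check them in the BigQuery console before relying on them.

```python
# Read-heavy analytics query of the kind a data warehouse is optimized for.
# The table and column names below are assumptions for illustration.
QUERY = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 5
"""


def run_query(sql: str):
    """Send the SQL to BigQuery and return the result rows as dicts.

    Needs the google-cloud-bigquery package and GCP credentials,
    so the import is kept inside the function.
    """
    from google.cloud import bigquery

    client = bigquery.Client()         # uses the default project and credentials
    rows = client.query(sql).result()  # blocks until the query job finishes
    return [dict(row.items()) for row in rows]


print(QUERY.strip())
# To execute against BigQuery: print(run_query(QUERY))
```

There is no cluster to size or start before calling `run_query`; that is what "serverless" means in practice here.
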
AI (Artificial Intelligence): learning from data (the superset)
DL (Deep Learning): learning from semi-structured or unstructured data, such as data held in NoSQL stores. The data is generally big data, but it need not be.
- Mostly data of the kind human senses capture (images, audio, text); machine sensors often produce tabular data, which falls under ML
ML (Machine Learning): learning from tabular or structured data, such as data in an RDBMS. The data may or may not be big data.
DS (Data Science): not limited to data modelling; it also covers data storage, data processing, etc., and cloud computing too
- Collect data and organize it in a data warehouse (DW), not only create models on Compute Engine
- A data warehouse is a central place to store all types of data that might be needed in the future
BigQuery Data Warehouse: best suited for big data analytics queries. A very easy product to use, and it’s serverless.
Keywords
1. AI or Artificial Intelligence
- AI is superset
- ML is subset of AI
- DL is subset of ML
- Data Science is ML or DL on Cloud using all related tools for the job
2. Data Science
3. ML or Machine Learning
- Tabular Data
4. DL or Deep Learning
- Unstructured Data. Images, Video, Audio, Text etc
Every AI is made up of
1. Data
2. Model
3. Infra to run the Model
Types of AI Problem
1. Recommendation AI
2. Value Prediction AI
3. Class Prediction AI
4. Anomaly Detection AI
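
Each problem type above can be paired with a BigQuery ML model type. The pairing below is one illustrative choice, not the only one, and the dataset, table and column names are hypothetical. The helper only builds the `CREATE MODEL` statement as a string; it does not run anything. Matrix factorization in particular needs extra options (user, item and rating columns) that this sketch leaves out.

```python
# One possible mapping from the problem types above to BigQuery ML model types.
# The pairing is an illustrative choice; other model types can fit each problem.
MODEL_TYPE_FOR_PROBLEM = {
    "recommendation": "MATRIX_FACTORIZATION",
    "value_prediction": "LINEAR_REG",
    "class_prediction": "LOGISTIC_REG",
    "anomaly_detection": "KMEANS",
}


def create_model_sql(problem, model_name, training_query, label=None):
    """Build a CREATE MODEL statement (a string only; nothing is executed)."""
    options = [f"model_type='{MODEL_TYPE_FOR_PROBLEM[problem]}'"]
    if label is not None:
        options.append(f"input_label_cols=['{label}']")
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS({', '.join(options)}) AS\n"
        f"{training_query}"
    )


print(create_model_sql(
    "class_prediction",
    "my_dataset.default_model",          # hypothetical dataset.model
    "SELECT * FROM `my_dataset.loans`",  # hypothetical training table
    label="defaulted",                   # hypothetical label column
))
```
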
https://docs.looker.com/data-modeling/learning-lookml/what-is-lookml
- BigQuery (Qwiklabs labs)
- https://google.qwiklabs.com/quests/147?parent=catalog
- https://google.qwiklabs.com/focuses/3460?parent=catalog
- BigQuery ML (Qwiklabs labs)
- https://google.qwiklabs.com/focuses/1797?parent=catalog
- https://google.qwiklabs.com/focuses/14294?parent=catalog
- Other competitions where BigQuery ML can be used
1. https://www.kaggle.com/c/zillow-prize-1 (prize: 1.2 million dollars)
- Predicting the price of a house
2. https://www.kaggle.com/c/two-sigma-financial-modeling (prize: 1 million dollars)
- Predicting the direction of share price movement: up or down
3. https://www.kaggle.com/c/deloitte-western-australia-rental-prices (prize: 1 million dollars)
- Predicting rental prices
4. https://www.kaggle.com/c/home-credit-default-risk (prize: 70k dollars)
- Predicting whether a borrower defaults or not
5. https://www.kaggle.com/c/santander-customer-transaction-prediction (prize: 65k dollars)
- Predicting whether a customer will buy or not
Different roles in a data team (hypothetical scenario)
- 1x Infra (Physical, IaaS or Cloud)
- 1x DevOps (Stack automation, Containers and Platform as a Service)
- 2x or 3x Data Engineer (Data Pipelines, Data Automation, Data as a Service, Data Ingestion)
- 2x Analytics (1x Batch Analytics, 1x Real-Time Analytics and Predictive APIs)
- 1x AI and ML Data Scientist (Machine Learning and AI algorithms)
- 1x Front-End Dev (Web and JS developer, web and mobile apps)
- Specialized Roles
- 1x Network Architect
- 1x Security Engineer
- 1x Data Viz Developer
## Module 4
Different ways of doing AI
1. API
1. Ready-to-use API. No need to write a model or provide data
2. Good enough but not great
2. AutoML
1. No need to write a model. You only need to provide a subset of data, and you get more accuracy than with the API
3. Custom Model
1. Your own model, your own data.
2. Writing a custom model is complex
3. Different ways of writing a custom model
1. Easiest: BigQuery ML
2. Easier: Dataproc
3. Easy: Keras
4. Hard: TensorFlow
---
# Questions
- [ ] When renting capacity, what if we are not aware of peak times and random access patterns? How are surprise demands managed?
- [ ] What is the market for Collaboration Professional Certified Engineer? Does the path only require administration knowledge?