# Done: Big Data & ML on GCP Fundamentals - 2 Feb 2021
- **Beware of company VPN**. Try disabling the VPN, or use a personal mobile connection or a personal laptop.
- Collaborative Notes: is.gd/ehidec
---
## Notes
### Section 1: Introduction to Google Cloud
What is Cloud
- A cloud is essentially a company's specialized datacenter that it rents out
- The provider rents infrastructure to others so that we don't have to buy and manage it
- On premises, we need to do everything ourselves
- Google Cloud offers IaaS, PaaS, and managed services
- https://www.episerver.com/articles/pizza-as-a-service
- Google Cloud Learning Paths: https://cloud.google.com/training
Types of Jobs on Cloud
1. **Migration**: we move the codebase from on-prem to the cloud
   * Google Cloud has a partner ecosystem, and partners are heavily involved in migration.
   * Many technical guides on migration are available at https://cloud.google.com/docs/tutorials
2. **Cloud-Native Development or New Application**: generally what startups do
3. Feature addition to a pre-existing product in a cloud environment
4. Maintenance
Data Scientist vs Data Engineer
- Overlapping skill sets but different specializations
- Data Science is overcrowded; if you are really good, go for it. Otherwise, consider Data Engineering if it interests you.
- Demand for data engineers versus data scientists is roughly 3:1, whereas supply is roughly 1:3.
- Focus on less crowded areas; otherwise, be the best in the crowd.
Big Data
- Only about 1% of all data produced is analysed, so there is huge potential for growth in this field
- Cloud makes analyzing data at such a large scale easier than on-prem systems do
Different roles in Data Team
- 1x Infra (Physical, IaaS or Cloud)
- 1x DevOps (Stack automation, Containers and Platform as a Service)
- 2x **Data Engineer** (Data Pipelines, Data Automation, Data as a Service, Data Ingestion)
- 2x **Analytics** (1x Batch Analytics, 1x Real-Time Analytics and Predictive APIs)
- 1x AI and ML **Data Scientist** (Machine Learning and AI algorithms)
- 1x Front-End Dev (Web and Js developer, web and mobile apps)
- Specialized Roles
  - 1x Network Architect
  - 1x Security Engineer
  - 1x Community Writer
  - 1x Data Viz Developer
Data Engineer (https://github.com/igorbarinov/awesome-data-engineering & https://awesomedataengineering.com/)
- Programming
- Database
  - SQL & NoSQL
- Data Lake, Data Warehouse / Analytics (Intermediate)
  - Analysts specialize in just this part of the process
- ETL / Data Processing Pipelines
  - Batch & Stream
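
To make the batch versus stream distinction above concrete, here is a minimal pure-Python sketch. The sample events and the function names (`batch_totals`, `stream_totals`) are hypothetical and were not part of the session; a real pipeline would use tools such as Spark or Kafka rather than in-memory lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical sample events: (user_id, purchase_amount).
EVENTS = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]


def batch_totals(events: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Batch: read the whole bounded dataset, then return one final result."""
    totals: Dict[str, float] = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)


def stream_totals(
    events: Iterable[Tuple[str, float]]
) -> Iterator[Tuple[str, float]]:
    """Stream: update state per event and emit a result as each event arrives."""
    totals: Dict[str, float] = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
        yield user, totals[user]


if __name__ == "__main__":
    print(batch_totals(EVENTS))          # one answer after all data is read
    for update in stream_totals(EVENTS):
        print(update)                    # one running answer per event
```

The batch function only answers once all input has been seen, while the stream function keeps a running state and produces an up-to-date answer after every event.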
Google Cloud different products
- Compute
  - Custom hardware: TPU, well suited to ML workloads
  - A TPU is an ASIC (application-specific integrated circuit); ASICs are reportedly also used by Red Hat OpenShift Container Platform (OCP) and IBM CP4A
- Storage
- Networking
  - Networking in Google Cloud is powerful
  - Private fiber-optic cables give Google Cloud high-bandwidth networking
- Security
  - Security is a shared responsibility
  - IBM Security Services, a.k.a. Managed Security Services (MSS), was cited as a global market leader in web and cyber security
- Big Data & ML products
Summary:
* Explosion (growth) of data
* Cloud plays an important role in Big Data
* Data Scientists vs Data Engineers (demand and supply)
* Big Data & Analytics + Cloud Migration/Modernisation
* With the advent of technologies like Kubernetes, Docker containers, and Red Hat OpenShift with IBM CP4A, high availability is achievable.
### Section 2: Product Recommendation using SQL & Spark
Storage: Apache HDFS/Amazon S3/Google Cloud Storage
Processing: Hadoop/MapReduce/Spark/YARN/HiveQL/Pig Latin/Kafka
Computation:
Recommendation is Part of AI
Any AI needs 3 things
1. Data
   - Can be saved in SQL or NoSQL databases
   - Can be saved in a data lake or data warehouse
   - Data needs to be processed into the format the ML model needs
2. Model
   - Many languages and libraries can be used to write an ML model: TensorFlow, PyTorch, MLlib, scikit-learn, XGBoost, BigQuery ML, R, ...
3. Infrastructure to run the model
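
The three ingredients above can be sketched end to end in a few lines. This is only an illustration: it uses a simple item co-occurrence count in plain Python, not the SQL and Spark approach named in the section title, and the purchase data and function names (`fit_cooccurrence`, `recommend`) are made up for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, List, Set

# 1. Data: hypothetical purchase history, user -> set of product ids.
PURCHASES: Dict[str, Set[str]] = {
    "u1": {"laptop", "mouse"},
    "u2": {"laptop", "mouse", "keyboard"},
    "u3": {"laptop", "keyboard"},
    "u4": {"phone", "case"},
}


def fit_cooccurrence(purchases: Dict[str, Set[str]]) -> Dict[str, Counter]:
    """2. Model: count how often two products are bought by the same user."""
    model: Dict[str, Counter] = defaultdict(Counter)
    for basket in purchases.values():
        for a, b in permutations(sorted(basket), 2):
            model[a][b] += 1
    return model


def recommend(model: Dict[str, Counter], owned: Set[str], k: int = 2) -> List[str]:
    """Rank products the user does not own by co-occurrence with owned ones."""
    scores: Counter = Counter()
    for item in owned:
        for other, count in model.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


if __name__ == "__main__":
    # 3. Infrastructure: here, simply the local Python interpreter.
    model = fit_cooccurrence(PURCHASES)
    print(recommend(model, {"mouse"}))
```

At real scale, the same three roles would be played by a warehouse or data lake, a library such as MLlib, and a cluster or managed service.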
### Section 3: Data Warehouse BigQuery and BigQuery ML
AI (Artificial Intelligence): learning from data (the superset of DL and ML)
DL (Deep Learning): learning from semi-structured or unstructured data, such as data kept in NoSQL stores. The data is generally big data, but it need not be.
- Mostly about data from human senses; data from machine sensors is often tabular, so it falls under ML
ML (Machine Learning): learning from tabular or structured data, such as data in an RDBMS. The data could be big data or not.
DS (Data Science): not limited to data modelling; it also covers data storage, data processing, and cloud computing
- Collect data and organize it in a data warehouse (DW), rather than only creating models on Compute Engine
- A data warehouse is a central place to store all types of data that might be needed in future
BigQuery Data Warehouse: best suited for big data analytics queries. It is very easy to use, and it's serverless.
BigQuery
- https://cloud.google.com/blog/products/gcp/anatomy-of-a-bigquery-query
- https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview
- BigQuery ML
  - https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create
- BigQuery (Qwiklabs)
  - https://google.qwiklabs.com/quests/147?parent=catalog
  - https://google.qwiklabs.com/focuses/3460?parent=catalog
- BigQuery ML (Qwiklabs)
  - https://google.qwiklabs.com/focuses/1797?parent=catalog
  - https://google.qwiklabs.com/focuses/14294?parent=catalog
- Other competitions where BigQuery ML can be used
  1. https://www.kaggle.com/c/zillow-prize-1 (1.2 million dollar prize)
     - Predicting the price of a house
  2. https://www.kaggle.com/c/two-sigma-financial-modeling (1 million dollar prize)
     - Predicting the direction of share movement: up or down
  3. https://www.kaggle.com/c/deloitte-western-australia-rental-prices (1 million dollar prize)
     - Predicting rental prices
  4. https://www.kaggle.com/c/home-credit-default-risk (70k dollar prize)
     - Predicting whether an applicant will default or not
  5. https://www.kaggle.com/c/santander-customer-transaction-prediction (65k dollar prize)
     - Predicting whether a customer will buy or not
Different ways of doing AI
1. API
   - Ready-to-use API. No need to write a model or provide data
   - Good enough but not great
2. AutoML
   - No need to write a model. You only need to provide a subset of data, and you get better accuracy than with the API
3. Custom Model
   - Your own model, your own data
   - Complex to write
   - Different ways of writing a custom model
     - Easiest: BigQuery ML
     - Easier: Dataproc
     - Easy: Keras
     - Hard: TensorFlow
### Extra Links
- https://github.com/cncf/landscape/blob/master/README.md#trail-map
- http://comparecloud.in/
- https://landscape.cncf.io/
Non Technical
- https://www.youtube.com/watch?v=I64CQp6z0Pk&ab_channel=TED
- https://www.youtube.com/watch?v=TxxQTdYANLo&ab_channel=TEDxTalks
---
## Questions:
1. While both GCP and Red Hat OCP use Linux-ecosystem-based CNCF tools, which one is better or has an advantage over the other?
   - [ ] Every cloud has its own advantages. For example, GCP is better in scale, ML, and networking; Azure is better in ease of use; AWS in comprehensiveness. So it depends on the situation. I am not aware of where Red Hat OCP is better. But in the end, each can do the same things the other providers can.
   - [ ] **Please do try Red Hat OCP v4.5+. It has everything we need for cloud computing, be it on-premise, IaaS, PaaS, SaaS, multi-cloud, or hybrid cloud, and a best-in-class open-source Red Hat Linux platform ecosystem to support it.**
2. Why don't we use OSS like Apache HDFS/Spark for Big Data & Analytics on any cloud platform?
   - [ ] We do use them. They will be covered next.
3. Can't we use Scala as the language/library for Spark instead of PySpark?
4. How does billing work when using BigQuery?
5. Why aren't we using HiveQL for SQL or RDBMS?
6. What level of math/statistics knowledge is required for Data Science?
7. Can we get a sample of BigQuery ML from a code perspective?
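
On question 7, the following is a minimal sketch of what BigQuery ML can look like from code. It builds a `CREATE MODEL` statement and an `ML.PREDICT` query as strings and, optionally, submits them with the `google-cloud-bigquery` Python client. The model, table, and label names are hypothetical placeholders, the choice of `logistic_reg` is only an example, and `run_query` assumes the client library is installed and credentials are configured (running it will incur BigQuery charges).

```python
def create_model_sql(model: str, source_table: str, label: str) -> str:
    """Build a BigQuery ML statement that trains a logistic regression.

    All identifiers passed in are placeholders chosen by the caller.
    """
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS(model_type='logistic_reg', input_label_cols=['{label}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


def predict_sql(model: str, input_table: str) -> str:
    """Build a query that scores rows of input_table with a trained model."""
    return (
        f"SELECT * FROM ML.PREDICT(MODEL `{model}`,\n"
        f"  (SELECT * FROM `{input_table}`))"
    )


def run_query(sql: str):
    """Submit SQL to BigQuery and wait for the result (assumes credentials)."""
    # Imported lazily so the SQL builders work without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project and credentials
    return client.query(sql).result()  # blocks until the job finishes


if __name__ == "__main__":
    # Hypothetical names; replace with a real dataset before calling run_query.
    print(create_model_sql("my_dataset.my_model", "my_dataset.training", "label"))
    print(predict_sql("my_dataset.my_model", "my_dataset.new_rows"))
```

The training step is plain SQL, which is why the notes list BigQuery ML as the easiest route to a custom model: no separate training infrastructure has to be provisioned.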