# Done: Big Data & ML on GCP Fundamentals - 27 Jan
> - Evernote Extra Resources Document: https://www.evernote.com/shard/s295/sh/f4201e7e-ce4f-0b9d-9cff-65ae77d41810/f8474443911872cb3f4a3344fb4dfbfd or is.gd/epavav
> - Collaborative Summary + Questions: https://hackmd.io/@Su5pOyXqRBq3U0yoTEpqlA/HkZcSNSyd or is.gd/atuyoy
---
## Collaborative Class Notes
### Big Data
- The global supply of data is expected to keep more than doubling every two years. **Big Data** typically means data volumes in the **petabyte (PB)** range.
- We need systems to store, process and analyze data at that scale.
- **Store**: NoSQL Databases & Data Warehouses (Database person).
- **Process**: ETL Tools (Data Engineer) (https://cloud.google.com/training/data-ml#data-engineer-learning-path)
- **Analyze**: Data Warehouse. (Data Analyst) (https://cloud.google.com/training/data-ml#data-analyst-learning-path)
- **Extract Intelligence**: AI on Big Data (Data Scientist) (https://cloud.google.com/training/machinelearning-ai)
#### Evolution of data:
- **Structured Data:** SQL
- **Live Data:** IoT data. The first mover was TIBCO.
- **Unstructured and big Data:** Explosion of Internet, smartphones & Social Media.
#### Why do we need cloud?
- Big Data needs big infrastructure.
- Big infrastructure needs a lot of maintenance.
- Cloud simplifies maintenance.
- That is why we use cloud computing.
- So most legacy software is being moved from on-prem to the cloud. This is called migration.
- Cloud allows scaling to more users.
- It provides high availability.
- It allows resources to scale.
#### Migration
1. **Step 1:** Move the application as is (Lift & Shift). We are simply running the same application in a different place.
1. Most demand is for cloud-enabled people.
3. **Step 2:** Improve the application to take advantage of cloud products.
4. **Step 3:** (Cloud Native / Serverless) Rewrite everything to be cloud native. No maintenance required.
5. Many guides can be found here: https://cloud.google.com/docs/tutorials
**Pricing on Cloud is all about rent.**
It's a pay-per-use policy: you pay for resources as long as they are on. After you shut them down, you stop paying for them.
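
A minimal sketch of the pay-per-use idea, assuming a made-up hourly rate (`HOURLY_RATE_USD` is hypothetical, not a real GCP price) and a hypothetical helper `usage_cost`. It only shows that the bill follows the hours a resource is switched on.

```python
# Hypothetical hourly rate, for illustration only; not a real GCP price.
HOURLY_RATE_USD = 0.10


def usage_cost(hours_on: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Bill only the hours during which the resource was running."""
    return round(hours_on * hourly_rate, 2)


# A VM used 8 hours a day on 20 working days, shut down the rest of the time.
office_hours_cost = usage_cost(8 * 20)

# The same VM left on for a full 30-day month.
always_on_cost = usage_cost(24 * 30)

print(office_hours_cost, always_on_cost)
```

Shutting the machine down outside working hours cuts the bill in proportion to the hours saved, which is the "rent" model described above.
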
### Google Cloud Products
1. Compute Engine
1. Virtual Machine: You start a virtual machine, configure it with all the software you want, and then start using it.
2. Google Cloud Storage
- 99.999999999% (11 nines) durability
- High I/O speed, though not extremely high
- For extremely high-frequency operations use SSD, not Google Cloud Storage
- Used by Coca-Cola, Spotify, etc.
- Only limitation: a single file cannot be larger than 5 TB.
- Storage capacity grows automatically; the user does not need to manage it at all.
- Detailed Notes here: https://www.evernote.com/shard/s295/sh/4eae1b7f-4b9a-4ea8-98ed-c573bbce2a7a/082acabe8536aba84cb82999152e3f04
3. Networking
- There is a separate course focusing on networking.
- Google has its own private network with high bandwidth and low latency, which makes it a good choice for Big Data or content-delivery applications.
4. Cloud SQL: Fully managed RDBMS (MySQL, SQL Server and PostgreSQL)
- OLTP
- Migrate on-prem SQL to Cloud SQL
- Migrate Oracle to Cloud SQL: https://cloud.google.com/solutions/migrating-data-from-oracle-to-cloud-sql-for-mysql or https://cloud.google.com/solutions/migrating-mysql-to-cloudsql-concept
- Migrate others
5. Dataproc: Essentially Hadoop on GCP (the equivalents are EMR / Elastic MapReduce on AWS and HDInsight on Azure).
- Cluster: A group of machines that work together in parallel. The work is divided up and done in parallel; that is the basis of Big Data.
- Migrate on-prem Hadoop to Dataproc on the cloud
- https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-jobs
- https://cloud.google.com/solutions/migration/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc
- https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-overview
6. BigQuery:
- Petabyte-scale, fully managed data warehouse (simple storage & analysis), OLAP
- Serverless, flexible pricing
- Foundation of BI and AI: Big Data analysis using SQL
- https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
- Best for large-scale data analysis.
- Also has machine learning capabilities. AWS has only just started to support ML via SQL.
- Ideal for Big Data analysis (read queries). Supports streaming inserts. Not highly optimized for updates and deletes.
7. BigQuery ML / BQML
- ML using SQL syntax; very simple
- 11 algorithms - cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create
8. Pub/Sub
- Reliable, real-time messaging, used in microservices and asynchronous communication
- Supports millions of messages per user (with millions of users using the product at the same time)
- In WhatsApp, the WhatsApp server is the middleman between sender and receiver. Similarly, Pub/Sub is the middleman between all senders and receivers.
- Messages are communicated through Pub/Sub, and messages from Pub/Sub are generally read and processed by Dataflow.
- https://medium.com/teads-engineering/give-meaning-to-100-billion-analytics-events-a-day-d6ba09aa8f44
9. Dataflow
- For Extract-Transform-Load (ETL) data pipelines
- Build data pipelines that process, transport and transform data
- Serverless: no need to create and scale a cluster. It happens automatically.
- Same code for both batch and streaming data
- Dataproc when a Hadoop pipeline already exists vs. Dataflow when creating a new one
10. Data Visualisation: Data Studio or Looker, native Google Cloud products for visualization
- BigQuery has connectors for most other dashboarding tools, so one isn't forced to use Data Studio or Looker.
11. ML
- Pre-trained model
- Vision, Translation, Speech APIs
- AutoML - build a custom model without code (10 - 1000 examples)
- Custom model
- BigQuery ML -> Spark ML -> TensorFlow
- Details
1. Pretrained APIs: Google's Data, Google's Model
- Extremely easy to use.
- Like clothes bought from store. Immediately ready for use.
- But not very powerful for specific requirements. Work well for general cases
2. AutoML: Google's Model, Your own Subset of Data
- Requires a minimum of 10 instances per label, up to 1000 instances
- More powerful than the APIs. It's like customizing an API for your own case by training it on your own data.
- It's like altering clothes bought from the store: the clothes are pre-made but altered for you.
3. Custom Model: Your own model, your own Data
- Extremely complex to build
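
The "middleman" role of Pub/Sub described in the product list above can be sketched in plain Python. This is a toy, in-memory stand-in and NOT the Cloud Pub/Sub client library: `InMemoryBroker`, the `clicks` topic and the handlers are all hypothetical names, and delivery here is synchronous, whereas the real service is asynchronous and distributed. The subscriber that transforms each message plays the part the notes assign to Dataflow.

```python
from collections import defaultdict
from typing import Callable


class InMemoryBroker:
    """Toy message broker: senders and receivers only know the topic name."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        """Register a receiver for a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver a message to every receiver; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = InMemoryBroker()
processed: list[str] = []

# Receiver 1: a processing step (the role Dataflow usually plays).
broker.subscribe("clicks", lambda msg: processed.append(msg.upper()))
# Receiver 2: an independent consumer of the same messages.
broker.subscribe("clicks", lambda msg: print("audit log:", msg))

# Senders publish to the topic without knowing who is listening.
broker.publish("clicks", "user-1 opened page")
broker.publish("clicks", "user-2 opened page")
print(processed)
```
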
---
### Artificial Intelligence
Any AI needs
1. Data
- Data could be a relational database stored in Cloud SQL
- Data could be a NoSQL database stored in other services
2. Model written using a programming language
- Spark or BigQuery ML or scikit-learn or TensorFlow or PyTorch or APIs or AutoML or many others
3. Infrastructure
- Created and managed by us
- Or created and managed by Kubernetes
- Or created and managed automatically by Cloud
AI: learn from data, not from predefined rules.
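
A small sketch of the difference, using made-up labelled examples (the data, `rule_based` and `learn_threshold` are all hypothetical). The first function hard-codes a cut-off chosen by a person; the second picks the cut-off that makes the fewest mistakes on the data.

```python
# Hypothetical labelled examples: (hours of usage per day, is_heavy_user)
examples = [(0.5, False), (1.0, False), (1.5, False),
            (3.0, True), (4.0, True), (5.5, True)]


def rule_based(hours: float) -> bool:
    """Defined rule: a person chose the cut-off of 2.0 by hand."""
    return hours > 2.0


def learn_threshold(data: list[tuple[float, bool]]) -> float:
    """Learned rule: pick the cut-off with the fewest errors on the data."""
    values = sorted(hours for hours, _ in data)
    # Candidate cut-offs: midpoints between neighbouring observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold: float) -> int:
        return sum((hours > threshold) != label for hours, label in data)

    return min(candidates, key=errors)


threshold = learn_threshold(examples)
print("learned cut-off:", threshold)
print("rule says:", rule_based(2.1), "| learned model says:", 2.1 > threshold)
```
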
Different Types of AI
- Recommendation
- Prediction
- Machine Learning
- Deep Learning
Stages in Machine Learning:
- Collect Data
- Store Data
- Organize Data
- Data Preparation
- Data Visualization
- Create Model
- Experiment
- Result
- For detailed information check https://www.evernote.com/shard/s295/sh/f4d591ac-30da-4941-b297-99e9f7bb3f29/a5f0ccda4de091723a51a55556522245
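
The stages above can be walked through end to end on a toy problem. This is only a sketch under stated assumptions: the CSV data is invented, the "model" is a straight line fitted by least squares using the standard library, and visualization is reduced to a printed summary.

```python
import csv
import io
from statistics import mean

# Collect: hypothetical raw data (ad spend vs. sales); one value is missing.
RAW_CSV = """spend,sales
1,3
2,5
3,
4,9
5,11
6,13
"""

# Store + Organize: keep the raw text, then parse it into rows with named columns.
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# Data Preparation: drop incomplete rows and convert strings to numbers.
clean = [(float(r["spend"]), float(r["sales"]))
         for r in rows if r["spend"] and r["sales"]]

# Data Visualization (text stand-in): a quick summary of what we have.
print(f"{len(clean)} usable rows, mean sales = {mean(y for _, y in clean)}")

# Hold out the last row so the experiment uses data the model has not seen.
train, test = clean[:-1], clean[-1:]


# Create Model: fit y = slope * x + intercept by least squares.
def fit_line(points: list[tuple[float, float]]) -> tuple[float, float]:
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx


slope, intercept = fit_line(train)

# Experiment: measure the error on the held-out row.
errors = [abs((slope * x + intercept) - y) for x, y in test]

# Result: report the fitted model and its error.
print(f"sales = {slope} * spend + {intercept}, held-out error = {max(errors)}")
```
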
---
## Questions
- [x] Do we have mock tests available for Google Cloud Certification? If yes, please do give us the link for the same.
- [ ] ANS: Official mock tests: https://cloud.google.com/certification/sample-questions/cloud-architect & https://cloud.google.com/certification/sample-questions/data-engineer. I wouldn't recommend https://gcp-examquestions.com/. Focus on understanding the concepts; question banks like that one won't be helpful in the exam.
- [x] Does data visualization come under the data scientist role?
- [ ] ANS: Yes and under Business Intelligence as well
- [x] Is it necessary to master Bigdata stuff before we learn GCP Data stuff?
- [ ] ANS: No
- [x] In which migration step does moving from a monolith to microservices fall?
- [ ] ANS: Application modernization using the Strangler Fig pattern. Check the pattern details here https://docs.microsoft.com/en-us/azure/architecture/patterns/strangler-fig
- [x] Does https://hackmd.io/ use GS ?
- [ ] ANS: No. GS isn't great for collaborative file editing. For that, Firebase might be better.
- [X] Do Microsoft/AWS offer private internet as well?
- [ ] ANS: Yes, they both have started: AWS in 2018 and Azure in 2014.
- [x] What are the challenges in introducing databases, e.g. Oracle, on Cloud SQL? For example, SQL Server was not available before but it is now:
- [ ] ANS: Migration support available. https://cloud.google.com/solutions/migrate-oracle-workloads
- [X] Do we have a backup of the data stored in Cloud SQL, like we do in Hadoop? Is this done automatically, or does the user need to do it?
- ANS: It is done automatically.
- [x] Can you please share resources for the GKE and Kubernetes?
- [ ] ANS: I don't have them compiled. Not my area of expertise
- [x] Can we say that a data warehouse is equivalent to using a NoSQL database like Cassandra in the background?
- [ ] If you optimize a NoSQL database to its maximum for ease of querying, you get a data warehouse.
- [x] Can you please share the extra resources for data visualization in BigQuery?
- [ ] Check the Data to Insights course. It covers BigQuery in depth.
- [x] Do you have study material for BigQuery database administration? The idea is to understand BigQuery from a DBA perspective. Or is a DBA even necessary for BQ, given that many activities are automated in BQ?
- [ ] Check the Data to Insights course. It tells you everything you need to know about BigQuery. Many DBA activities are indeed automated in BigQuery. That course goes into detail on everything about BQ.
- [x] Do you have any resources which will help us understand the BQ architecture, meaning components and their function, query processing, execution steps generation, etc.
- [ ] Data to Analyst course
- [x] Regarding the search using the "Lake" keyword: are all the photos tagged based on image content when they are uploaded, or does it start predicting based on the search keyword?
- [ ] Good question. They are all tagged when the photo is uploaded. When you search, those tags are searched instead of every image. :)
- [x] Can we acquire skill badges and do labs for free?
- [ ] Yes. Check the free Qwiklabs section in Evernote.
- [x] I am an MBA grad and I wish to do the Google Cloud machine learning certification. Should I pursue any other certification prior to this one?
- [ ] It will be useful, but don't learn it in order to do the modelling yourself. In the data science field, business analysts and data science managers are in short supply. Learn whatever is needed for those roles.
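
The answer to the photo-search question above (tag at upload time, then search the tags rather than the images) can be sketched as a small tag index. Everything here is hypothetical: `detect_tags` is a stand-in for an image-labelling model and simply returns hard-coded labels.

```python
from collections import defaultdict


def detect_tags(photo_name: str) -> set[str]:
    """Hypothetical stand-in for an image-labelling model."""
    fake_labels = {
        "holiday1.jpg": {"lake", "mountain"},
        "holiday2.jpg": {"beach"},
        "picnic.jpg": {"lake", "people"},
    }
    return fake_labels.get(photo_name, set())


# tag -> names of the photos carrying that tag
tag_index: dict[str, set[str]] = defaultdict(set)


def upload(photo_name: str) -> None:
    """Tagging happens once, at upload time."""
    for tag in detect_tags(photo_name):
        tag_index[tag].add(photo_name)


def search(keyword: str) -> list[str]:
    """Searching only looks up the tag; no image is analysed again."""
    return sorted(tag_index.get(keyword.lower(), set()))


for name in ("holiday1.jpg", "holiday2.jpg", "picnic.jpg"):
    upload(name)

print(search("Lake"))
```

The expensive step (running the model) is paid once per photo, so each search is a cheap dictionary lookup.
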