# Done: BDML Fundamentals - 18 March 2021

> Notes Document: is.gd/evajif or https://hackmd.io/@ajinkyakolhe112/HJWpKrxEO
> Slides: https://1drv.ms/b/s!Aq6hYeVV5o6DhstxiE85WHGWV5bISQ?e=FJCScx

---

## Notes

### Module 1

General GCP Resources
- Check the learning paths: https://cloud.google.com/training#learning-paths
- No recording allowed, but you can revisit today’s course online (4 different ways):
    1. https://www.coursera.org/learn/gcp-big-data-ml-fundamentals (audit the course and you can do it for free: https://www.classcentral.com/report/coursera-signup-for-free/) or on Pluralsight
    2. https://cloudonair.withgoogle.com/events/apac-gcp-fundamentals-series
    3. https://cloudonair.withgoogle.com/events/cloud-onboard-data-fundamentals
    4. https://www.youtube.com/playlist?list=PLY7sQ59Bufns3VafkhnHpbdbGBrTxSXwi
- https://github.com/gregsramblings/google-cloud-4-words

Key Summary of Important Terms
- What is Cloud Computing
    - Cloud computing is an abstraction over infrastructure for compute and storage.
    - Cloud is a place like Amazon, where you shop for the compute and storage you need.
    - You rent the hardware for compute (like CPUs) and the hardware for storage (like HDDs, SSDs or databases).
    - Pay per use. When you are done using something, you simply stop it, and then you no longer pay for it.
        - If I need 5000 machines for 2 minutes, I start them, use them, and delete them afterwards.
    - The cloud service provider maintains all the hardware; you simply rent infrastructure as and when needed.
    - Cloud offers an easy way to scale, is more secure and is cost effective (cost effective if done right and if maintained and monitored properly).
    - Using cloud is simple, but not necessarily easy.
    - You can focus just on developing software and let the cloud provider take care of the infrastructure.
- IaaS, PaaS
    - Serverless
        1. When you don't have to manage the servers
    - Fully Managed (Serverless)
        1. Every single aspect of the product is managed by the service provider
    - Managed Product
        1. Part of the product is managed by the service provider
    - Infrastructure as a Service (product is managed completely by you)
    - https://www.episerver.com/articles/pizza-as-a-service and https://www.bmc.com/blogs/saas-vs-paas-vs-iaas-whats-the-difference-and-how-to-choose/
    - Examples of the above
        1. On Prem: Your hardware and you manage everything
            1. E.g. you & your kitchen. You are cooking, cleaning, preparing the ingredients, every single aspect.
        2. Vendor: Your hardware and the vendor manages everything
            1. E.g. a cook you have hired at your own home. Your kitchen, but it's managed by the cook.
        3. Cloud but Infrastructure as a Service: Hardware is on the cloud but you manage everything
            1. E.g. you book an Airbnb/OYO which has a kitchen. You cook and manage everything, but in the Airbnb's kitchen.
            2. The hardware is not yours. You are renting it.
        4. Cloud but Managed Product: Part of the management is done by you and the rest is done by the cloud provider
            1. Airbnb/OYO kitchen (rented) and cooking (ready-to-eat food)
        5. Cloud but Fully Managed: The entire management is done by the cloud provider, nothing is needed from you.
            1. Going to a restaurant to eat. You neither own the kitchen nor cook the food.
        6. The consequence of this is Migration
            1. On prem to cloud as is: "as is" migration, Lift & Shift or Rehosting migration (means on cloud but Infrastructure as a Service)
            2. Optimize for cloud. (Use a Managed Product)
            3. Rewrite for cloud native. (Use a Fully Managed Product)
    - https://s7280.pcdn.co/wp-content/uploads/2017/09/saas-vs-paas-vs-iaas.png
    - Cloud can be used as IaaS or PaaS.
- Managed Product & Fully Managed Product
    - Managed Product
        - Example: Cloud SQL.
            - Backups, checkpointing, logging and networking are taken care of by Google
            - Or I can install MySQL on a Compute Engine instance and then do everything myself
        - Dataproc: you need to create a cluster
            - Needs to be scaled manually.
            - But in some cases, autoscale capability has been added as an extra feature, as an afterthought
    - Fully Managed Product / Serverless Product
        - Dataflow or Pub/Sub
        - Almost always autoscale
- Scale
    - Autoscale
        - Resources change according to demand automatically, increasing or decreasing without user intervention
        - E.g. the Room of Requirement from Harry Potter
    - Scale
        - You have to execute a command that resizes the resources up or down. This is not automatic.
        - E.g. Ant-Man: red shrinks and blue enlarges. This is scale, not autoscale. It has to be triggered; it doesn’t happen automatically.
    - Two types of scaling
        - Vertical Scaling
            - Increased demand is met by increasing the power of the same machine
            - Single-machine computing
        - Horizontal Scaling
            - Increased demand is met by increasing the number of machines working together
            - All distributed computing products scale horizontally
                - Dataproc
                - Cloud SQL does not scale horizontally for writes because it's not distributed computing
                - Cloud SQL can only scale vertically for writes
                - Spanner is distributed computing, hence it scales horizontally
            - This is the sustainable way, because there is an upper limit to vertical scaling and it’s very expensive to build more powerful machines
- Migration: the journey from on prem to cloud
    - Migration has 3 stages. You migrate data & compute to the cloud.
        1. Migrate as is. No rewrite. Just move the existing application from on premise to the cloud
        2. A bit of rewrite to improve the performance on the cloud
        3. Complete rewrite using cloud native products or Kubernetes-based products
            1. Effort of rewriting vs effort of migrating
                1. Airline industry
                2. If it's working, don't touch it.
                3. Modernization means moving to the cloud
                    1. Rewrite in cloud native

Types of Jobs on Cloud
1. Migration. We move the original codebase from on prem to the cloud
    - Google Cloud has a partner ecosystem and partners are heavily involved in migration.
    - Migration Jobs
        1. Step 1: Move the application as is (Lift & Shift). We are simply running the same application in a different place
            1. Most demand is for Cloud Enabled people
        2. Step 2: Improve the application to take advantage of cloud products.
        3. Step 3: (Cloud Native / Serverless) Rewrite everything for cloud native. No maintenance required
        4. Many guides can be found here: https://cloud.google.com/docs/tutorials/
2. Cloud Native Development or new application: generally startups do this
    - Cloud Architect: designs & develops the system
        - Design considerations for scale, for security and for future growth are all part of the architect’s job
    - Cloud Developer: just develops the system
    - Architect vs Civil Engineer
        - The civil engineer builds the building according to the blueprint.
        - The architect designs the blueprint considering the feasibility of the building.
        - Movie reference: the dream designer in Inception is called the Architect
3. Feature addition on a pre-existing cloud-hosted product
4. Maintenance of the cloud-hosted product

Evolution to Cloud
1. First On Prem, either a server or a data center
    - My infra, I buy it, I manage it
2. Then On Cloud. (The first migration is called lift & shift)
    - Someone else’s infra
    - Renting, not buying. But renting gives flexibility
    - House: rent vs buy?
    - Renting gives us the ability to change
3. Then Cloud Native / Serverless / Fully Managed
    - Someone else’s infra, but completely automatically managed
    - Iron Man? Mark I vs Mark 85 in Endgame. Nanotech.

Migration to Cloud
1. Lift & Shift. Move as is
2. Optimize for Cloud. Tiny rewrite to improve the performance
3. Complete rewrite for cloud native, or containerize for hybrid cloud via containers

Different roles in Data Team
Hypothetical Scenario
- 1x or 2x Infra (Physical, IaaS or Cloud)
- 1x DevOps (Stack automation, Containers and Platform as a Service)
- 2x or 3x Data Engineer (Data Pipelines, Data Automation, Data as a Service, Data Ingestion)
- 2x Analytics (1x Batch Analytics, 1x Real-Time Analytics and Predictive APIs)
- 1x AI and ML Data Scientist (Machine Learning and AI algorithms)
- 1x Front-End Dev (Web and JS developer, web and mobile apps)
- Specialized Roles
    - 1x Network Architect
    - 1x Security Engineer
    - 1x Data Viz Developer

OnPrem vs Cloud vs Serverless Cloud
- OnPrem
    - User configured, user managed and user maintained
- Cloud
    - User configured, provider managed and provider maintained
    - Different Ways of Using Cloud
        - (On prem is user bought, user configured & user maintained)
        - Infrastructure as a Service: user configured, user maintained & provider provided
        - Managed Product (partially managed by us and partially by the provider): user configured, provider managed & maintained, but some work is still needed from the user
        - Fully Managed / Serverless: everything is done by the provider. The user just codes
    - Restaurant example
        - IaaS: You cook in the restaurant.
        - PaaS: Buffet, self service
        - Serverless: The waiter serves you prepared food
        - https://www.episerver.com/articles/pizza-as-a-service & https://www.bmc.com/blogs/saas-vs-paas-vs-iaas-whats-the-difference-and-how-to-choose/
- Serverless Cloud
    - Fully automated and no configuration required.

Big Data & ML Fundamentals: which products are IaaS, Managed & Fully Managed
- Infrastructure as a Service (fully unmanaged by the provider): Compute Engine
- Managed Products: Cloud SQL, Dataproc, Bigtable, Spanner
- Fully Managed Products: BigQuery, Dataflow, Pub/Sub, Cloud Storage, Datastore

Data Engineer (https://awesomedataengineering.com/ & https://github.com/igorbarinov/awesome-data-engineering & https://github.com/datastacktv/data-engineer-roadmap)
- Programming
- Database
    - SQL & NoSQL
- Data Warehouse / Analytics & Data Lake
- ETL / Data Processing Pipelines
    - Batch & Stream

Big Data
- The data analysed is just 1% of all data produced, thus there is huge potential for growth in this field
- Cloud makes analyzing data at such a large scale easier compared to on prem systems

Data Scientist vs Data Engineer
- Overlapping skillset but different specialisation
- Data Science is overcrowded; if you are really good then go for it. Otherwise choose Data Engineering if interested.
- The demand ratio (Data Engineer : Data Scientist) is 3:1, whereas the supply ratio is 1:3.
- Focus on less crowded areas, otherwise be the best in the crowd.

Database vs Data warehouse vs Data Lake
- Database: Place to save your data, whether big data or not. You read & write to it. Used for OLTP (online transaction processing)
- Data warehouse: Data storage for big data, optimized for analytical read queries. Most of the queries are read only. Used for OLAP (online analytical processing)
- Data Lake: Data storage for big data, optimized just for storage

Google Cloud different products
- Compute
    - Custom hardware: TPU, best choice for ML workloads
    - A TPU is an ASIC (application-specific integrated circuit); per the session notes, ASICs are also used by Red Hat OpenShift Container Platform (OCP) & IBM CP4A.
- Storage
- Networking
    - Networking in Google Cloud is powerful
    - Private fiber optic cables give Google Cloud better networking with high bandwidth
- Security
    - Security is a shared responsibility
    - IBM Security Services, a.k.a. Managed Security Services (M.S.S), is a global market leader in the web and cyber security space
- Big Data & ML Product

Summary:
- Explosion or growth of data
- Cloud playing an important role in Big Data
- Data Scientists vs Data Engineers (demand and supply)
- Big Data & Analytics + Cloud Migration/Modernisation
- With the advent of technologies like Kubernetes, Docker containers and Red Hat OpenShift powered with IBM CP4A, high availability is achievable.

### Module 2

Storage: Apache HDFS/Amazon S3/Google Cloud Storage

Processing: Hadoop/MapReduce/Spark/YARN/HiveQL/Pig Latin

Recommendation is part of AI

Any AI needs 3 things
1. Data
    - Can be saved in databases, SQL or NoSQL
    - Can be saved in a data lake or data warehouse
    - Data needs to be processed to bring it into the format needed for the ML model
2. Model
    - Many languages and frameworks can be used to write an ML model: TensorFlow, PyTorch, MLlib, scikit-learn, XGBoost, BigQuery ML, R, …
3. Infrastructure to run the model

Products
1. Dataproc: Managed Hadoop. Nothing but Hadoop on GCP. (The equivalent is EMR/Elastic MapReduce on AWS and HDInsight on Azure)
    - Cluster: A group of machines that work together in parallel. The work is divided and done in parallel. That is the basis of big data
    - Migrate on prem Hadoop to Dataproc on the cloud
        - https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-jobs
        - https://cloud.google.com/solutions/migration/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc
        - https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-overview
2. Cloud SQL: Managed RDBMS (MySQL, SQL Server and PostgreSQL)
    - OLTP
    - Migrate on prem SQL to Cloud SQL
    - Migrate Oracle to Cloud SQL: https://cloud.google.com/solutions/migrating-data-from-oracle-to-cloud-sql-for-mysql or https://cloud.google.com/solutions/migrating-mysql-to-cloudsql-concept
    - Migrate others

BigQuery
- https://cloud.google.com/blog/products/bigquery/anatomy-of-a-bigquery-query
- SQL query
    >select language, sum(views) views
    >from `bigquery-samples.wikipedia_benchmark.Wiki100B`
    >where REGEXP_CONTAINS(title,".*o.*o.*o.*")
    >group by language
    >order by views desc
- https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview
- BigQuery ML
    - https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create
- BigQuery
    - https://www.qwiklabs.com/quests/147?parent=catalog
    - https://www.qwiklabs.com/focuses/3460?parent=catalog
- BigQuery ML
    - https://www.qwiklabs.com/focuses/1797?parent=catalog
    - https://www.qwiklabs.com/focuses/16547?parent=catalog
    - https://www.qwiklabs.com/focuses/14294?parent=catalog
- Other competitions where BigQuery ML can be used
    1. https://www.kaggle.com/c/zillow-prize-1 (1.2 million dollar prize)
        - Predicting the price of a house
    2. https://www.kaggle.com/c/two-sigma-financial-modeling (1 million dollar prize)
        - Predicting the direction of share movement: up or down
    3. https://www.kaggle.com/c/deloitte-western-australia-rental-prices (1 million dollar prize)
        - Predicting rental price
    4. https://www.kaggle.com/c/home-credit-default-risk (70k dollar prize)
        - Predicting default or not
    5. https://www.kaggle.com/c/santander-customer-transaction-prediction (65k dollar prize)
        - Predicting whether a customer will buy or not

---

## Questions
- [ ] What is OLTP?
- [ ] Can a Finance Guy be a Data Engineer?
- [ ] Can a cluster in Google Dataproc be increased dynamically?
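
On the open OLTP question: a minimal sketch of the OLTP vs OLAP split from the "Database vs Data warehouse vs Data Lake" notes. It uses Python's built-in `sqlite3` only as a local stand-in (it is neither Cloud SQL nor BigQuery), and the `page_views` table, its columns and the sample rows are hypothetical, made up for illustration. OLTP-style work is small reads and writes on individual rows inside transactions; OLAP-style work is a read-only aggregate over many rows, similar in shape to the BigQuery query in Module 2.

```python
import sqlite3

# In-memory database: a local stand-in, not a real Cloud SQL or BigQuery instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_views ("
    "id INTEGER PRIMARY KEY, language TEXT, views INTEGER)"
)

# OLTP-style work: small writes that touch individual rows, inside a transaction.
with conn:  # commits on success, rolls back if an exception is raised
    conn.executemany(
        "INSERT INTO page_views (language, views) VALUES (?, ?)",
        [("en", 120), ("en", 30), ("de", 45), ("fr", 10), ("de", 5)],
    )
    # Update exactly one row, looked up by its primary key.
    conn.execute("UPDATE page_views SET views = views + 1 WHERE id = ?", (4,))

# OLAP-style work: a read-only aggregate that scans every row.
olap_rows = conn.execute(
    "SELECT language, SUM(views) AS total_views "
    "FROM page_views GROUP BY language ORDER BY total_views DESC"
).fetchall()

print(olap_rows)  # [('en', 150), ('de', 50), ('fr', 11)]
```

The same SQL shapes apply at larger scale: the row-level inserts and updates are what a managed RDBMS is tuned for, while the grouped scan is the kind of query a data warehouse is optimized to run.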