AWS ML::Data Collection

# AWS ML::Data Collection

[TOC]

----

Machine Learning Cycle throughout the course:
![](https://i.imgur.com/7Ppi9bL.png)

----

## Data Collection Concepts

----

### Good Data
![](https://i.imgur.com/MCASX6l.png)

----

## General Data Terminology

----

### Terminology
- dataset = input data = training/testing data
- Structured Data:
![](https://i.imgur.com/4KlQLgn.png)
- Unstructured Data:
![](https://i.imgur.com/SIMERnY.png)
- Semi-Structured Data:
![](https://i.imgur.com/L1njtIA.png)

----

### Terminology
- Data Warehouse: many sources, many formats
  - Requires cleaning before the data can be used in BI tools.
  - Petabytes or terabytes
- Data Lake: no preprocessing done
![](https://i.imgur.com/zRETROE.png)

----

### Data Repositories Summary
![](https://i.imgur.com/htciIPJ.png)

----

#### Data Types
- Labeled data
- Unlabeled data
- Audio Stream
- Social Media Stream
![](https://i.imgur.com/ctsR2kU.png)

----

#### Feature Types
- Categorical
![](https://i.imgur.com/QIsJs6f.png)
- Continuous
![](https://i.imgur.com/7AEJSvE.png)

----

#### Data by Application
- Text Data (Corpus Data)
- Ground Truth Data: trusted and labelled data
- Image Data: datasets with tagged images
- Time Series Data

---

## AWS Data Stores
> How to get our data into AWS -> S3
> Go-to place for storing ML data

---

### S3 Review
![](https://i.imgur.com/wwBvHIB.png)

----

### RDS
> Contains structured data

### DynamoDB
> Key-value pairs -> schemaless, unstructured, or semi-structured data

----

### AWS Redshift
> Data warehousing solution

![](https://i.imgur.com/56zTJah.png)

#### Redshift Spectrum
![](https://i.imgur.com/3gQAtMC.png)

### Timestream
![](https://i.imgur.com/Km6g4pQ.png)
> Fully managed DB; works with BI tools; SQL-like queries

### DocumentDB
![](https://i.imgur.com/V3T9RsO.png)
> Place to migrate MongoDB data

---

## AWS Migration Tools
- Data Pipeline
- DMS
- AWS Glue

----

### AWS Data Pipeline
> Can do ETL

![](https://i.imgur.com/kiwDg3c.png)

----

#### Activities
![](https://i.imgur.com/cL98dCN.png)

----

#### Create Pipeline
![](https://i.imgur.com/NQwCqnE.png)

----

### DMS
> Database Migration Service -> can also work with S3

![](https://i.imgur.com/2Hu46Dm.png)
- Transfers data between two RDS databases; can also output results to S3.
- A transfer can be:
  - homogeneous (MySQL to MySQL)
  - heterogeneous (MySQL to SQL Server)
- DMS supports only simple transformations, such as changing column names.

----

### AWS Glue
> Fully managed ETL Service (Touch Load today)
> Creates metadata tables within a Data Catalog

[Data classifier:](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-activities.html)
![](https://i.imgur.com/bIBC7r3.png)

----

#### AWS Glue
![](https://i.imgur.com/WqKxe2k.png)
Glue connects to data stores via a JDBC connection.

----

### Summary
![](https://i.imgur.com/DcWkpvZ.png)

---

## AWS Helper Tools

----

### EMR
![](https://i.imgur.com/ZRsUN1Y.png)
Use Data Pipeline

----

#### EMR Create Cluster
![](https://i.imgur.com/asFSBNd.png)

----

### AWS Athena
> Run SQL queries on S3

![](https://i.imgur.com/8rVPtOg.png)

----

### Redshift Spectrum vs Athena
![](https://i.imgur.com/HGecwiM.png)

---

## QUIZ

### Question 1
![](https://i.imgur.com/TqiF3yf.png)
You have been tasked with converting multiple JSON files within an S3 bucket to Apache Parquet format. Which AWS service can you use to achieve this with the LEAST amount of effort?

---

### Question 2
![](https://i.imgur.com/FzLl6eq.png)
You are an ML specialist within a large organization who needs to run SQL queries and analytics on thousands of Apache log files stored in S3. Your organization already uses Redshift as its data warehousing solution. Which tool can help you achieve this with the LEAST amount of effort?

---

### Question 3
![](https://i.imgur.com/ZWPV7DF.png)
Which Amazon service allows you to build a high-quality labeled training dataset for your machine learning models? This includes human workers, vendor companies that you choose, or an internal, private workforce.

---

### Question 4
![](https://i.imgur.com/viCL1Ax.png)
You are an ML specialist who is setting up an ML pipeline.
The amount of data you have is massive and needs to be set up and managed on a distributed system so that processing and analytics can run efficiently. You also plan to use tools like Apache Spark to process your data to get it ready for your ML pipeline. Which setup and services can most easily help you achieve this?

---

### Question 5
![](https://i.imgur.com/8lr1MhQ.png)
You are an ML specialist within a large organization who needs to run SQL queries and analytics on thousands of Apache log files stored in S3. Which set of tools can help you achieve this with the LEAST amount of effort?

---

### Question 6
![](https://i.imgur.com/5yLNcNk.png)
Your organization has given you several different sets of key-value pair JSON files that need to be used for a machine learning project within AWS. What type of data is this classified as, and where is the best place to load this data?

---

### Question 7
![](https://i.imgur.com/Tu7LWHl.png)
In general, within your dataset, what is the minimum number of observations you should have compared to the number of features?

---

### Question 8
![](https://i.imgur.com/MV9tKAn.png)
An organization needs to store a massive amount of data in AWS. The data has a key-value access pattern, developers need to run complex SQL queries and transactions, and the data has a fixed schema. Which type of data store meets all of their needs?

---

### Question 9
![](https://i.imgur.com/LfdZdO1.png)
You have been tasked with collecting thousands of PDFs for building a large corpus dataset. The data within this dataset would be considered what type of data?

---

### Question 10
![](https://i.imgur.com/jt2l8jD.png)
You have been tasked with setting up crawlers in AWS Glue to crawl different data stores to populate your organization's AWS Glue Data Catalogs. Which of the following input data stores is NOT an option when creating a crawler?
https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
![](https://i.imgur.com/dxA1smt.png)

### Question 11
![](https://i.imgur.com/FL6KIGG.png)
When you train your model in SageMaker, where does your training dataset come from?

#### ***SageMaker no longer requires your data to be in S3 for training a model***
S3, DynamoDB, Redshift, RDS - not an option

___

### Question 12
![](https://i.imgur.com/j3lli3c.png)
You are trying to set up a crawler within AWS Glue that crawls your input data in S3. For some reason, after the crawler finishes executing, it cannot determine the schema from your data and no tables are created within your AWS Glue Data Catalog. What is the reason for these results?

___

### Question 13
![](https://i.imgur.com/241HmfC.png)
You are an ML specialist within a large organization that helps job seekers find both technical and non-technical jobs. You've collected data from an engineering company's data warehouse to determine which skills qualify job seekers for different positions. After reviewing the data you realise the data is biased. Why?

___

### Question 14
![](https://i.imgur.com/7BgHSmV.png)
You are an ML specialist working with data that is stored in a distributed EMR cluster on AWS. Currently, your machine learning applications are compatible with the Apache Hive Metastore tables on EMR. You have been tasked with configuring Hive to use the AWS Glue Data Catalog as its metastore. Before you can do this you need to transfer the Apache Hive metastore tables into an AWS Glue Data Catalog. What are the steps you'll need to take to achieve this with the LEAST amount of effort?

___
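
As a small illustration of the structured vs. semi-structured distinction from the terminology slides (and of the key-value JSON files mentioned in Question 6), the following is a minimal Python sketch that flattens semi-structured JSON records into a single fixed-column table. The sample records and the `flatten` and `to_table` helpers are hypothetical, made up for illustration; they are not part of the course material or of any AWS API.

```python
import json

# Hypothetical semi-structured records: the keys vary from record to record.
RAW = """
[{"id": 1, "name": "a", "tags": {"color": "red"}},
 {"id": 2, "price": 9.5},
 {"id": 3, "name": "c", "tags": {"color": "blue", "size": "L"}}]
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining nested keys with '.'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_table(records):
    """Return (columns, rows) under one fixed schema; missing values become None."""
    flat_records = [flatten(r) for r in records]
    columns = sorted({key for r in flat_records for key in r})
    rows = [[r.get(col) for col in columns] for r in flat_records]
    return columns, rows


columns, rows = to_table(json.loads(RAW))
print(columns)
for row in rows:
    print(row)
```

The resulting table has one column per key seen anywhere in the input, which is roughly the shape a structured store expects, while the original records can stay as they are in a schemaless store.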