Practical Data Science on AWS
Instructor: Heiwad Osman (heiwad@amazon.com)
Morning Task List
Sign up for an account at https://aws.qwiklabs.com for access to the lab environment - Please use the same email you used in your registration profile.
Sign up for an account at https://online.vitalsource.com - You may use a personal email if you prefer. Later in the morning I will email out a code (sent from no-reply@gilmore.ca) that you can use to claim your course eBooks. This is the only way to receive the ‘presentation’.
(Recommended) Get the VitalSource Bookshelf app so you can download your eBooks.
Notes
We’ll be running this class on Central Time!
Expect 9AM - 4PMish.
Lunch break from 12 - 1PM Central Time!
Synopsis:
This class is an introduction to both the data science process and basic Amazon SageMaker functionality.
Agenda:
Machine Learning intro/review
Introduction to Amazon SageMaker
Data Visualization and Analysis (in SageMaker)
Training & Evaluating Models with SageMaker
Tuning Model Hyperparameters
Deploying Models to SageMaker Endpoints
Additional topics & Features
Questions?
Resources:
How to prepare for AWS ML Exam? https://aws.amazon.com/certification/certification-prep/
https://www.aws.training/Details/eLearning?id=42183
https://developers.google.com/machine-learning/crash-course
https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
How to use the trained model artifact locally?
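import pickle as pkl
import tarfile
import xgboost as xgb  # must be installed so the pickled Booster can be loaded
# Extract the model artifact downloaded from the training job's S3 output path
with tarfile.open('model.tar.gz', 'r:gz') as t:
    t.extractall()
# Assumption: the built-in XGBoost algorithm names the extracted file 'xgboost-model'
model_file_path = 'xgboost-model'
with open(model_file_path, 'rb') as f:
    model = pkl.load(f)  # if unpickling fails on a newer container version, try xgb.Booster() and load_model(model_file_path)
# prediction with test data; dtest must be an xgb.DMatrix built from your test features
pred = model.predict(dtest)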
Recommendations for how to view/output which features the Tuned model is using for predictions?
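Load the model locally as in the previous answer, then plot a single XGBoost decision tree (requires matplotlib and graphviz):
fig, ax = plt.subplots(figsize=(20, 10))  # assumes import matplotlib.pyplot as plt and import xgboost as xgb
xgb.plot_tree(model, num_trees=4, ax=ax)  # num_trees is the index of the tree to draw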
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html
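A tree plot shows one tree at a time, so a ranked list of importance scores may answer the "which features" question more directly. The sketch below assumes the locally loaded model is an xgboost Booster, whose get_score method returns a dict of feature name to score. rank_features is a hypothetical helper, and the scores in the example are made up for illustration, not taken from the class model.

```python
def rank_features(scores, top_n=10):
    """Return (feature, score) pairs sorted from most to least important."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

# With a real model (assumption: `model` is an xgboost.Booster):
#   scores = model.get_score(importance_type="gain")
# Features that are never used in a split are absent from that dict.

# Made-up scores, for illustration only. A model trained from header-less CSV
# typically reports features by column position as f0, f1, ...
example_scores = {"f0": 12.5, "f3": 40.2, "f7": 3.1}
top_features = rank_features(example_scores, top_n=2)
print(top_features)
```

Map the f0, f1, ... names back to your column names using the column order of the training CSV (excluding the label column).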
Can you recommend a good source/training that goes through the python code we are working with?
Kaggle has great intro tutorials for Python and the libraries we used in the class
https://www.kaggle.com/learn/python
https://www.kaggle.com/learn/data-visualization
https://www.kaggle.com/learn/intro-to-machine-learning
This book is great for Python developers without an ML background.
https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow-dp-1492032646/dp/1492032646/ref=dp_ob_title_bk
How to keep learning about Amazon SageMaker?
Try our free course on edx.org https://www.edx.org/course/simplifying-machine-learning-app-development-with-amazon-sagemaker
Also, find free machine learning courses available on aws.training
https://aws.amazon.com/training/learning-paths/machine-learning/
Can we use a built-in algorithm for semantic segmentation of an image?
Yes, see the example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/semantic_segmentation_pascalvoc/semantic_segmentation_pascalvoc.ipynb
How can I keep my data inside the VPC?
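Use VPC endpoints, which are available for services such as Rekognition and SageMaker, so that calls to those services stay on the AWS network instead of going over the public internet.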
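As a rough sketch of what setting this up can look like, the snippet below builds the arguments for the EC2 create_vpc_endpoint call for the two SageMaker interface endpoints (API and runtime). The helper name interface_endpoint_params and the VPC, subnet, and security group IDs are hypothetical placeholders. The actual boto3 call is left in comments because it needs AWS credentials and creates billable resources.

```python
def interface_endpoint_params(region, service, vpc_id, subnet_ids, security_group_ids):
    """Build keyword arguments for EC2 create_vpc_endpoint (interface type)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        "PrivateDnsEnabled": True,
    }

# Placeholder IDs (hypothetical); substitute your own VPC, subnet, and security group.
params = [
    interface_endpoint_params(
        "us-east-1", svc, "vpc-EXAMPLE", ["subnet-EXAMPLE"], ["sg-EXAMPLE"]
    )
    for svc in ("sagemaker.api", "sagemaker.runtime")
]
for p in params:
    print(p["ServiceName"])

# To actually create the endpoints (requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   for p in params:
#       ec2.create_vpc_endpoint(**p)
```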
Can I process sensitive information with AWS AI services?
You can check which services are eligible to process HIPAA-protected information in announcements such as https://aws.amazon.com/blogs/machine-learning/aws-expands-hipaa-eligible-machine-learning-services-for-healthcare-customers/ and https://aws.amazon.com/about-aws/whats-new/2018/05/amazon-rekognition-achieves-hipaa-eligibility/
What do the different instance types for SageMaker cost?
https://aws.amazon.com/sagemaker/pricing/instance-types/
Do you have algorithms that can be trained ‘online’?
Some of the built-in SageMaker algorithms support incremental training. Otherwise, you can bring your own algorithm that starts from pretrained weights instead of from scratch. See https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html
How to get the latest docker image for an algorithm?
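Add repo_version="latest" to the get_image_uri call, e.g. get_image_uri(region, "xgboost", repo_version="latest"). This assumes the get_image_uri helper from sagemaker.amazon.amazon_estimator in version 1 of the SageMaker Python SDK.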
More labs and classes?
https://www.edx.org/course/amazon-sagemaker-simplifying-machine-learning-appl
amazon.qwiklabs.com
I’m still having trouble understanding Bias vs Variance intuitively. What do you have?
See the "ML Cheat sheet" for a good diagram.
This discussion has a good answer: "What is the meaning of term variance in machine learning model?"
And our AWS documentation has some simple heuristics at "Model fit: underfitting vs overfitting".
I want to understand the math for bias-variance decomposition. How is it calculated and for which algorithms does it apply?
The MLxtend Python library has functions that estimate it, and its documentation describes the bias-variance decomposition method well.
The Wikipedia article "Bias–variance tradeoff" provides some of the math derivations.
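For intuition, the decomposition can also be estimated by simulation: train the same model on many independently drawn training sets, then measure how far the average prediction is from the truth (bias squared) and how much individual predictions scatter around that average (variance). The sketch below is a toy setup of my own, not the MLxtend method. The true function x squared, the noise level, and the two models (a constant predictor and 1-nearest-neighbor) are all assumptions chosen to make the contrast visible. Error is measured against the noise-free function, so the irreducible noise term that appears in the full decomposition is left out.

```python
import math
import random
import statistics

def true_f(x):
    return x * x

def make_training_set(rng, n=20, noise_sd=0.3):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    """High-bias model: always predicts the mean training label."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_1nn(xs, ys):
    """High-variance model: predicts the label of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def decompose(fit, test_xs, trials=500, seed=0):
    """Return (bias^2, variance, mse vs. true_f), each averaged over test_xs."""
    rng = random.Random(seed)
    preds = [[] for _ in test_xs]
    for _ in range(trials):
        model = fit(*make_training_set(rng))
        for i, x in enumerate(test_xs):
            preds[i].append(model(x))
    bias_sq, variance, mse = [], [], []
    for x, p in zip(test_xs, preds):
        avg = statistics.fmean(p)
        bias_sq.append((avg - true_f(x)) ** 2)
        variance.append(statistics.fmean([(v - avg) ** 2 for v in p]))
        mse.append(statistics.fmean([(v - true_f(x)) ** 2 for v in p]))
    return statistics.fmean(bias_sq), statistics.fmean(variance), statistics.fmean(mse)

test_xs = [-0.9, -0.5, 0.0, 0.5, 0.9]
c_bias, c_var, c_mse = decompose(fit_constant, test_xs)
n_bias, n_var, n_mse = decompose(fit_1nn, test_xs)
print(f"constant: bias^2={c_bias:.4f} variance={c_var:.4f} mse={c_mse:.4f}")
print(f"1-NN:     bias^2={n_bias:.4f} variance={n_var:.4f} mse={n_mse:.4f}")
```

In this setup the constant model should show the larger bias and the 1-nearest-neighbor model the larger variance, and for each model the measured error equals bias squared plus variance up to floating-point rounding.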
How do I view the coefficients of my linear learner model?
The SageMaker Linear Learner model is saved as an MXNet model file in S3. Download model.tar.gz, untar it, and then unzip the model_algo-1 file it contains. Then load the result with MXNet as described on the AWS forums or on Stack Overflow.
See code example below.
import os
import mxnet as mx
import boto3
bucket = "<your_bucket>"
key = "<your_model_prefix>"
boto3.resource('s3').Bucket(bucket).download_file(key, 'model.tar.gz')
os.system('tar -zxvf model.tar.gz')
# Linear learner model is itself a zip file, containing a mxnet model and other metadata.
# First unzip the model.
os.system('unzip model_algo-1')
# Load the mxnet module
mod = mx.module.Module.load("mx-mod", 0)
# model's weights
mod._arg_params['fc0_weight'].asnumpy().flatten()
# model bias
mod._arg_params['fc0_bias'].asnumpy().flatten()
# Using the model for prediction
# First create a mxnet data iterator:
# https://mxnet.incubator.apache.org/tutorials/basic/data.html#reading-data-in-memory
# https://mxnet.incubator.apache.org/tutorials/basic/data.html#reading-data-from-csv-files
data_iter = create_data_iter()  # placeholder: write this to return an iterator such as mx.io.NDArrayIter over your features
# Next bind the module with the data shapes.
mod.bind(data_shapes=data_iter.provide_data)
# Predict
mod.predict(data_iter)
Why do we oversample the minority class when we have a class imbalance for classification?
You need to make sure that the learning algorithm sees enough examples of the minority class for those examples to meaningfully influence the weight updates. Otherwise the model can reach a low loss by mostly predicting the majority class. Oversampling is one technique for trying to rectify class imbalance.
See the following example notebook for more.
https://www.kaggle.com/tanlikesmath/oversampling-mnist-with-fastai
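A minimal sketch of random oversampling using only the Python standard library: minority-class rows are duplicated at random until every class has as many rows as the largest class. The function name random_oversample and the toy 95/5 dataset are hypothetical, and this is just one simple variant of the technique.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    # Shuffle so the duplicated rows are not grouped together by class.
    order = list(range(len(out_rows)))
    rng.shuffle(order)
    return [out_rows[i] for i in order], [out_labels[i] for i in order]

# Toy data: 95 majority-class rows (label 0) and 5 minority-class rows (label 1).
rows = [[float(i)] for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]
bal_rows, bal_labels = random_oversample(rows, labels)
counts = Counter(bal_labels)
print(counts)
```

Apply oversampling to the training split only. Validation and test data should keep the original class balance so the evaluation metrics reflect what the model will see in production.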
How do I use batch predictions instead of real-time endpoints?
See example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_batch_transform/introduction_to_batch_transform/batch_transform_pca_dbscan_movie_clusters.ipynb
Where can I learn more about Deep Learning on AWS?
We offer an introductory training course on deep learning models. Here is the description: https://aws.amazon.com/training/course-descriptions/deep-learning/