# Design and Implementation of a Machine Learning Platform
# Course
This is a proposal for a 3-day mini-course of intensive lectures and hands-on labs at the NCTU Computer Science department in December 2019.
The purpose of the course is to demonstrate the architecture of a machine learning platform through lectures and hands-on coding sessions. By the end of the course, students should understand how a production machine learning project is run and how to leverage open source software: not just using it, but modifying it to fit their needs.
For researchers, machine learning systems is a new research area at the intersection of ML and systems (see SysML 2018), so a general understanding of machine learning systems helps build a foundation for future research. For students, a grasp of practical machine learning systems helps prepare them for a future career.
# What this is not for
This course does not teach machine learning algorithms.

# Intro
In industry, machine learning is not a research exercise but an engineering project. A machine learning project is a collaboration between multiple professionals, including data scientists and data engineers.
A machine learning system, also known as a machine learning platform or machine learning infrastructure, is a system that connects the stages of the machine learning development cycle: data ingestion, data preparation, feature extraction, model development, model training, model deployment and model serving. It is a generic system that can support a variety of machine learning and data applications. Many big companies build their own machine learning platforms, but most of them are composed of the stages listed above.
In this course, Apache Submarine is used as an example to illustrate the components required for performing these machine learning development stages.
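To make the development stages described above concrete, here is a minimal sketch in plain Python that chains them as functions passing a shared context. It is an illustration only and does not use or represent the Apache Submarine API. All function names, the toy dataset, and the threshold "model" are hypothetical, and model development and training are collapsed into a single `train` step for brevity.

```python
from typing import Any, Callable, Dict, List

# A stage takes the shared pipeline context and returns it, updated.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def ingest(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Data ingestion: hypothetical raw records of (hours_of_use, failed).
    ctx["raw"] = [(1.0, 0), (2.0, 0), (None, 1), (8.0, 1), (9.0, 1)]
    return ctx


def prepare(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Data preparation: drop records with a missing measurement.
    ctx["clean"] = [(x, y) for x, y in ctx["raw"] if x is not None]
    return ctx


def extract_features(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Feature extraction: min-max scale the single input to [0, 1].
    xs = [x for x, _ in ctx["clean"]]
    lo, hi = min(xs), max(xs)
    ctx["features"] = [((x - lo) / (hi - lo), y) for x, y in ctx["clean"]]
    return ctx


def train(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Model training: a toy classifier whose threshold sits halfway
    # between the mean feature value of each class.
    pos = [x for x, y in ctx["features"] if y == 1]
    neg = [x for x, y in ctx["features"] if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    ctx["model"] = {"threshold": threshold}
    return ctx


def deploy(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Model deployment and serving: expose a predict function that
    # expects an already-scaled feature value.
    threshold = ctx["model"]["threshold"]
    ctx["predict"] = lambda x: int(x >= threshold)
    return ctx


def run_pipeline(stages: List[Stage]) -> Dict[str, Any]:
    # Run each stage in order, threading the context through.
    ctx: Dict[str, Any] = {}
    for stage in stages:
        ctx = stage(ctx)
    return ctx


result = run_pipeline([ingest, prepare, extract_features, train, deploy])
print(result["model"], result["predict"](0.9), result["predict"](0.1))
```

A real platform replaces each function with a separately scheduled, monitored and versioned component, but the ordering and the hand-off of artifacts between stages follow the same shape.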
# Program
Three days of intensive lectures and hands-on coding sessions.
## Lectures
* **Day 1: Course overview [[doc](https://hackmd.io/@submarine/BJNbIL2CS)] 0.5hr**
* **Day 1: Machine learning project development process [[doc](https://hackmd.io/@submarine/r1EB2830B)] 1hr (Lecture 1)**
    * Data ingestion, data preparation, feature extraction, model development, model training, model deployment and model serving (1hr)
    * Case studies: ML development process at Google, LinkedIn, Twitter, Airbnb, Netflix, Uber … (1hr)
        * Google: Hidden Technical Debt in Machine Learning Systems
        * LinkedIn Productive ML initiative and TensorFlow on YARN
        * Twitter Cortex
        * Airbnb Bighead
        * Netflix Machine Learning Infra (Metaflow)
        * Uber Michelangelo
* **Day 1: What is a Machine Learning Platform [[doc](https://hackmd.io/@submarine/H1134Dh0H)] 1.5hr (Lecture 2)**
    * Why do you need a system at all?
        * Reduce time/effort to develop ML products, simplify workflow
        * Repeatable process
        * Support a variety of ML frameworks, users
        * Easier to evaluate models
        * Industrialization of AI / Productive ML / Productionizing ML...
    * Why not just a notebook, like Jupyter?
    * Why not a data visualization/BI tool?
    * What is available in the market
        * Open source: Submarine, MLflow, Kubeflow, TFX
        * Commercial: CDSW, SageMaker, Azure Machine Learning Studio, H2O.ai, SAS, RapidMiner, Dataiku, IBM DSX...
    * Submarine
        * Big data, distributed GPU training
        * Algorithm development, model batch training, model incremental training, model online services and model management
* **Day 1: Open source development [[doc](https://hackmd.io/@submarine/B1-IdP3AH)] 1hr (Lecture 3)**
    * Leverage and participate in open source project communities
* **Day 2: Data Platform [[doc](https://hackmd.io/@submarine/HktJKPn0B)] 2hr (Lecture 4)**
    * Data collection
    * Data preprocessing
    * Data wrangling
    * BI, data verification tool, ETL tool
    * Data/ML open source ecosystem
        * Landscape of data analytics (1hr - 2hr)
            * Hadoop, Spark, Hive, Kafka, Flink, …
            * ML: Apache Mahout, Spark MLlib, H2O, SystemML, scikit-learn
    * Security, data privacy
* **Day 2: Feature extraction 0.5hr (Lecture 5)**
    * (TBD) [[doc](https://hackmd.io/@submarine/BJ-NcPn0r)]
* **Day 2: Distributed resource management [[doc](https://hackmd.io/@submarine/BJE9A52CS)] 1.5hr (Lecture 6)**
    * Specialized acceleration hardware 30min
        * GPU, TPU, FPGA, Vector Engine
    * YARN 15min
    * Kubernetes, containers 15min
    * Distributed GPU training [[doc](https://docs.google.com/document/d/1dFMyzzg9tc0WBxY8tDms6sjCGIBAlWwV4XjfC0D9Ksg/edit?usp=sharing)] 30min
        * GPU, FPGA, specialized hardware scheduling
        * TensorFlow on YARN
        * Other systems (ToY, TensorFlow on Spark …)
* **Day 3: Machine learning framework [[doc](https://hackmd.io/@submarine/ByavQs30r)] 1hr (Lecture 7)**
    * TensorFlow / MXNet / PyTorch / Caffe / Keras / XGBoost
* **Day 3: Machine Learning at Edge [[doc](https://docs.google.com/document/d/1ji5FriIcWybwCtxVi2SC_pnGszIcwlw-MUw6oz72hto/edit?usp=sharing)] 1hr (Lecture 8)**
* **Day 3: Model development/experiment/management [[doc](https://hackmd.io/@submarine/SyhJwsn0S)] 1hr (Lecture 9)**
    * Airbnb Bighead
* **Day 3: Model deployment, model serving / inference / monitoring [[doc](https://hackmd.io/@submarine/rJZbd16CB)] 1hr (Lecture 10)**
    * Serving infrastructure
## Hands-on sessions
Theme: (pick one)
* Flight delay
* Credit card fraud
* Predictive maintenance
* Telco churn
* Connected car
* Market basket analysis

Labs:
* Labs: Day 1: Installation, setup (mini-submarine)
* Labs: docker, kubernetes
* Labs: data ingestion
* Labs: data preparation, feature engineering
* Labs: model development, model training, model management
* Labs: model serving

Prerequisites:
* Graduate-level standing in a Computer Science or Data Science major
* Programming language: Java, Python or C++
* Operating Systems, Network Systems or Distributed Systems

References:
* [SysML](https://mlsys.org/)
* [CSE 599W: Systems for ML](https://dlsys.cs.washington.edu/)
* [How to build the Next-Gen ML/AI Platform](https://www.slideshare.net/JoshYeh/nextgen-mlai-platform)
* [Facebook FBLearner](https://engineering.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/)
* [LinkedIn Productive ML initiative and TensorFlow on YARN](https://www.slideshare.net/xkrogen/hadoop-meetup-jan-2019-tony-tensorflow-on-yarn-and-beyond)
* [Twitter Cortex](https://cortex.twitter.com/)
* [Airbnb Bighead](https://databricks.com/session/bighead-airbnbs-end-to-end-machine-learning-platform)
* [Netflix Machine Learning Infra](https://www.slideshare.net/FaisalZakariaSiddiqi/ml-infra-for-netflix-recommendations-ai-nextcon-talk) / [Machine Learning Infrastructure for Netflix Recommendations](https://www.youtube.com/watch?v=oS5-qEX5LC0)
* [Hidden Technical Debt in Machine Learning Systems](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)
* [The Institute for Ethical AI & Machine Learning: The Machine Learning Engineer Newsletter](https://ethical.institute/mle.html)
###### tags: `2019-minicourse-submarine` `Machine Learning`