
# Course Overview

Part of the mini-course [Apache Submarine: Design and Implementation of a Machine Learning Platform](https://docs.google.com/document/d/1Cf1My37AHOvdCqpsBmMpWy7XPuJUvRrY2L6kSBJb7n0/edit?usp=sharing). Day 1, [Lecture 0](https://cloudera2008-my.sharepoint.com/:p:/g/personal/weichiu_cloudera2008_onmicrosoft_com/EfNzfC8Z0G1JpMSZ0wMBq3kB71jjBgBU22vKP8j0zH_gZg?e=Jg64H7)

Spend 20-30 minutes walking through the topics in this course.

Self intro: my academic background, Cloudera, Apache Hadoop/Submarine committer, TDEA

This course will be partly practitioner-oriented and partly academic.

This course is not for:

* learning machine learning algorithms.
* learning ML tools or frameworks.

This is a new and active area, and many of the topics I am going to talk about did not exist even 2 years ago. I don’t want you to learn something that becomes irrelevant in 2 years. What matters is the concepts. Learn what will still be useful 10 or even 20 years out.

This course is for:

Machine learning engineers / data scientists

* demonstrate the architecture of a Machine Learning Platform.
* show how a production machine learning project is run.

Systems researchers / machine learning infrastructure developers

* show how to leverage open source software.
* give a general understanding of machine learning systems, to inspire future research ideas.

Students

* give a grasp of practical machine learning systems.

Ask the class about its composition: how many are undergraduates, graduate students, or already graduated.

On Day 1, I will give a glimpse of production machine learning projects at some of the most advanced technology companies in the industry. I will introduce the Machine Learning Platform (MLP) at each of these companies and offer several case studies of these systems. Then I will talk about Machine Learning Platforms: why they are needed, and why simple tools, such as notebooks, are not sufficient.
Although I cover many systems above, they all share a similar architecture and process: data collection -> data preparation -> feature engineering -> model development -> model serving. Unlike ML frameworks and ML libraries, which are mostly open source, there is no good open source MLP meant to support large-scale data/ML projects. Enter Submarine.

Submarine: open source community and development. I want to use this opportunity to evangelize open source development, systems research and talent acquisition.

On Day 2: The main focus of this day is the Data Platform. We will look at it from a data engineer’s perspective as well as a data scientist’s perspective. We will cover the current landscape of the data analytics market. At the end, we will talk about data privacy and governance.

From this point on, we enter the realm of data scientists / machine learning engineers.

Feature extraction: we will talk about feature engineering and feature stores.

Distributed resource management: it is common to use specialized hardware to accelerate compute. We will talk about why this hardware is essential, and how specialized hardware is supported in a large-scale environment.

On Day 3: We spend the final day on machine learning development, management and serving.

Lab

* Set up a Hadoop cluster
* Set up Submarine
* Set up data source
* Set up ETL pipeline
* Set up model development
* Set up model serving infrastructure

###### tags: `2019-minicourse-submarine`, `Machine Learning`