# Lecture 5: Feature Extraction

Part of the mini-course [Apache Submarine: Design and Implementation of a Machine Learning Platform](https://hackmd.io/@submarine/B17x8LhAH). Day 2, Lecture 5 (0.5 hr).

* Feature
* Data Management Challenges in Production Machine Learning
  * [Data Management Challenges in Production Machine Learning (publication)](http://delivery.acm.org/10.1145/3060000/3054782/p1723-polyzotis.pdf?ip=114.32.23.188&id=3054782&acc=OA&key=4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E5945DC2EABF3343C&__acm__=1576987615_a556de7e328756ad0213be95741b8a5f)
  * [Data Management Challenges in Production Machine Learning (slides)](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46178.pdf)
  * [SIGMOD Tutorial 9, Friday May 19, 9 AM](https://www.youtube.com/watch?v=U8J0Dd_Z5wo)
* Data Lifecycle Challenges in Production Machine Learning: A Survey
  * [Data Lifecycle Challenges in Production Machine Learning: A Survey (publication)](https://sigmodrecord.org/publications/sigmodRecord/1806/pdfs/04_Surveys_Polyzotis.pdf)
* Much research focuses on data-flow optimization, i.e., making training and serving more efficient. But the data itself is important and understudied. This observation comes from the development of TFX, which assumes semi-continuous data ingestion, where training data arrives in batches.
* Three personas in a production ML system: the ML expert, the SWE, and the SRE.
* Data understanding, data validation, data cleaning, data enrichment.
* Validation is important.
  * A bad value can impact the downstream pipeline. Unlike traditional DB research on structured data, modern ML often operates on semi-structured or unstructured data, and input data can come from multiple sources. A schema is typically not enforced during the data-ingest phase (schema-on-read), so erroneous data needs validation/cleaning. More generally, the training data may not be a good proxy for the serving data.
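The schema-on-read point above can be sketched in code. This is a minimal, hypothetical illustration (not TFX's or TFDV's actual API): each incoming batch is checked against an expected schema so that a bad value is caught before it propagates downstream. The schema and field names are invented for the example.

```python
# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED_SCHEMA = {
    "age": int,
    "country": str,
    "clicked": bool,
}

def validate_batch(rows, schema=EXPECTED_SCHEMA):
    """Return a list of (row_index, field, problem) anomalies for a batch.

    Because the schema is not enforced at ingest (schema-on-read), this
    kind of check has to happen explicitly before training or serving.
    """
    anomalies = []
    for i, row in enumerate(rows):
        for field, expected_type in schema.items():
            if field not in row:
                anomalies.append((i, field, "missing"))
            elif not isinstance(row[field], expected_type):
                anomalies.append((i, field, f"expected {expected_type.__name__}"))
    return anomalies

batch = [
    {"age": 34, "country": "TW", "clicked": True},
    {"age": "34", "country": "TW"},  # wrong type for "age", "clicked" missing
]
print(validate_batch(batch))
# [(1, 'age', 'expected int'), (1, 'clicked', 'missing')]
```

Real validators (e.g., TensorFlow Data Validation) additionally infer the schema from data and check distributional drift, not just types and presence.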
* As noted in the Airbnb Bighead example, inconsistency between offline data (used for training) and online data (used for serving) is a big problem. Both must go through the same data processing pipeline to ensure consistency. For example, inconsistency can arise if the training data is encoded as the strings "0"/"1" while the serving data is encoded as boolean 0/1.
* Feature Store
  * [Introducing Feast: an open source feature store for machine learning](https://cloud.google.com/blog/products/ai-machine-learning/introducing-feast-an-open-source-feature-store-for-machine-learning)
  * [Horizontally Scalable ML Pipelines with a Feature Store](https://www.sysml.cc/doc/2019/demo_7.pdf)
  * [Feature Store: A Data Management Layer for Machine Learning](https://archive.fosdem.org/2019/schedule/event/feature_store/)
  * A feature can be as simple as the value of a column in a database entry, or it can be a complex value computed from diverse sources.
* Feature compression
  * [Embeddings@Twitter](https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html): As ML engineers continue to refine and improve a model, the number of input features may grow to such a size that online inference slows to the point where adding more features is intractable. Precomputed embeddings offer a solution if some of the features are known prior to inference time and can be grouped together and passed through one or more layers of the network. This "embedding subnetwork" can be run in batch offline or in an online prediction service. The returned output embeddings are then stored for online inference, where latencies are much faster than if the full network had to be evaluated.
* Data Catalog
  * Data governance, data lineage, memoization
  * Apache Atlas
  * Netflix Metaflow, Lyft Flyte
  * "Middle platform" (中台)
  * Memoization (Lyft) / task code and data snapshots (Netflix)

###### tags: `2019-minicourse-submarine`, `Machine Learning`
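The "embedding subnetwork" idea from the Embeddings@Twitter note can be sketched as follows. This is a minimal toy model (random weights and shapes invented for illustration, not Twitter's actual system): the first layers depend only on features known ahead of time, so they are run offline in batch and their outputs cached; online inference then evaluates only the remaining layers.

```python
import numpy as np

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(8, 4))  # weights of the embedding subnetwork (toy)
W_head = rng.normal(size=(4, 1))   # weights of the remaining layers (toy)

def embed(x):
    """Embedding subnetwork: run offline in batch over known features."""
    return np.tanh(x @ W_embed)

def head(e, live_bias):
    """Remaining layers: run online, combining a cached embedding with live input."""
    return e @ W_head + live_bias

# Offline: precompute and store embeddings for all known entities.
known_features = rng.normal(size=(100, 8))
embedding_store = {i: embed(known_features[i:i + 1]) for i in range(100)}

# Online: look up the cached embedding instead of evaluating the full network.
def predict(entity_id, live_bias):
    return head(embedding_store[entity_id], live_bias)

full = head(embed(known_features[7:8]), 0.5)  # full network, for comparison
fast = predict(7, 0.5)                        # cached embedding + head only
assert np.allclose(full, fast)  # same result, cheaper online compute
```

The split gives identical predictions while moving most of the per-request compute offline; the trade-off is that cached embeddings go stale when the subnetwork's weights are retrained.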