# Lecture 7: Machine learning framework

Part of the mini-course [Apache Submarine: Design and Implementation of a Machine Learning Platform](https://hackmd.io/@submarine/B17x8LhAH). Day 3, [Lecture 7](https://cloudera2008-my.sharepoint.com/:p:/g/personal/weichiu_cloudera2008_onmicrosoft_com/EXOLsgBqvOlLkjb12xzEzCcBMiUmkuLq7Imd7j73zBsGuA?e=LsMnDE)

* 1 hr
* ML Platform → ML Framework → ML Algorithms
  * ML Platform = supports one or multiple ML frameworks, plus a toolkit to support the ML workflow
  * ML Framework / ML library = an implementation of ML algorithms; supports one or more ML algorithms and one or more languages
* Distributed ML frameworks: XGBoost vs. Spark vs. H2O
  * [Tianqi Chen - XGBoost: Overview and Latest News - LA Meetup Talk](https://speakerdeck.com/datasciencela/tianqi-chen-xgboost-overview-and-latest-news-la-meetup-talk?slide=12)
  * Gradient Tree Boosting / Gradient Boosting Machine
  * [Story and lessons behind the evolution of XGBoost](https://homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html)
* ML frameworks: Mahout / Spark MLlib / H2O / scikit-learn / LIBSVM
* DL frameworks: TensorFlow / MXNet / PyTorch / Keras / mlpack / MATLAB / Mathematica / Deeplearning4j / BigDL / The Microsoft Cognitive Toolkit [CNTK](https://github.com/Microsoft/CNTK#disclaimer) (inactive) / Caffe (inactive)
  * [Comparison of deep-learning software](https://en.wikipedia.org/wiki/Comparison_of_deep-learning_software)
* TensorFlow
  * [TensorFlow: A System for Large-Scale Machine Learning (paper)](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf)
  * [TensorFlow: A System for Large-Scale Machine Learning (slides)](http://www.cs.cornell.edu/courses/cs6453/2017sp/slides/tensorflow.pdf)
  * Parameter server
    * More on this in Lecture 9.
  * Model warm starting
    * Model quality vs. model freshness.
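The warm-starting idea above (reuse a previous model's parameters as the initialization for the next training run) can be sketched with a toy one-parameter gradient-descent model. This is an illustrative sketch only, not TensorFlow's actual mechanism; all names here are hypothetical.

```python
# Toy warm-starting sketch: fit y = 2x with one-parameter gradient descent,
# then initialize a "new version" of the model from the previous parameter
# instead of from scratch. (Illustrative only, not a TensorFlow API.)

def train(w0, steps, lr=0.1):
    """Minimize mean squared error of w*x against 2*x on a tiny dataset."""
    xs = [1.0, 2.0, 3.0, 4.0]
    w = w0
    for _ in range(steps):
        grad = sum(2 * (w * x - 2.0 * x) * x for x in xs) / len(xs)
        w -= lr * grad
    return w

def loss(w):
    xs = [1.0, 2.0, 3.0, 4.0]
    return sum((w * x - 2.0 * x) ** 2 for x in xs) / len(xs)

w_v1 = train(w0=0.0, steps=50)    # "previous version" of the model
w_cold = train(w0=0.0, steps=5)   # new model trained from scratch
w_warm = train(w0=w_v1, steps=5)  # new model warm-started from v1

# With the same small step budget, the warm-started run begins near the
# optimum, so it reaches a lower loss -- much quicker to converge.
```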
  * Transfer learning
    * For developing a new version of an existing model: initialize the parameters from the previous version of the model. Much quicker to converge.
* XGBoost popularity among ML frameworks: [Kaggle survey 2019](https://www.kaggle.com/kaggle-survey-2019)
* File formats
  * Human readable: JSON, XML, CSV
  * Binary: Avro, Parquet, ORC, CarbonData
  * Serialization: Thrift, Protocol Buffers
  * Reference: Chapter 4, *Designing Data-Intensive Applications*
* In-memory formats
  * The cost of serialization/conversion is very high for data-intensive systems.
  * In-memory representations of data frames are not compatible across systems.
  * Apache Arrow
    * Columnar
    * Data locality / cache friendly
    * SIMD vectorization is possible via the AVX-512 instruction set.
    * Open standard, common format
    * [Arrow Columnar Format](https://arrow.apache.org/docs/format/Columnar.html#format-columnar)
    * Dramatic speedup (~13x) in PySpark when Arrow is used for the data frame to Pandas conversion.
    * [Development update: High speed Apache Parquet in Python with Apache Arrow](https://wesmckinney.com/blog/python-parquet-update/) - PyArrow's Parquet library is much faster than fastparquet, a Python interface to the Parquet file format (reading a file and converting it to a pandas data frame)
* ONNX
  * Open file format for deep learning models. Supported by many DL frameworks and tools: TensorFlow, PyTorch, MXNet, …
  * A model may be trained in one framework, converted to ONNX format, and then served by another framework.
  * Pre-trained models in ONNX format: [onnx/models](https://github.com/onnx/models)

## Distributed ML frameworks

Apache Mahout → Apache Spark MLlib. ML requires many iterations, and Hadoop MapReduce is intrinsically slow. While DL is often a clear winner in computer vision and voice recognition, traditional ML techniques are still useful for most ML jobs.
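The columnar-layout advantage noted under Apache Arrow (one contiguous buffer per column, so a column scan is cache friendly) can be sketched in plain Python. The stdlib `array` here stands in for an Arrow buffer; this is a toy illustration, not the PyArrow API.

```python
# Row-oriented vs columnar layout of the same tiny table.
from array import array

# Row-oriented: each record is stored together.
rows = [(1, 10.0), (2, 20.0), (3, 30.0)]

# Columnar: one contiguous, fixed-width buffer per column
# (the core idea behind Arrow's in-memory format).
cols = {
    "id": array("q", [1, 2, 3]),              # 8-byte signed ints
    "score": array("d", [10.0, 20.0, 30.0]),  # 8-byte doubles
}

# Summing one column: the row layout strides across every record,
# while the columnar layout scans a single sequential buffer --
# which is also what makes SIMD vectorization possible.
row_sum = sum(score for _, score in rows)
col_sum = sum(cols["score"])
```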
TensorFlow: [TensorFlow: A System for Large-Scale Machine Learning](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf)

![](https://lh5.googleusercontent.com/vb8Cvzw4AUYO4GtvQnxcZpLyVlw_NcRGHeZD5LFtxreZDZEu5DjH9GmUcqzoUUXhPEBSVHgUNHe15_95guZcjVWwv4_s3BzWKN6im-Uo58q0h_FUTuE1q5Ak8OcJpmAMU1GQ66__)

Nguyen, G., Dlugolinsky, S., Bobák, M., Tran, V., García, Á. L., Heredia, I., ... & Hluchý, L. (2019). Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. *Artificial Intelligence Review*, 52(1), 77-124.

###### tags: `2019-minicourse-submarine` `Machine Learning`