# Lecture 7: Machine learning framework
Part of mini-course of [Apache Submarine: Design and Implementation of a Machine Learning Platform](https://hackmd.io/@submarine/B17x8LhAH). Day 3, [Lecture 7](https://cloudera2008-my.sharepoint.com/:p:/g/personal/weichiu_cloudera2008_onmicrosoft_com/EXOLsgBqvOlLkjb12xzEzCcBMiUmkuLq7Imd7j73zBsGuA?e=LsMnDE)
* 1 hr
* ML Platform → ML Framework → ML Algorithms
* ML Platform = supports one or more ML frameworks + a toolkit to support the ML workflow
* ML Framework / ML library = an implementation of ML algorithms; supports one or more ML algorithms and one or more languages
* Distributed ML Frameworks: XGBoost vs. Spark vs. H2O
* [Tianqi Chen - XGBoost: Overview and Latest News - LA Meetup Talk](https://speakerdeck.com/datasciencela/tianqi-chen-xgboost-overview-and-latest-news-la-meetup-talk?slide=12)
* Gradient Tree Boosting / Gradient Boosting Machine
* [https://homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html](https://homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html)
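A minimal sketch of the gradient tree boosting idea behind XGBoost (squared-error loss, depth-1 "stumps"; pure Python for illustration, not XGBoost's API or its regularized objective):

```python
# Gradient boosting for regression with squared-error loss.
# Each round fits a depth-1 "stump" to the current residuals
# (the negative gradient), then adds it with a learning rate.

def fit_stump(xs, residuals):
    """Pick the split on x that best reduces squared error of the residuals."""
    best = None
    for split in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def gradient_boost(xs, ys, rounds=50, lr=0.1):
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]   # a noisy step function
model = gradient_boost(xs, ys)
```

XGBoost adds a regularized objective, second-order gradients, and a highly optimized distributed implementation on top of this basic scheme.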
* ML Frameworks: Mahout / Spark MLlib / H2O / scikit-learn / LIBSVM
* DL Framework: TensorFlow / MXNet / PyTorch / Keras / MLPack / MATLAB / Mathematica / Deeplearning4j / BigDL / The Microsoft Cognitive Toolkit [CNTK](https://github.com/Microsoft/CNTK#disclaimer) (inactive) / Caffe (inactive)
* [Comparison of deep-learning software](https://en.wikipedia.org/wiki/Comparison_of_deep-learning_software)
* TensorFlow
* [TensorFlow: A System for Large-Scale Machine Learning (publications)](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf)
* [TensorFlow: A System for Large-Scale Machine Learning (slides)](http://www.cs.cornell.edu/courses/cs6453/2017sp/slides/tensorflow.pdf)
* Parameter server
* More on this in Lecture 9.
* Model warm starting
* Model quality vs. model freshness.
* Transfer learning
* For developing a new version of an existing model: initialize the parameters from the previous version of the model; training then converges much more quickly.
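The warm-starting idea can be sketched without any particular framework (the parameter dicts and `warm_start` helper here are illustrative, not a real API): copy weights from the previous model wherever names and shapes still match, and keep fresh initialization for anything new.

```python
import random

def warm_start(new_params, old_params):
    """Copy weights from the previous model where names and shapes match;
    new or reshaped parameters keep their fresh random initialization."""
    for name, value in new_params.items():
        if name in old_params and len(old_params[name]) == len(value):
            new_params[name] = list(old_params[name])  # reuse learned weights
    return new_params

# Previous model version (already trained).
old = {"layer1": [0.5, -0.2, 0.1], "head": [1.0, 2.0]}

# New model version: same layer1, a widened head, and a brand-new layer.
new = {"layer1": [random.random() for _ in range(3)],   # same shape -> reused
       "head":   [random.random() for _ in range(4)],   # shape changed -> fresh
       "layer2": [random.random() for _ in range(2)]}   # new layer -> fresh
new = warm_start(new, old)
```

Transfer learning applies the same mechanics, but copies from a model trained on a different (usually larger) task rather than from a previous version of the same model.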
* XGBoost
* [Kaggle survey 2019](https://www.kaggle.com/kaggle-survey-2019) on the popularity of ML frameworks among practitioners
* File format
* Human readable: JSON, XML, CSV
* Binary: Avro, Parquet, ORC, CarbonData
* Serialization: Thrift, Protocol Buffers
* References: Chapter 4, Designing Data-Intensive Applications
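The human-readable vs. binary trade-off in one small example (stdlib `json` and `struct` as stand-ins; Avro/Thrift/Protobuf wire formats are more involved, but the size/schema trade-off is the same):

```python
import json
import struct

record = {"user_id": 123456, "score": 0.97}

# Human-readable: self-describing and easy to debug, but verbose.
as_json = json.dumps(record).encode("utf-8")

# Binary: compact and fast to parse, but needs an out-of-band schema
# (here: little-endian unsigned 32-bit int + 64-bit double).
as_binary = struct.pack("<Id", record["user_id"], record["score"])

# Decoding the binary form requires knowing the schema,
# just as Avro/Thrift/Protobuf readers need the schema or IDL.
user_id, score = struct.unpack("<Id", as_binary)
```

Here the binary encoding is 12 bytes versus roughly three times that for the JSON text, and the gap widens with string-heavy records.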
* In-memory format
* The cost of serialization/conversion is very high for data-intensive systems.
* In-memory representations of data frames are not compatible across systems.
* Apache Arrow
* Columnar
* Data locality / Cache friendly
* SIMD vectorization is possible, e.g. via the AVX-512 instruction set.
* Open standard, common format
* [Arrow Columnar Format](https://arrow.apache.org/docs/format/Columnar.html#format-columnar)
* Dramatic PySpark speedup, around 13x, when Arrow is used for the Spark data frame to Pandas conversion.
* [Development update: High speed Apache Parquet in Python with Apache Arrow](https://wesmckinney.com/blog/python-parquet-update/) - PyArrow's Parquet library is much faster than fastparquet, a Python interface to the Parquet file format (reading a file and converting it to a pandas data frame).
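The columnar-layout point can be illustrated without Arrow itself: storing a table column-wise keeps each column's values contiguous, so a per-column aggregate scans one dense array instead of hopping across many row objects (this is a pure-Python sketch of the idea, not Arrow's actual memory format):

```python
# Row-oriented: a list of records; the "price" values are scattered
# across many small dicts, so scanning one column touches them all.
rows = [{"id": i, "price": float(i) * 0.5} for i in range(1000)]
row_total = sum(r["price"] for r in rows)

# Column-oriented (the Arrow idea): each column is one contiguous array.
# This is cache-friendly and, in a real engine, amenable to SIMD.
columns = {
    "id": list(range(1000)),
    "price": [float(i) * 0.5 for i in range(1000)],
}
col_total = sum(columns["price"])

assert row_total == col_total  # same answer, very different memory access
```

Arrow additionally fixes the byte-level layout as an open standard, so systems can hand each other column buffers with zero copying or conversion.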
* ONNX
* Open file format for deep learning models, supported by many DL frameworks and tools.
* TensorFlow, PyTorch, MXNet …
* A model may be trained in one framework, converted to ONNX format, and then served by another framework.
* Pre-trained models in ONNX format [onnx/models](https://github.com/onnx/models)
* Distributed ML frameworks
* Apache Mahout → Apache Spark MLlib
* ML requires many iterations, and Hadoop MapReduce is intrinsically slow for iterative workloads.
* While DL is often a clear winner in computer vision and speech recognition, traditional ML techniques are still useful for most ML jobs.
* TensorFlow: [https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf)
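The "many iterations" point in miniature: gradient descent re-reads the same training data on every pass, so a framework that keeps the dataset in memory (Spark/MLlib) beats one that reloads it from storage each iteration (MapReduce). A toy sketch in pure Python (`load_from_disk` is a stand-in for a real data source, not an actual API):

```python
def load_from_disk(reads):
    """Stand-in for reading the training set; counts how often it is read."""
    reads[0] += 1
    return [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # (x, y) pairs, y ~ 2x

def gd_mapreduce_style(iterations=100, lr=0.01):
    """MapReduce-style: every iteration re-reads the data from storage."""
    reads, w = [0], 0.0
    for _ in range(iterations):
        data = load_from_disk(reads)
        w += lr * sum((y - w * x) * x for x, y in data) / len(data)
    return w, reads[0]

def gd_cached(iterations=100, lr=0.01):
    """Spark-style: load once, keep the dataset in memory across iterations."""
    reads = [0]
    data = load_from_disk(reads)
    w = 0.0
    for _ in range(iterations):
        w += lr * sum((y - w * x) * x for x, y in data) / len(data)
    return w, reads[0]

w1, n1 = gd_mapreduce_style()   # 100 reads
w2, n2 = gd_cached()            # 1 read, identical result
```

Both variants compute exactly the same weight; the only difference is 100 reads versus 1, which is the gap MLlib's in-memory caching closes over MapReduce.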

* Reference: Nguyen, G., Dlugolinsky, S., Bobák, M., Tran, V., García, Á. L., Heredia, I., ... & Hluchý, L. (2019). Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey. *Artificial Intelligence Review*, 52(1), 77-124.
###### tags: `2019-minicourse-submarine` `Machine Learning`