Lecture 4: Data Platform

# Lecture 4: Data Platform

Part of the mini-course [Apache Submarine: Design and Implementation of a Machine Learning Platform](https://hackmd.io/@submarine/B17x8LhAH). Day 2, [Lecture 4](https://cloudera2008-my.sharepoint.com/:p:/g/personal/weichiu_cloudera2008_onmicrosoft_com/EcQ18UMXJhtIjRlmuSfUdoMBJde9o7fnN4PouknQvXVaoQ?e=wkAzL4)

* 2hr
* Typical operations involved
    * Any machine learning project is also a data project. Any machine learning platform is also a data platform.
    * Data collection
    * Data preprocessing
        * [https://www.xenonstack.com/blog/data-preparation/](https://www.xenonstack.com/blog/data-preparation/)
    * Data wrangling (data scientists)
        * Also called ELT
        * dplyr (R), pandas (Python)
        * [https://tdwi.org/articles/2017/02/10/data-wrangling-and-etl-differences.aspx](https://tdwi.org/articles/2017/02/10/data-wrangling-and-etl-differences.aspx)
        * **Data wrangling**, sometimes referred to as **data munging**, is the process of transforming and [mapping data](https://en.wikipedia.org/wiki/Data_mapping) from one "[raw](https://en.wikipedia.org/wiki/Raw_data)" data form into another [format](https://en.wikipedia.org/wiki/Content_format) with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A **data wrangler** is a person who performs these transformation operations.
        * [http://cs.brown.edu/courses/cs100/lectures/lecture6b.pdf](http://cs.brown.edu/courses/cs100/lectures/lecture6b.pdf)
    * Data visualization
    * ETL (IT)
* Distributed systems
    * Data volume grows faster than CPU speed.
    * Distributed systems scale horizontally.
    * Specialized hardware such as GPUs helps certain applications scale vertically.
    * However, vertical scaling has limits (compute power and memory size), so the latest trend is to train on multiple GPUs in a distributed manner.
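
The data wrangling definition above (transforming "raw" data into a form better suited to downstream analytics) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the sample data, the column names, and the helper functions `wrangle` and `total_by_country` are hypothetical, and it uses only the Python standard library rather than pandas or dplyr so that it runs without extra dependencies.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export: inconsistent casing, stray whitespace, one missing value.
RAW_CSV = """user,country,amount
alice, us ,10.5
bob,DE,
carol,us,4.5
dave,de,20
"""


def wrangle(raw_text):
    """Turn raw CSV text into clean, typed records."""
    records = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        amount = row["amount"].strip()
        if not amount:
            # Assumed policy: drop rows with a missing amount.
            # Imputing a default value would be another option.
            continue
        records.append({
            "user": row["user"].strip(),
            "country": row["country"].strip().upper(),  # normalize casing
            "amount": float(amount),                     # string -> number
        })
    return records


def total_by_country(records):
    """Aggregate the cleaned records for a downstream analytics step."""
    totals = defaultdict(float)
    for record in records:
        totals[record["country"]] += record["amount"]
    return dict(totals)


clean = wrangle(RAW_CSV)
totals = total_by_country(clean)
print(totals)
```

With pandas or dplyr the same steps (trim, normalize, cast, drop missing values, group and sum) would typically be expressed as a few chained operations on a data frame.
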
* Data/ML open source ecosystem
    * Landscape of data analytics (1hr - 2hr)
        * Hadoop, Spark, Hive, Kafka, Flink, …
        * ML: Apache Mahout, Spark MLlib, H2O, SystemML, scikit-learn, pandas
    * File format
        * CSV, Avro, Parquet, ORC, CarbonData, ...
        * Open Neural Network Exchange (ONNX) format
        * Apache Arrow: in-memory data format
        * [https://www.jowanza.com/blog/which-hadoop-file-format-should-i-use](https://www.jowanza.com/blog/which-hadoop-file-format-should-i-use)
    * Notebooks
        * Jupyter, Zeppelin, Stencila, RStudio (R), Polynote (Scala), ...
        * [https://landscape.lfai.foundation/category=notebook-environment&format=card-mode&grouping=category](https://landscape.lfai.foundation/category=notebook-environment&format=card-mode&grouping=category)
    * BI, data verification tool, ETL tool
    * Security, data privacy, GDPR

###### tags: `2019-minicourse-submarine`, `Machine Learning`