# Lecture 4: Data Platform
Part of the mini-course [Apache Submarine: Design and Implementation of a Machine Learning Platform](https://hackmd.io/@submarine/B17x8LhAH). Day 2, [Lecture 4](https://cloudera2008-my.sharepoint.com/:p:/g/personal/weichiu_cloudera2008_onmicrosoft_com/EcQ18UMXJhtIjRlmuSfUdoMBJde9o7fnN4PouknQvXVaoQ?e=wkAzL4)
* 2hr
* Typical operations involved in a data platform.
* Any machine learning project is also a data project. Any machine learning platform is also a data platform.
* Data Collection
* Data preprocessing
* [https://www.xenonstack.com/blog/data-preparation/](https://www.xenonstack.com/blog/data-preparation/)
* Data wrangling (Data scientists)
* Also called ELT
* dplyr (R), pandas (Python)
* [https://tdwi.org/articles/2017/02/10/data-wrangling-and-etl-differences.aspx](https://tdwi.org/articles/2017/02/10/data-wrangling-and-etl-differences.aspx)
* **Data wrangling**, sometimes referred to as **data munging**, is the process of transforming and [mapping data](https://en.wikipedia.org/wiki/Data_mapping) from one "[raw](https://en.wikipedia.org/wiki/Raw_data)" data form into another [format](https://en.wikipedia.org/wiki/Content_format) with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A **data wrangler** is a person who performs these transformation operations.
* [http://cs.brown.edu/courses/cs100/lectures/lecture6b.pdf](http://cs.brown.edu/courses/cs100/lectures/lecture6b.pdf)
* Data visualization
* ETL (IT)
* Distributed systems
* Data volume grows faster than CPU speed.
* Distributed systems scale horizontally.
* Specialized hardware such as GPUs helps certain applications scale vertically.
* However, vertical scaling has limits (compute power and memory size), and the latest trend is to train on multiple GPUs in a distributed manner.
* Data/ML Open source ecosystem
* Landscape of data analytics (1hr - 2hr)
* Hadoop, Spark, Hive, Kafka, Flink, …
* ML: Apache Mahout, Spark MLlib, H2O, SystemML, scikit-learn, pandas
* File format
* CSV, Avro, Parquet, ORC, CarbonData, ...
* Open Neural Network Exchange (ONNX) format
* Apache Arrow: in-memory data format
* [https://www.jowanza.com/blog/which-hadoop-file-format-should-i-use](https://www.jowanza.com/blog/which-hadoop-file-format-should-i-use)
* Notebooks
* Jupyter, Zeppelin, Stencila, RStudio (R), Polynote (Scala), ... [https://landscape.lfai.foundation/category=notebook-environment&format=card-mode&grouping=category](https://landscape.lfai.foundation/category=notebook-environment&format=card-mode&grouping=category)
* BI, data verification tool, ETL tool
* Security, Data privacy, GDPR
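
The data wrangling step in the outline above (transforming raw data into a form better suited to downstream analytics) can be illustrated with a minimal sketch. The sample data, column names, and the `wrangle` helper are hypothetical and chosen only for illustration; the sketch uses the Python standard library so it stays self-contained, although a library such as pandas would typically express the same steps more compactly.

```python
import csv
import io
import json

# Hypothetical raw export: stray whitespace, mixed case, and a missing value.
RAW_CSV = """user_id, country ,age
1, tw ,34
2,US,
3, us,29
"""


def wrangle(raw_text):
    """Map raw CSV rows into clean, typed records."""
    reader = csv.DictReader(io.StringIO(raw_text))
    records = []
    for row in reader:
        # Normalize whitespace in both header names and cell values.
        clean = {key.strip(): value.strip() for key, value in row.items()}
        if not clean["age"]:
            # Drop rows with a missing age. This is one possible policy;
            # imputing a default value would be another.
            continue
        records.append({
            "user_id": int(clean["user_id"]),
            "country": clean["country"].upper(),
            "age": int(clean["age"]),
        })
    return records


records = wrangle(RAW_CSV)
print(json.dumps(records, indent=2))
```

The sketch covers three common wrangling operations: cleaning (whitespace and case), filtering (incomplete rows), and type mapping (strings to integers). The result is a list of uniform records that could then be written to a columnar format or passed to a training pipeline.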
###### tags: `2019-minicourse-submarine`, `Machine Learning`