# Paper review

Topic: [Practices and Infrastructures for ML Systems](https://www.techrxiv.org/articles/preprint/Practices_and_Infrastructures_for_ML_Systems_An_Interview_Study/16939192)

## References

[1] https://www.techrxiv.org/articles/preprint/Practices_and_Infrastructures_for_ML_Systems_An_Interview_Study/16939192
[2] https://towardsdatascience.com/ml-ops-machine-learning-as-an-engineering-discipline-b86ca4874a3f
[3] https://oag.ca.gov/privacy/ccpa
[4] https://gdpr.eu/
[5] http://ceur-ws.org/Vol-2191/paper13.pdf
[6] https://www.datacamp.com/blog/machine-learning-lifecycle-explained
[7] https://www.forbes.com/sites/quora/2017/01/26/is-data-more-important-than-algorithms-in-ai/
[8] https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists
[9] https://docs.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_ig_parquet.html

## Abstract

Recent developments in machine learning and artificial intelligence have enabled large-scale models that serve huge numbers of users with multiple model versions. Training and deploying a machine learning model is no longer a matter of a Jupyter notebook and a model on a single server. It now requires a set of practices for deploying models to production and building machine learning pipelines, involving both software engineers and DevOps engineers. At the same time, machine learning differs from software engineering because "it is not just code, it is code plus data" [2], so data management and handling are also crucial to any machine learning pipeline. A new field called MLOps has therefore arisen, combining machine learning, DevOps and data engineering, and covering the whole machine learning lifecycle: model generation, integration, orchestration, deployment and monitoring. Large corporations in many domains now require machine learning systems to be carefully deployed and monitored on dedicated infrastructure. In [1], the authors investigate machine learning workflow practices in 16 organizations and companies in Finland and answer two main questions: what the best MLOps practices are, and what tools are used to support them.

## Introduction

### Research questions

Machine learning systems are not simple for big organizations because multiple services, tasks and pipelines, such as databases, data warehouses, data lakes and data pipelines, need to be managed [1]. Many data scientists and data analysts may be doing research at the same time, each needing access to a data warehouse to run heavy queries. There are also multiple training jobs created and scheduled by machine learning engineers, triggered periodically, that must be managed automatically. The data infrastructure must be scalable and fault-tolerant, serving different requests from many users. Data access and permissions must also be handled with correct authorization: the CCPA [3] in the United States and the GDPR [4] in the EU guarantee data protection and privacy, so access to sensitive data stored by a corporation needs to be checked and monitored. Given the complexity of a machine learning system, many corporations have established practices and procedures to manage it. The authors of [1] studied the best practices in organizations of varying sector and size in Finland. The two main research questions are:

- What practices are applied in the development, deployment and maintenance of ML-enabled software systems?
- What tools are used to support the development, deployment and maintenance of ML-enabled software systems?

The practitioners in those corporations were interviewed in online meetings. The authors then transcribed the interviews and summarized the practices, challenges, tools and use cases [1].
### DevOps, MLOps and DataOps

DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). It grew out of the Agile methodology in software engineering and aims to guarantee the stability and quality of the software development cycle. With the rise of microservices architectures, DevOps has become a crucial part of continuous integration and continuous delivery, keeping the many services in an infrastructure highly available, scalable and consistent.

MLOps is a set of practices that aims to deploy machine learning models efficiently. MLOps shares some best practices with DevOps but applies specifically to machine learning. It also incorporates practices from data engineering for maintaining and monitoring the systems that collect data. The organizations in [1] focus mainly on MLOps but also utilize some tools from DevOps.

![](https://i.imgur.com/AXlUEPc.png)

(The relationship between MLOps and DevOps - [2])

DataOps is similar to MLOps but applies to data analytics [5]. However, since data analytics is not as close to software engineering as machine learning is, DataOps also includes practices specific to data analytics, aiming to keep the data lifecycle reliable and accurate. Like MLOps, DataOps borrows some methods from DevOps.

### Machine learning lifecycle

There are six main steps in a machine learning lifecycle [6]:

1. Planning: This step requires the scientists and engineers to truly understand the problem they are trying to solve. A model should not be trained without a deep understanding of the problem, because it might later turn out to be useless. Data availability must be researched carefully to ensure that the later steps are feasible; a famous saying in machine learning is that "data is more important than algorithms" [7]. The resources required to train and serve the models also need to be considered carefully: some models demand a huge GPU server, which is too costly for some small organizations and makes the whole lifecycle infeasible.
2. Data preparation: Data collection, labeling and cleaning are crucial in any machine learning lifecycle, since data quality has a huge impact on the model. In practice, about 80% of the time is spent on this step to ensure the data are valid and not corrupted [8]. Feature selection and feature engineering are also performed at this step to normalize the data and eliminate insignificant features, reducing the time and cost of training a model.
3. Model engineering: This is the most interesting part for machine learning scientists, in which the model architecture is built using specific machine learning algorithms. Model metrics are defined in preparation for the later phases in which model results are analyzed. Models are trained and validated on the data prepared in step 2. Training results are interpreted by domain experts or data analysts to determine whether a model is good enough for production deployment.
4. Model evaluation: The model is tested on held-out test data, and the metrics are compared with those recorded during training. Real-world data differ from training and testing data because of randomness, so model robustness should also be tested on data with some degree of added randomness. After this step, it is decided whether the model will be deployed to production. (A minimal code sketch of steps 2-4 follows this list.)
5. Model deployment: The model is deployed into the current system. MLOps plays an important role at this phase in guaranteeing that the model has enough resources while serving users. Normally an A/B test is performed at this step to validate the model's efficacy by comparing results between users who are served by the model and users who are not. Sometimes, even when a model performs well in steps 3 and 4, its real-world performance shows no statistically significant improvement (step 6 focuses more on that problem). Therefore, a model should never be fully released without careful validation. (A sketch of such a significance check also follows this list.)
6. Monitoring and maintenance: The model needs to be monitored continuously using the metrics defined in step 4. Resource metrics such as CPU and memory usage should also be monitored carefully. If an alert about reduced performance is triggered, the model needs to be reviewed; in some cases, step 3 is repeated to build a new model with some improvements.
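To make steps 2-4 concrete, here is a minimal sketch using scikit-learn on synthetic data; the dataset, pipeline and deployment threshold are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of lifecycle steps 2-4 with scikit-learn (synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Step 2 - data preparation: obtain data and split off a held-out test set.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 3 - model engineering: normalization plus a model, in one pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
model.fit(X_train, y_train)

# Step 4 - model evaluation: compute held-out metrics and apply a release gate.
preds = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print("f1:", f1_score(y_test, preds))
deploy = f1_score(y_test, preds) >= 0.9  # illustrative deployment threshold
print("promote to deployment:", deploy)
```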
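Step 5 mentions A/B testing and statistical significance. As one way such a check might look, here is a hand-rolled two-proportion z-test; the conversion counts are made up, and the paper does not prescribe any particular test.

```python
# Two-proportion z-test for a hypothetical A/B test of a new model.
from math import sqrt
from scipy.stats import norm

# Made-up numbers: conversions / users in control (no model) vs. treatment (model).
conv_a, n_a = 410, 5_000   # control group
conv_b, n_b = 465, 5_000   # treatment group

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled proportion under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))                      # two-sided p-value

print(f"z = {z:.3f}, p = {p_value:.4f}")
print("significant at 5%:", p_value < 0.05)        # here: borderline, not significant
```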
```mermaid
flowchart TB
    Planning --> data_preparation[Data preparation]
    data_preparation --> model_engineering[Model engineering]
    model_engineering --> model_evaluation[Model evaluation]
    model_evaluation --> model_deployment[Model deployment]
    model_deployment --> monitoring[Monitoring and maintenance]
```

(The six steps of a machine learning lifecycle)

## MLOps practices

### Practices applied in ML-enabled software systems

The authors of [1] point out several practices observed in the Finnish organizations studied. Data are normally stored on cloud platforms, mainly Google Cloud Platform (Cloud Storage), AWS (S3) or Azure Cloud. Google Cloud Storage organizes data in a hierarchy in which files are arranged into folders, like the file systems we use every day. AWS S3 is scalable storage that exposes a hierarchy-like interface but stores data as flat objects.

The data format is mainly Apache Parquet, but basic file formats (CSV, TSV or JSON) are also used. The trade-off lies between the scalability of data pipelines, readability, and ease of access. Apache Parquet offers column-wise compression and supports efficient queries, which makes it more scalable than plain file formats [9].
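A small sketch of that trade-off with pandas (assuming the pyarrow engine is installed; file names and columns are illustrative):

```python
# Compare CSV with Parquet: Parquet is columnar, so readers can fetch
# only the columns they need instead of scanning whole rows.
import pandas as pd

df = pd.DataFrame({
    "user_id": range(100_000),
    "country": ["FI"] * 100_000,
    "clicks": [3] * 100_000,
})

df.to_csv("events.csv", index=False)          # row-oriented text format
df.to_parquet("events.parquet", index=False)  # columnar, compressed (needs pyarrow)

# A query that touches one column reads only that column from Parquet.
clicks = pd.read_parquet("events.parquet", columns=["clicks"])
print(clicks["clicks"].sum())
```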
Data validation and labeling are usually done using custom in-house tools with human annotators. Model training takes advantage of transfer learning to save computing resources and to leverage bigger datasets. However, if there are sufficient data and resources to train to convergence, neural networks are normally trained without transfer learning. The most popular deep learning frameworks (Keras, TensorFlow and PyTorch) are used, along with scikit-learn, a Python toolkit for training machine learning models. Model evaluation is also a key step, supported by dedicated tracking tools and metadata logging. Infrastructure monitoring ensures that models have sufficient resources (GPU/CPU, memory, disk) and helps detect technical problems.

### Tools in ML-enabled software systems

- Version management: Git, GitLab, Bitbucket
- Infrastructure as code: Terraform
- ML training workflow: mainly based on a container platform (Docker)
  - Kubernetes (k8s) for container orchestration
  - AWS SageMaker for model training
  - Apache Airflow to schedule model training based on triggers
  - MLflow or TensorBoard to evaluate model performance (a minimal MLflow sketch appears under Tool review below)
- CI/CD:
  - Jenkins to run tests and build Docker images based on the workflow
- Model repository: AWS S3, PostgreSQL (how do you store a model in PostgreSQL? - my question)
- Model deployment: via REST or gRPC
  - KFServing and Seldon
- Monitoring:
  - Grafana and Prometheus for metrics and dashboards
  - Elasticsearch and BigQuery for logs
  - AWS CloudWatch

### Cost

## Tool review
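As a starting point, here is a minimal sketch of MLflow, one of the experiment-tracking tools listed above; the experiment name, parameters and metric values are illustrative, not taken from the paper.

```python
# Minimal MLflow tracking sketch: log parameters, metrics and a model artifact.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1_000, random_state=0)
model = LogisticRegression(max_iter=1_000).fit(X, y)

mlflow.set_experiment("demo-experiment")        # illustrative experiment name
with mlflow.start_run():
    mlflow.log_param("max_iter", 1_000)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")    # store the model as an artifact
```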