# 2024.04.12 Proposal

[TOC]

## ML Pipeline

* Backend database: MySQL vs **PostgreSQL**
    * About 80% of Airflow developers use PostgreSQL, and the share is increasing every year.
      ![image](https://hackmd.io/_uploads/ByR4TVrgR.png)
    * https://airflow.apache.org/blog/airflow-survey-2022/#which-metadata-database-do-you-use-single-choice
* Airflow pipeline overview:
    * Each component (deepdata, mlflow, alert server...) has its own Airflow DAG.
    * The DAGs are combined into an overall DAG with **TriggerDagRunOperator**.
    * To pass large data such as training data and models between tasks, use remote storage (S3, HDFS, Kafka...) to cache the data. [REF](https://ithelp.ithome.com.tw/articles/10329530)
        * We will first try our own HDFS; we are not sure whether it can support production. (Most others use S3.)
    * In deepdata, shorten each pipeline to reduce the overhead of uploading/downloading to remote storage.
        * Combine data-preprocess and data-split into one data-preprocessing task.
* Per discussion, are there any more practical examples of applications, rather than just independent function calls?
    * https://medium.com/thefork/a-guide-to-mlops-with-airflow-and-mlflow-e19a82901f88
    * https://zhuanlan.zhihu.com/p/553395304
* If we want several different training methods, do we have to create many duplicate DAGs?
    * No. The function behind the training task can branch on the model name and run a different training method for each model:

      ```python=
      from datetime import datetime

      from airflow import DAG
      from airflow.operators.python import PythonOperator


      def train_model(model_name):
          # Branch on the model name to pick the training method.
          if model_name == 'model_a':
              ...
          elif model_name == 'model_b':
              ...
          else:
              ...


      default_args = {
          'start_date': datetime(2024, 4, 12),
      }

      dag = DAG(
          'dynamic_training',
          default_args=default_args,
      )

      train_model_task = PythonOperator(
          task_id='train_model',
          python_callable=train_model,
          # Pass model_name from the DAG Run configuration.
          op_kwargs={'model_name': '{{ dag_run.conf["model_name"] }}'},
          dag=dag,
      )
      ```
* If the code called between tasks is not a plain function but a method of a large class, with class variables that are maintained jointly, can those public/private class variables be maintained across tasks? (And how much of that state?)
    * Yes, but we are not sure in which scenario this would be needed. Is there a class instance that has to be shared between tasks? If so, keeping that shared state in a database seems more reasonable.
* Question:
    * Will the data-preprocessing code be provided?

## Data Analyzer / Data & Statistics Dashboard

[My Notion: Evidently](https://dasbd72.notion.site/Data-Analysis-Tool-1d5d31d68ada4525adf6162b333ab360)

### Q&A

- The planned analysis methods for detecting and remediating data issues
    1. Anomaly data
    2. Data drift
    - The term "data remediation" here refers to any method, such as parameter transformation or extraction, that could mitigate the influence of biased trend data during model training
- The planned method or direction for optimizing query performance
- More study updates on the applications of *Evidently*

### Evidently Components

Metric
- Values or a graph

Report
- Combination of different metrics
- Can output JSON, HTML, or render in a Jupyter notebook

Test
- A metric with a condition

Test suites
- A pack of many tests

Presets
- A pack of tests or metrics for a particular usage

Snapshot
- JSON version of a report or test suite

Evidently Monitor
- A workspace contains multiple projects
- A workspace is a folder
    - Local file system
    - HDFS

Projects
- A project contains a series of snapshots

### Issues found

- Spark mode is only supported for drift-related metrics and tests
    - Other functions can be used after converting the data into a pandas dataframe (see the sketch after this list)
    - Or we implement our own methods
- Does not support multi-target
  ![image](https://hackmd.io/_uploads/BJ5O9DUeA.png)
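Since Spark support is limited to the drift metrics, here is a minimal sketch of the pandas fallback mentioned above. It assumes Evidently's `Report` / preset API (around 0.4.x) and PySpark; the HDFS paths and the choice of `DataQualityPreset` are placeholders, not our actual data or final metric set.

```python=
# Sketch: run a non-drift Evidently preset by converting Spark data to pandas first.
# Paths and column choices are placeholders.
from pyspark.sql import SparkSession

from evidently.report import Report
from evidently.metric_preset import DataQualityPreset

spark = SparkSession.builder.appName("evidently-pandas-fallback").getOrCreate()

# Reference = an older snapshot, current = the latest snapshot (hypothetical HDFS paths).
reference = spark.read.parquet("hdfs:///data/deepdata/reference").toPandas()
current = spark.read.parquet("hdfs:///data/deepdata/current").toPandas()

# DataQualityPreset has no Spark backend, so it only runs on pandas dataframes.
report = Report(metrics=[DataQualityPreset()])
report.run(reference_data=reference, current_data=current)

report.save_html("data_quality_report.html")  # or report.json() for a snapshot
```

For large tables we would likely need to sample or aggregate before `toPandas()`, which ties into the Spark-performance question below.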
### Evidently data drift methods

Reference: https://www.evidentlyai.com/blog/data-drift-detection-large-datasets#how-do-tests-perform-on-real-world-data-2

Below is a general description of each method's sensitivity, not its behavior in every individual case.

Kolmogorov-Smirnov (KS) test
- Sensitive on large datasets

Population Stability Index (PSI)
- Less sensitive on small datasets
- Always 0 or above
    - PSI < 0.1: no significant population change
    - 0.1 ≤ PSI < 0.2: moderate population change
    - PSI ≥ 0.2: significant population change

Kullback-Leibler divergence (KL)
- Does not depend on the sample size
- Sensitivity similar to PSI
- Infinite: ranges from 0 to infinity
- Asymmetric: drift sizes cannot be compared with each other

Jensen-Shannon divergence
- Similar to KL
- Finite: ranges from 0 to 1
- Symmetric

Wasserstein distance (Earth-Mover Distance)
- More sensitive than PSI
- Less sensitive than KS

### Outliers

Types

Univariate outliers
- IQR
- Z-score

Multivariate outliers
- Mahalanobis distance

Implementation
https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_make_custom_metric_and_test.ipynb

IQR
- Implemented an Evidently-style custom metric and test for columns

### Tests

DataDriftTestPreset
- TestShareOfDriftedColumns
    - < 1/3
- TestColumnDrift
    - Expect no drift

DataQualityTestPreset
- TestColumnShareOfMissingValues
    - < 10% missing values
- TestMostCommonValueShare
- TestNumberOfConstantColumns
- TestNumberOfDuplicatedColumns
- TestNumberOfDuplicatedRows
- TestHighlyCorrelatedColumns

DataStabilityTestPreset
- TestNumberOfRows
- TestNumberOfColumns
- TestColumnsType
    - Compares column types against the reference
- TestColumnShareOfMissingValues
    - < 10% missing values
- TestShareOfOutRangeValues
    - For numerical columns
- TestShareOfOutListValues
    - For categorical columns
- TestMeanInNSigmas

TestColumnQuantile
- Compares the value of the q quantile between the reference and current data

### About Spark performance

How large is the data?

## Model Versioning

* More introduction on the application of "Model Deployment" in the MLflow platform
    * ![image](https://hackmd.io/_uploads/Bk98_QVl0.png)
    * MLflow only offers an inference server and lacks the capability to monitor the deployment status of individual machines.
    * Possible use case: one machine serves as the inference server, and the other machines simply curl the server to obtain prediction results (see the sketch at the end of this section).
        * Pros: eliminates the need to load the model locally, and every machine uses the same version of the model.
        * Cons: every machine has to use the same version of the model.
* Question: What kind of data should we save in MLflow? (metrics/parameters/tags/reports...)
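A rough sketch of the curl-style use case above, assuming MLflow 2.x: one host runs the built-in scoring server (e.g. `mlflow models serve -m "models:/<model_name>/<version>" --host 0.0.0.0 --port 5001`), and other machines post feature rows to its `/invocations` endpoint. The host name, port, model URI, and feature names below are placeholders.

```python=
# Sketch: query a shared MLflow inference server from another machine.
# Assumes the MLflow 2.x scoring protocol; host, port, and feature names are placeholders.
import requests

SERVING_URL = "http://mlflow-serving.internal:5001/invocations"  # placeholder host

payload = {
    # MLflow 2.x accepts a pandas dataframe in "split" orientation.
    "dataframe_split": {
        "columns": ["feature_a", "feature_b"],
        "data": [[1.0, 2.0], [3.0, 4.0]],
    }
}

resp = requests.post(SERVING_URL, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. {"predictions": [...]}
```

Pointing the server at a registry stage or alias rather than a fixed version would let every client pick up a new model version without changing their own code, which is where the pro/con about sharing one model version comes from.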
## Model Monitoring

* How to integrate the following data sources with Prometheus for data monitoring and alerting applications in Grafana:
    * mlflow -> evidently monitor?
        * Since Evidently can calculate scores, MSE, and data drift scores, we currently plan to monitor Evidently's output.
        * The official Evidently documentation gives an example of how to integrate Grafana with Evidently (https://github.com/evidentlyai/evidently/tree/main/examples/integrations/grafana_monitoring_service).
            * However, the tutorial is based on an old version of Evidently. A rough sketch against a newer API is at the end of this section.
    * hadoop -> JMX exporter
        * This approach monitors the file system itself. We can see how many files there are, how many mkdir operations, delete operations, and so on in Grafana.
        * config.yaml

          ```yaml=
          hostPort: localhost:36892
          rules:
            - pattern: ".*"
          ```

          ```bash=
          java -jar jmx_prometheus_httpserver-0.20.0.jar 12345 config.yaml
          ```
        * hadoop-env.sh

          ```sh=
          export HDFS_NAMENODE_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=36892"
          ```
        * prometheus.yml

          ```yml=
          - job_name: hadoop-master
            static_configs:
              - targets: ['10.121.251.37:12345']
          ```
        * However, this approach cannot monitor specific column values.
            * We may use Atlas to achieve that kind of monitoring.
* Atlas
    * ![image](https://hackmd.io/_uploads/Hk_na8LlA.png)
* Question
    * What metrics should we monitor?
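As mentioned above, the official Grafana integration example targets an old Evidently version. Below is a minimal sketch of the direction we could take instead, assuming Evidently 0.4.x (`Report` + `DatasetDriftMetric`) and the `prometheus_client` library: compute drift periodically and expose the results on a scrape endpoint for Prometheus/Grafana. The `load_reference()` / `load_current()` helpers are placeholders, and the result-dict keys follow the 0.4 schema, so they may need adjusting to the exact Evidently version.

```python=
# Sketch: periodically compute Evidently drift metrics and expose them on a
# Prometheus scrape endpoint so Grafana can graph and alert on them.
# load_reference()/load_current() stand in for reading real data (e.g. from HDFS).
import time

import numpy as np
import pandas as pd
from evidently.metrics import DatasetDriftMetric
from evidently.report import Report
from prometheus_client import Gauge, start_http_server

drift_share_gauge = Gauge("evidently_share_of_drifted_columns", "Share of drifted columns")
dataset_drift_gauge = Gauge("evidently_dataset_drift", "1 if dataset drift detected, else 0")


def load_reference() -> pd.DataFrame:
    # Placeholder: in practice, read the reference window from remote storage.
    return pd.DataFrame({"feature_a": np.random.normal(0.0, 1.0, 1000)})


def load_current() -> pd.DataFrame:
    # Placeholder: in practice, read the latest production data.
    return pd.DataFrame({"feature_a": np.random.normal(0.5, 1.0, 1000)})


def update_drift_metrics() -> None:
    report = Report(metrics=[DatasetDriftMetric()])
    report.run(reference_data=load_reference(), current_data=load_current())
    # Key names follow the Evidently 0.4 result schema and may differ by version.
    result = report.as_dict()["metrics"][0]["result"]
    drift_share_gauge.set(result["share_of_drifted_columns"])
    dataset_drift_gauge.set(1.0 if result["dataset_drift"] else 0.0)


if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes this port (placeholder)
    while True:
        update_drift_metrics()
        time.sleep(3600)  # recompute hourly
```

Prometheus would scrape this endpoint the same way as the JMX exporter above (another `job_name` in prometheus.yml), and Grafana alert rules could then fire on `evidently_dataset_drift == 1` or on a threshold for the drifted-column share.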