# 2024.04.12 Proposal
[TOC]
## ML Pipeline
* Backend database: MySQL vs **PostgreSQL**
* 80% of Airflow developers use PostgreSQL, and the share is increasing every year.

* https://airflow.apache.org/blog/airflow-survey-2022/#which-metadata-database-do-you-use-single-choice
* Airflow Pipeline overview:
* Each component (deepdata, mlflow, alert server...) has its own Airflow DAG.
* DAGs are combined into an overall DAG using **TriggerDagRunOperator**.
* To pass large data (training data, models, ...) between tasks, use remote storage (S3, HDFS, Kafka...) to cache it; a minimal sketch follows below. [REF](https://ithelp.ithome.com.tw/articles/10329530)
* We will first try to use our own HDFS; we are not sure yet whether it can support production. (S3 is the most commonly used elsewhere.)
* In deepdata, shorten each pipeline to reduce overhead in uploading/downloading to remote storage.
* Combine data-preprocess and data-split into one data-preprocessing task
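* A minimal sketch of the remote-storage caching pattern, assuming an fsspec-compatible store (s3fs shown; an HDFS driver would work the same way) and a hypothetical staging path; the large artifact is cached remotely and only its path is passed between tasks via XCom:
```python=
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

STAGING = "s3://ml-staging/deepdata"  # hypothetical remote cache location

def preprocess(**context):
    df = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})  # placeholder for real preprocessing
    path = f"{STAGING}/{context['ds']}/train.parquet"
    df.to_parquet(path)  # cache the large artifact in remote storage
    return path          # only the small path goes through XCom

def train(**context):
    path = context["ti"].xcom_pull(task_ids="preprocess")
    df = pd.read_parquet(path)  # pull the cached artifact back down
    ...                         # train on df

with DAG("remote_cache_example", start_date=datetime(2024, 4, 1), schedule_interval=None) as dag:
    preprocess_task = PythonOperator(task_id="preprocess", python_callable=preprocess)
    train_task = PythonOperator(task_id="train", python_callable=train)
    preprocess_task >> train_task
```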
* Per discussion, are there any more practical examples of applications rather than just independent function calls?
* https://medium.com/thefork/a-guide-to-mlops-with-airflow-and-mlflow-e19a82901f88
* https://zhuanlan.zhihu.com/p/553395304
* If we want to support several different training methods, do we have to build many duplicate DAGs?
* Not necessarily: the function inside the training task can branch on the model and apply a different training method for each one, as shown below.
```python=
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 4, 1),
}

def train_model(model_name):
    # Branch on the model name to apply a different training method per model.
    if model_name == 'model_a':
        ...
    elif model_name == 'model_b':
        ...
    else:
        ...

dag = DAG(
    'dynamic_training',
    default_args=default_args,
    schedule_interval=None,
)

train_model_task = PythonOperator(
    task_id='train_model',
    python_callable=train_model,
    # Pass model_name from the DAG Run configuration (e.g. `airflow dags trigger --conf`).
    op_kwargs={'model_name': '{{ dag_run.conf["model_name"] }}'},
    dag=dag,
)
```
* If the code called between tasks is not a plain function but a method on a large class, with shared class variables, is there a way to maintain these public/private class variables across different tasks? (How much can be maintained?)
* Yes, but it is not clear in which scenario this would be needed. Is there a class instance that needs to be shared across tasks? In that case, maintaining that state in a database seems more reasonable, since each task runs in its own process.
* Question:
* Will the data-preprocessing code be provided?
## Data Analyzer / Data & Statistics Dashboard
[My Notion: Evidently](https://dasbd72.notion.site/Data-Analysis-Tool-1d5d31d68ada4525adf6162b333ab360)
### Q&A
- The planned analysis methods for data issue detection and remediation
1. Anomaly data
2. Data drift
- The term "data remediation" here refers to any methods, such as parameter transformations or extractions, that could mitigate the influence of biased trend data during model training
- The planned method or direction for optimizing query performance
- More study updates on the applications of *Evidently*
### Evidently Components
Metric
- Values or graph
Report
- Combination of different metrics
- Can output json, html, or show in a Jupyter notebook
Test
- Metric with condition
Test suites
- Pack of many tests
Presets
- Pack of tests or metrics for particular usage
Snapshot
- JSON version of report or test suite
Evidently Monitor
- A workspace contains multiple projects
- A workspace is a folder
- Local file system
- HDFS
Projects
- A project contains a series of snapshots
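A short sketch tying these components together (assumes a recent Evidently 0.4.x API; `ref_df`/`cur_df` are placeholder pandas DataFrames and the workspace folder is hypothetical):
```python=
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
from evidently.ui.workspace import Workspace

ref_df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [0, 1, 0, 1]})  # reference data (placeholder)
cur_df = pd.DataFrame({"x": [2, 3, 4, 9], "y": [1, 1, 0, 0]})  # current data (placeholder)

# Report = combination of metrics; presets bundle metrics for a particular usage.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_df, current_data=cur_df)
report.save_html("drift_report.html")  # or report.json(), or show in a Jupyter notebook

# Workspace = a folder holding projects; a project holds a series of snapshots.
ws = Workspace.create("evidently_workspace")
project = ws.create_project("demo_project")
ws.add_report(project.id, report)      # stored as a snapshot (the JSON form of the report)
```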
### Issues found
- Spark mode is only supported for drift-related metrics and tests
- Other functions can be used after converting the data into a pandas DataFrame
- Or implement our own methods
- Does not support multi-target

### Evidently data drift methods
Reference
https://www.evidentlyai.com/blog/data-drift-detection-large-datasets#how-do-tests-perform-on-real-world-data-2
Below is a general description of each method's sensitivity, not a per-case comparison.
Kolmogorov-Smirnov (KS) test
- Sensitive on large datasets
Population Stability Index (PSI)
- Less sensitive on small datasets
- Values range from 0 upward:
- PSI < 0.1: no significant population change
- 0.1 ≤ PSI < 0.2: moderate population change
- PSI ≥ 0.2: significant population change
Kullback-Leibler divergence (KL)
- Does not depend on the size of a sample
- Sensitivity similar to PSI
- Unbounded: ranges from 0 to infinity
- Asymmetric: values cannot be directly compared with each other to judge drift size
Jensen-Shannon divergence
- Similar to KL
- Finite: Range from 0 to 1
- Symmetric
Wasserstein distance (Earth-Mover Distance)
- More sensitive than PSI
- Less sensitive than KS
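As a concrete reference for the PSI thresholds above, a standalone NumPy sketch of the PSI computation (this is not Evidently's implementation; bin edges come from the reference data, and values outside that range are ignored here):
```python=
import numpy as np

def psi(reference, current, bins=10, eps=1e-6):
    # PSI = sum over bins of (cur% - ref%) * ln(cur% / ref%)
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    cur_pct = np.histogram(current, bins=edges)[0] / len(current) + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, 10_000)
cur = rng.normal(0.5, 1.0, 10_000)  # shifted distribution; PSI grows with the shift
print(psi(ref, cur))                # typically above 0.2 for this shift -> "significant change"
```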
### Outliers
Types
Univariate Outliers
- IQR
- Z-Score
Multivariate Outliers
- Mahalanobis distance
Implementation
https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_make_custom_metric_and_test.ipynb
IQR
- Implemented an Evidently-style custom metric and test for columns, following the notebook above; a plain pandas sketch of the underlying rules follows.
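A plain pandas/NumPy sketch of the three rules above (the underlying math only, not the Evidently custom-metric wrapper from the referenced notebook):
```python=
import numpy as np
import pandas as pd

def iqr_outlier_share(s: pd.Series, k: float = 1.5) -> float:
    # Univariate: flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return float(((s < q1 - k * iqr) | (s > q3 + k * iqr)).mean())

def zscore_outlier_share(s: pd.Series, z: float = 3.0) -> float:
    # Univariate: flag values more than z standard deviations from the mean.
    return float(((s - s.mean()).abs() / s.std() > z).mean())

def mahalanobis_outlier_share(df: pd.DataFrame, threshold: float = 3.0) -> float:
    # Multivariate: flag rows whose Mahalanobis distance from the centroid exceeds the threshold.
    x = df.to_numpy(dtype=float) - df.to_numpy(dtype=float).mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(x, rowvar=False))
    d = np.sqrt(np.einsum("ij,jk,ik->i", x, cov_inv, x))
    return float((d > threshold).mean())
```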
### Tests
DataDriftTestPreset
- TestShareOfDriftedColumns
- <1/3
- TestColumnDrift
- Expect no drift
DataQualityTestPreset
- TestColumnShareOfMissingValues
- <10% of missing values
- TestMostCommonValueShare
- TestNumberOfConstantColumns
- TestNumberOfDuplicatedColumns
- TestNumberOfDuplicatedRows
- TestHighlyCorrelatedColumns
DataStabilityTestPreset
- TestNumberOfRows
- TestNumberOfColumns
- TestColumnsType
- Compares types against reference
- TestColumnShareOfMissingValues
- <10% of missing values
- TestShareOfOutRangeValues
- For numerical columns
- TestShareOfOutListValues
- For categorical columns
- TestMeanInNSigmas
TestColumnQuantile
- Compares the value of the q-th quantile between reference and current data
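A sketch of assembling these presets and individual tests into one suite (assumes a recent Evidently 0.4.x API; the column name "price" and the thresholds are illustrative, and `ref_df`/`cur_df` are the placeholder DataFrames from the report example above):
```python=
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset, DataQualityTestPreset, DataStabilityTestPreset
from evidently.tests import TestShareOfDriftedColumns, TestColumnQuantile

suite = TestSuite(tests=[
    DataDriftTestPreset(),
    DataQualityTestPreset(),
    DataStabilityTestPreset(),
    TestShareOfDriftedColumns(lt=1 / 3),                    # fail when >= 1/3 of columns drift
    TestColumnQuantile(column_name="price", quantile=0.5),  # compare the median against the reference
])
suite.run(reference_data=ref_df, current_data=cur_df)
suite.save_html("test_suite.html")
```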
### About Spark performance
How large is the data?
## Model Versioning
* More introduction on the application of “Model Deployment” in the MLflow platform
* MLflow only offers an inference server and lacks the capability to monitor the deployment status of individual machines.
* Possible use case: one machine serves as an inference server; other machines can simply curl the server to obtain prediction results.
* Pros: eliminates the need to load the model locally; every machine uses the same version of the model.
* Cons: every machine is forced onto the same model version; there is no per-machine versioning.
* Question: What kind of data should we save in mlflow? (metrics/parameters/tag/report...)
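* A minimal sketch of the shared-inference-server use case (assumes MLflow 2.x and a hypothetical registered model `demo_model`; the server would be started separately with `mlflow models serve -m "models:/demo_model/1" --host 0.0.0.0 --port 5001` on the serving machine):
```python=
import requests

# MLflow 2.x scoring server payload format ("dataframe_split" orientation).
payload = {"dataframe_split": {"columns": ["x1", "x2"], "data": [[1.0, 2.0]]}}

resp = requests.post(
    "http://inference-host:5001/invocations",  # hypothetical serving host
    json=payload,
    timeout=10,
)
print(resp.json())  # predictions computed by the single shared model version
```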
## Model Monitoring
* How to integrate the following data sources with Prometheus for data monitoring and alerting applications in Grafana:
* mlflow -> evidently monitor?
* Since Evidently can calculate scores such as MSE and data drift scores, we currently decide to monitor Evidently's output.
* The official Evidently documentation gives an example of how to integrate Grafana with Evidently. (https://github.com/evidentlyai/evidently/tree/main/examples/integrations/grafana_monitoring_service)
* However, the tutorial is based on an old version of Evidently; a sketch of an alternative approach follows below.
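* One possible alternative to the outdated tutorial, sketched here: recompute a drift report on a schedule and expose the drifted-column share as a Prometheus gauge for Grafana (assumes `prometheus_client` and a recent Evidently version; the metric name, port, and result-field name are assumptions to verify against the installed version).
```python=
import time

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
from prometheus_client import Gauge, start_http_server

drift_share = Gauge("evidently_drift_share", "Share of drifted columns")

def compute_drift_share(ref_df, cur_df) -> float:
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=ref_df, current_data=cur_df)
    for metric in report.as_dict()["metrics"]:
        # DataDriftPreset includes DatasetDriftMetric; the field name is assumed from recent versions.
        if metric["metric"] == "DatasetDriftMetric":
            return float(metric["result"]["share_of_drifted_columns"])
    return 0.0

if __name__ == "__main__":
    start_http_server(9108)  # Prometheus scrapes this port
    while True:
        # ref_df / cur_df would be loaded from HDFS/S3 here before updating the gauge:
        # drift_share.set(compute_drift_share(ref_df, cur_df))
        time.sleep(300)
```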
* hadoop -> jmx exporter
* This approach monitors the file system itself: we can monitor the number of files, mkdir operations, delete operations, etc. in Grafana.
* config.yaml
```yaml=
hostPort: localhost:36892
rules:
- pattern: ".*"
```
```bash=
java -jar jmx_prometheus_httpserver-0.20.0.jar 12345 config.yaml
```
* hadoop-env.sh
```sh=
export HDFS_NAMENODE_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=36892"
```
* prometheus.yml
```yml=
- job_name: hadoop-master
  static_configs:
    - targets: ['10.121.251.37:12345']
```
* However, this approach cannot monitor specific column values.
* We may use Atlas to achieve that kind of monitoring.
* Atlas
* Question
* What metrics should we monitor?