# 2024.09.06 Proposal
## ML-Pipeline
### Error Handling and Logging
* Use the Airflow UI to inspect task logs
* Wrap task logic in try/except
    * Print the error message -> Airflow will log this message automatically
    * Re-raise the exception so the task is marked as failed
* Airflow provides built-in alerting, retry, and `on_failure_callback` mechanisms (see the sketch below)
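A minimal sketch of this pattern, assuming Airflow 2.4+; the DAG/task names and the alert hook are hypothetical placeholders, not our actual pipeline:
```python=
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Hypothetical alert hook: wire this to mail/Slack (see Todo below).
    print(f"Task {context['task_instance'].task_id} failed: {context['exception']}")


def run_step():
    try:
        ...  # actual task logic goes here
    except Exception as exc:
        # Anything printed here ends up in the Airflow task log.
        print(f"step failed: {exc}")
        raise  # re-raise so Airflow marks the task as failed


with DAG(
    dag_id="example_pipeline",  # hypothetical name
    start_date=datetime(2024, 9, 1),
    schedule=None,
    catchup=False,
    default_args={
        "retries": 2,                               # built-in retry
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": notify_on_failure,   # built-in error callback
    },
) as dag:
    PythonOperator(task_id="step", python_callable=run_step)
```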
### Todo
* Airflow alert:
    * Send a mail or Slack notification on failure
* Periodic log cleanup:
    * Write a DAG that cleans logs up on a schedule? (see the sketch below)
    * To be surveyed
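One possible shape for that cleanup DAG; the log path and 30-day retention below are assumptions, not a settled design:
```python=
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cleanup_airflow_logs",  # hypothetical name
    start_date=datetime(2024, 9, 1),
    schedule="@weekly",
    catchup=False,
) as dag:
    BashOperator(
        task_id="delete_old_logs",
        # Assumed log root; delete files older than 30 days, then empty dirs.
        bash_command=(
            "find /opt/airflow/logs -type f -mtime +30 -delete && "
            "find /opt/airflow/logs -type d -empty -delete"
        ),
    )
```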
## Model Versioning
### Cython
* On the problem of the .so file not being loadable:
    * The Python version at compile time must exactly match the Python version at run time, otherwise the .so cannot be imported (the compiled extension embeds a CPython ABI tag, e.g. `.cpython-39-...so`)
* Nested cythonized functions cannot be serialized by pickle (see the sketch below)
    * The official workaround we have seen is to turn every function into a global (module-level) function, which is not really feasible for us [link](https://github.com/cython/cython/issues/2619)
* Cython is therefore probably better suited to running locally; for mechanisms like spark-submit (where functions especially need to be serialized), it does not work
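A pure-Python illustration of the same failure mode (the cythonized case fails analogously): pickle serializes a function by its importable qualified name, so nested functions cannot be pickled while module-level ones can.
```python=
import pickle


def make_adder(n):
    def add(x):  # nested function: has no importable top-level name
        return x + n
    return add


def add_one(x):  # module-level function: picklable by reference to its name
    return x + 1


try:
    pickle.dumps(make_adder(1))
except (pickle.PicklingError, AttributeError) as exc:
    print(f"cannot pickle nested function: {exc}")

print(len(pickle.dumps(add_one)))  # works
```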
## Model Monitoring
* How to set up Promtail & Loki?
    * In the MLOps project we deploy our components on k8s, so I deploy Promtail and Loki on k8s with the Helm chart:
```bash=
helm upgrade --install loki grafana/loki-stack -f ../yamls/loki-stack-values.yaml -n default --set loki.image.tag=2.9.3
```
Note: if we don't set the image tag, Grafana can't add Loki as a data source.
* What is the internal mechanism, and what are its useful features?
    * Promtail collects logs according to the user's configuration (in our case, set via `loki-stack-values.yaml`) and then pushes them to Loki.
    * Since the Promtail/Loki stack is developed by Grafana Labs, Grafana panels are easy to configure once Loki is added as a data source.
* How does Promtail collect logs from all the services deployed in each container?
    * As mentioned above, we configure Promtail so it knows where to collect logs from.
    * We also grant Promtail RBAC permissions so it can watch/list pods on the node (see the `rbac` block below).
**loki-stack-values.yaml**
```yaml=
loki:
  enabled: true
  service:
    type: NodePort
    nodePort: 30100  # You can choose any port between 30000-32767
  persistence:
    enabled: true
    size: 10Gi  # Adjust size as needed
  memberlist:
    join_members:
      - loki-memberlist.default.svc.cluster.local
  # config:
  #   memberlist:
  #     join_members:
  #       - loki-memberlist
promtail:
  enabled: true
  config:
    snippets:
      extraScrapeConfigs: |
        - job_name: kubernetes-pods-specific-services
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # - source_labels: [__meta_kubernetes_pod_label_app, __meta_kubernetes_pod_label_app_kubernetes_io_name]
            #   regex: (alert-server|hadoop-monitor|metric-collector)
            #   action: keep
            - source_labels: [__meta_kubernetes_pod_label_app, __meta_kubernetes_pod_label_app_kubernetes_io_name]
              target_label: app
            - source_labels: [__meta_kubernetes_namespace]
              target_label: namespace
            - source_labels: [__meta_kubernetes_pod_name]
              target_label: pod
  extraArgs:
    - -config.expand-env=true
  rbac:  # give Promtail permission
    create: true
    pspEnabled: false
  extraRules:
    - apiGroups: [""]
      resources: ["pods", "nodes", "nodes/proxy"]
      verbs: ["get", "list", "watch"]
grafana:
  enabled: false  # since we already have Grafana
```
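A quick way to sanity-check that logs are flowing, assuming Loki's HTTP API is reachable on the NodePort (30100) configured above; the host and label selector are placeholders:
```python=
import requests

LOKI_URL = "http://localhost:30100"  # assumption: NodePort from the values above

resp = requests.get(
    f"{LOKI_URL}/loki/api/v1/query_range",
    params={"query": '{namespace="default"}', "limit": 5},
)
resp.raise_for_status()
for stream in resp.json()["data"]["result"]:
    # Each entry carries the stream's labels and its matched log lines.
    print(stream["stream"], f"-> {len(stream['values'])} lines")
```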
## Stats API
Refactored:
`hadoop-services/data_service/src/routers/query.py`

Added:
`hadoop-services/data_service/src/task/query/stats_feature_v2.py`, which defines:
* `QueryStatsFeatureBaseV2`
* `QueryStatsFeatureV2`
* `QueryStatsFeatureGroupV2`