# ASMPT

## alert server

```bash=
curl -X POST -H "Content-Type: application/json" -d '{ "commonLabels": { "dag_id": "model_update", "lot_ids": 5908, "dates": "2024-08-09" } }' http://10.121.252.189:30888/api/v1/alerts

curl -X POST -H "Content-Type: application/json" -d '{ "dag_id": "model_update", "run_id": "model_update_2024-08-22T09:39:37.812110Z" }' http://10.121.252.189:30888/api/v1/dags/runs

curl -X POST -H "Content-Type: application/json" -d '{ "commonLabels": { "dag_id": "data_analyzer", "lot_ids": 5908, "dates": "2024-08-09" } }' http://10.121.252.189:30888/api/v1/alerts

# get dagruns
curl -X POST -H "Content-Type: application/json" -d '{ "dag_id": "data_analyzer", "run_id": "data_analyzer_2024-08-22T09:26:23.667200Z" }' http://10.121.252.189:30888/api/v1/dags/runs

# get result
http://10.121.252.189:30888/api/v1/dags/driftreport?dag_id=data_analyzer&run_id=9
```

## Grafana movement

```bash=
curl -X GET -H "Content-Type: application/json" http://10.121.252.189:30890/list_alert >> /home/jerry2024/MLOps-ASMPT/model-monitoring/examples/templates/alerts.json

# example alert ids returned by list_alert: bdu7ejskde5fkf, ddu7ejs6my8zka
curl -X POST -H "Content-Type: application/json" -d '{"alert_id": "bdu7ejskde5fkf"}' http://10.121.252.189:30890/delete_alert

curl -X POST -H "Content-Type: application/json" -d @../examples/templates/Hadoop.json http://172.18.0.2:30890/modify_dashboard

curl -X GET http://10.121.252.184:3000/api/search?query=Hadoop \
  -H "Authorization: Bearer glsa_0YTZLoVuQ8OWcMFdVvi0ev65MtqQLElT_e94148c8" \
  -H "Content-Type: application/json"

curl -X GET http://10.121.252.184:3000/api/dashboards/uid/ae8da2e5-e12a-4842-b244-e59f452d661e \
  -H "Authorization: Bearer glsa_0YTZLoVuQ8OWcMFdVvi0ev65MtqQLElT_e94148c8" \
  -H "Content-Type: application/json" >> /home/jerry2024/MLOps-ASMPT/model-monitoring/examples/templates/Hadoop.json
```

## Demo

### data analysis

```bash=
# trigger
curl -X POST -H "Content-Type: application/json" -d '[{ "labels": { "dag_id": "data_analysis", "lot_id": "ATWLOT-020123-0343-136-001", "date": "2023-02-01" } }]' http://10.121.252.194:5888/api/v1/alerts
```

### model_evaluate

```bash=
# trigger
curl -X POST -H "Content-Type: application/json" -d '[{ "labels": { "dag_id": "model_evaluate" } }]' http://10.121.252.194:5888/api/v1/alerts

# get dagruns
curl -X POST -H "Content-Type: application/json" -d '{ "dag_id": "model_evaluate", "run_id": "1719565873" }' http://10.121.252.194:5888/api/v1/dags/runs
# check whether the state is success / running

# get result
curl -X POST -H "Content-Type: application/json" -d '{ "dag_id": "model_evaluate", "run_id": "1719548649" }' http://10.121.252.194:5888/api/v1/dags/response
```

### collector

```bash=
# list alerts
curl -X GET http://10.121.252.194:5222/list_alert >> /home/jerry2024/MLOps-ASMPT/model-monitoring/examples/logs/alerts.json

# list panels
curl -X POST -H "Content-Type: application/json" -d '{"dashboard": "MLOPS"}' http://10.121.252.194:5222/list_panel >> /home/jerry2024/MLOps-ASMPT/model-monitoring/examples/logs/panels.json

# create alert
curl -X POST http://10.121.252.194:5222/create_alert \
  -H "Content-Type: application/json" \
  -d @/home/jerry2024/MLOps-ASMPT/model-monitoring/examples/templates/alert.json

# delete alert
curl -X POST -H "Content-Type: application/json" -d '{"alert_id": "bd89b397-0b5b-4b7b-9676-e9446c6354a1"}' http://10.121.252.194:5222/delete_alert

# modify dashboards
curl -X POST -H "Content-Type: application/json" -d @/home/jerry2024/MLOps-ASMPT/model-monitoring/examples/templates/dashboard.json http://10.121.252.194:5222/modify_dashboard

# delete dashboards
curl -X POST -H "Content-Type: application/json" -d '{"dashboard": "DEMO"}' http://10.121.252.194:5222/delete_dashboard
```
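The `/dags/runs` endpoint only reports the state at the moment of the call, so waiting for a run to finish means re-issuing the request. A minimal polling sketch, assuming the response is JSON with a top-level `state` field (as the success / running note above suggests):

```bash=
# poll a model_evaluate run until it leaves the "running" state
# (assumes the /dags/runs response is JSON with a top-level "state" field)
while true; do
  state=$(curl -s -X POST -H "Content-Type: application/json" \
    -d '{ "dag_id": "model_evaluate", "run_id": "1719565873" }' \
    http://10.121.252.194:5888/api/v1/dags/runs | jq -r '.state')
  echo "state: $state"
  [ "$state" != "running" ] && break
  sleep 10
done
```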
## Architecture

* 20 cores, 16 GB RAM, 256 GB disk, Ubuntu 23.04 live server
* 10.121.252.7
* ports:
    * grafana 3000
    * prometheus 9090
    * mlflow 8080
    * http-server (mlflow exporter) 8001
    * airflow 8000

## API References

### mlflow exporter (to update metrics)

#### usage

```bash=
curl -X GET http://10.121.252.7:5111/update_score
```

#### expected result

```bash=
"Score updated successfully", 200
```

### alert service (to trigger dag)

#### usage

* The action type can be model_evaluate, model_update, or data_analysis.

```bash=
curl -X POST -H "Content-Type: application/json" -d '[{ "labels": { "dag_id": "data_analysis", "lot_id": "ATWLOT-020123-0343-136-001", "date": "2023-02-01" } }]' http://10.121.252.7:5888/api/v1/alerts

curl -X POST -H "Content-Type: application/json" -d '[{ "labels": { "dag_id": "model_evaluate" } }]' http://10.121.252.7:5888/api/v1/alerts

curl -X POST -H "Content-Type: application/json" -d '{ "dag_id": "data_analysis", "run_id": "" }' http://10.121.252.7:5888/api/v1/dags/response

curl -X POST -H "Content-Type: application/json" -d '{ "dag_id": "model_evaluate", "run_id": "1718183652" }' http://10.121.252.7:5888/api/v1/dags/runs

curl -X POST -H "Content-Type: application/json" -d '{"run_id": "17174832558"}' http://10.121.252.7:5888/api/v1/dags/response/put
```

#### expected result

```bash=
"Alert received successfully!", 200
```

#### Todos

* see the result of each dag
* may need to pass more data to the dag

### Collectors (control grafana)

```bash=
# list alerts
curl -X GET http://10.121.252.7:5222/list_alert

# list panels
curl -X POST -H "Content-Type: application/json" -d '{"dashboard": "HDFS"}' http://10.121.252.7:5222/list_panel

# delete alerts
curl -X POST -H "Content-Type: application/json" -d '{"alert_id": "a4e4c499-e7c0-445e-bbf5-d7972b3154d3"}' http://10.121.252.7:5222/delete_alert

# create alerts
curl -X POST http://10.121.252.7:5222/create_alert \
  -H "Content-Type: application/json" \
  -d @alert_template.json

# modify dashboards
curl -X POST -H "Content-Type: application/json" -d @/home/jerry2024/MLOps-ASMPT/model-monitoring/examples/templates/dashboard.json http://10.121.252.7:5222/modify_dashboard

# delete dashboards
curl -X POST -H "Content-Type: application/json" -d '{"dashboard": "HDFS"}' http://10.121.252.7:5222/delete_dashboard

# --- test commands ---
curl -s -X 'GET' -u admin:admin 'http://10.121.252.7:3000/api/v1/provisioning/alert-rules/export' -H 'accept: application/json' | jq --sort-keys '.groups[].rules[]' > process_memory_copy_group_rules.json

curl -s -X 'GET' -u admin:admin 'http://10.121.252.7:3000/api/folders' -H 'accept: application/json'

curl -s -X 'GET' -u admin:admin 'http://10.121.252.7:3000/api/search?query=HDFS' -H 'accept: application/json'

# get
curl -s -X 'GET' -u admin:admin 'http://10.121.252.7:3000/api/dashboards/uid/dd75c664-4c59-417f-a388-ba918a0cf820' -H 'accept: application/json' > dashboard_template2.json

curl -s -X POST -u admin:admin 'http://10.121.252.7:3000/api/dashboards/db' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d @dashboard_template.json
```

## Project Deployments

### Hive inotify

* code
    * https://github.com/kevin1010607/MLOps-ASMPT/tree/model-monitor/model-monitoring/examples/inotify
* prometheus.yml (/etc/prometheus/prometheus.yml)

```bash=
  - job_name: 'hive-inotify'
    static_configs:
      - targets: ['10.121.240.106:9605']
```

```bash=
sudo systemctl reload prometheus
bash Startall.sh
```
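After reloading, you can confirm that Prometheus picked up the new scrape target via its targets API (a quick check, assuming Prometheus listens on the default port 9090 on that host):

```bash=
# show the health of the hive-inotify target ("up" once scraping works)
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job == "hive-inotify") | .health'
```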
### ALERT SERVER

#### build

```bash=
docker build -t johnson684/alert-server:latest .
docker push johnson684/alert-server:latest
```

#### Pods

```bash=
kubectl delete -f alert_server.yaml
kubectl apply -f alert_server.yaml
```

### Metrics collector & mlflow-exporter

* python metric-exporter.py
* python collector.py

---

## Installations

### kafka

Based entirely on the quick-start tutorial: https://kafka.apache.org/quickstart

```bash=
bash kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic hive_table_events
```

### kafka-python

```bash=
pip install git+https://github.com/dpkp/kafka-python.git
```

### qemu

```bash=
sudo apt-get install qemu-guest-agent
systemctl start qemu-guest-agent
```

### Docker

#### install

```bash=
# remove unofficial packages
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

# install the latest version
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# verify
sudo docker run hello-world
```

#### permission denied

* permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/json": dial unix /var/run/docker.sock: connect: permission denied

```bash=
sudo chmod 666 /var/run/docker.sock
```

### Kind

#### install

```bash=
# For AMD64 / x86_64
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
# For ARM64
[ $(uname -m) = aarch64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-arm64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
```

#### kubectl

```bash=
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
```

### prometheus & grafana

```bash=
sudo apt install -y prometheus prometheus-node-exporter

sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y grafana-enterprise
sudo systemctl enable --now grafana-server
```
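A quick sanity check that both services came up (assuming the default ports, 9090 for Prometheus and 3000 for Grafana):

```bash=
# Prometheus liveness endpoint (returns a short "Healthy" message)
curl -s http://localhost:9090/-/healthy
# Grafana health endpoint (returns JSON including "database": "ok")
curl -s http://localhost:3000/api/health
```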
#### graphite

- used to export Spark metrics
- add the job below to prometheus.yml and then reload it
- mapping file:

```bash=
mappings:
- match: '*.*.executor.filesystem.*.*'
  name: spark_app_filesystem_usage
  labels:
    application: $1
    executor_id: $2
    fs_type: $3
    qty: $4
- match: '*.*.jvm.*.*'
  name: spark_app_jvm_memory_usage
  labels:
    application: $1
    executor_id: $2
    mem_type: $3
    qty: $4
- match: '*.*.executor.jvmGCTime.count'
  name: spark_app_jvm_gcTime_count
  labels:
    application: $1
    executor_id: $2
- match: '*.*.jvm.pools.*.*'
  name: spark_app_jvm_memory_pools
  labels:
    application: $1
    executor_id: $2
    mem_type: $3
    qty: $4
- match: '*.*.executor.threadpool.*'
  name: spark_app_executor_tasks
  labels:
    application: $1
    executor_id: $2
    qty: $3
- match: '*.*.BlockManager.*.*'
  name: spark_app_block_manager
  labels:
    application: $1
    executor_id: $2
    type: $3
    qty: $4
- match: '*.*.DAGScheduler.*.*'
  name: spark_app_dag_scheduler
  labels:
    application: $1
    executor_id: $2
    type: $3
    qty: $4
- match: '*.*.CodeGenerator.*.*'
  name: spark_app_code_generator
  labels:
    application: $1
    executor_id: $2
    type: $3
    qty: $4
- match: '*.*.HiveExternalCatalog.*.*'
  name: spark_app_hive_external_catalog
  labels:
    application: $1
    executor_id: $2
    type: $3
    qty: $4
- match: '*.*.*.StreamingMetrics.*.*'
  name: spark_app_streaming_metrics
  labels:
    application: $1
    executor_id: $2
    app_name: $3
    type: $4
    qty: $5
- match: '*.*.executor.filesystem.*.*'
  name: filesystem_usage
  labels:
    application: $1
    executor_id: $2
    fs_type: $3
    qty: $4
- match: '*.*.executor.threadpool.*'
  name: executor_tasks
  labels:
    application: $1
    executor_id: $2
    qty: $3
- match: '*.*.executor.jvmGCTime.count'
  name: jvm_gcTime_count
  labels:
    application: $1
    executor_id: $2
- match: '*.*.executor.*.*'
  name: executor_info
  labels:
    application: $1
    executor_id: $2
    type: $3
    qty: $4
- match: '*.*.jvm.*.*'
  name: jvm_memory_usage
  labels:
    application: $1
    executor_id: $2
    mem_type: $3
    qty: $4
- match: '*.*.jvm.pools.*.*'
  name: jvm_memory_pools
  labels:
    application: $1
    executor_id: $2
    mem_type: $3
    qty: $4
- match: '*.*.BlockManager.*.*'
  name: block_manager
  labels:
    application: $1
    executor_id: $2
    type: $3
    qty: $4
- match: '*.driver.DAGScheduler.*.*'
  name: DAG_scheduler
  labels:
    application: $1
    type: $2
    qty: $3
- match: '*.driver.*.*.*.*'
  name: task_info
  labels:
    application: $1
    task: $2
    type1: $3
    type2: $4
    qty: $5
```

```bash=
  - job_name: 'graphite_exporter'
    static_configs:
      - targets: ['10.121.251.37:9108']
```

- How to turn it on:

```bash=
./graphite_exporter --graphite.mapping-config=graphite_exporter_mapping
```

### jmx

```bash=
java -jar jmx_prometheus_httpserver-0.20.0.jar 12345 config.yaml
```

* config

```yaml=
hostPort: localhost:36892
rules:
- pattern: ".*"
```

* hadoop-env.sh

```sh=
export HDFS_NAMENODE_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=36892"
```

### py-venv

```bash=
sudo apt install python3-pip
sudo apt install python3-venv
(env) python -m ensurepip --default-pip
(env) pip install -r requirements.txt
```

## Background (tmux)

```bash=
source /home/jerry2024/deepdata/env/bin/activate
mlflow server --host 127.0.0.1 --port 8080
python prometheus.py
airflow webserver --port 8000
airflow scheduler
```
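One way to keep these running in the background is a dedicated tmux session, in the same spirit as the hadoop tmux recipe below. A minimal sketch; the session name `mlops` and the one-window-per-service layout are arbitrary choices:

```bash=
# run each long-lived service in its own tmux window (a sketch)
tmux new-session -s mlops -d
tmux send-keys -t mlops:0 'source /home/jerry2024/deepdata/env/bin/activate && mlflow server --host 127.0.0.1 --port 8080' C-m
tmux new-window -t mlops
tmux send-keys -t mlops:1 'source /home/jerry2024/deepdata/env/bin/activate && airflow webserver --port 8000' C-m
tmux new-window -t mlops
tmux send-keys -t mlops:2 'source /home/jerry2024/deepdata/env/bin/activate && airflow scheduler' C-m
```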
-f 1-2)" CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt" # For example this would install 2.8.2 with python 3.8: https://raw.githubusercontent.com/apache/airflow/constraints-2.8.2/constraints-3.8.txt pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}" airflow db migrate airflow users create \ --username admin \ --firstname Peter \ --lastname Parker \ --role Admin \ --email spiderman@superhero.org # password: jerry2024 airflow webserver --port 8000 airflow scheduler # 停止airflow webserver ps -ef | grep 'airflow' | grep 'webserver' | awk '{print $2}' | xargs kill -9 cd $AIRFLOW_HOME rm -rf airflow-webserver.pid rm -rf airflow-webserver-monitor.pid # 停止airflow scheduler ps -ef | grep 'airflow' | grep 'scheduler' | awk '{print $2}' | xargs kill -9 cd $AIRFLOW_HOME rm -rf airflow-scheduler.pid ``` ## hadoop tmux ```bash= # exporter for hadoop cd /home/hdoop/jmx_exporter java -jar jmx_prometheus_httpserver-0.20.0.jar 12345 config.yaml # exporter for spark cd /usr/local/graphite_exporter ./graphite_exporter --graphite.mapping-config=graphite_exporter_mapping ## open tmux new-session -s exporters -d tmux send-keys -t exporters:0 'cd /home/hdoop/jmx_exporter' C-m tmux send-keys -t exporters:0 'java -jar jmx_prometheus_httpserver-0.20.0.jar 12345 config.yaml' C-m tmux split-window -h -t exporters tmux send-keys -t exporters:1 'cd /usr/local/graphite_exporter' C-m tmux send-keys -t exporters:1 './graphite_exporter --graphite.mapping-config=graphite_exporter_mapping' C-m ## close tmux detach -s exporters ``` ## npm ```bash= sudo apt install npm ``` ## java & spark [Notion](https://dasbd72.notion.site/Spark-Client-Installization-d2159c74379b4e45a279247de6649bc3) --- ## tmux usage [tutorial](https://andyyou.github.io/2017/11/27/tmux-notes/)