# Configuring Alertmanager to Send Alerts

## Prerequisites

* Prometheus and Alertmanager are already installed on the K8s cluster

```
$ kubectl -n monitoring-system get pod
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-monitoring-kube-prometheus-alertmanager-0   2/2     Running   0          6d23h
monitoring-grafana-85854b44d7-vtnwg                      3/3     Running   0          102m
monitoring-kube-prometheus-operator-9d59747fc-8shb7      1/1     Running   0          85m
monitoring-kube-state-metrics-749954766b-988dl           1/1     Running   0          102m
monitoring-prometheus-node-exporter-j8sdd                1/1     Running   0          102m
monitoring-prometheus-node-exporter-rt2xw                1/1     Running   0          102m
monitoring-prometheus-node-exporter-s47jr                1/1     Running   0          102m
prometheus-monitoring-kube-prometheus-prometheus-0       2/2     Running   0          27m
```

## Running a shell script from Alertmanager through a webhook

### Deploy the webhook

```
$ mkdir webhook
```

```
$ nano webhook/hooks.json
[
  {
    "id": "hello-world",
    "execute-command": "/usr/local/bin/test.sh",
    "command-working-directory": "/usr/local/bin/"
  }
]
```

```
$ nano webhook/test.sh
#!/usr/bin/env bash
echo "hello world!!!"
echo "test-$(date +%Y%m%d-%H%M)" >> /tmp/test.txt

$ chmod +x webhook/test.sh
```

```
$ nano webhook/Dockerfile
# Build stage: compile the webhook binary from source
FROM docker.io/library/golang:alpine AS build
LABEL maintainer="Andy Wu"
WORKDIR /go/src/github.com/adnanh/webhook
ENV WEBHOOK_VERSION=2.8.2
RUN apk add --update -t build-deps curl libc-dev gcc libgcc
RUN curl -L --silent -o webhook.tar.gz https://github.com/adnanh/webhook/archive/${WEBHOOK_VERSION}.tar.gz && \
    tar -xzf webhook.tar.gz --strip 1
RUN go get -d -v
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /usr/local/bin/webhook

# Runtime stage: bash image with the binary, the hook definition, and the script
FROM docker.io/library/bash
COPY --from=build /usr/local/bin/webhook /usr/local/bin/webhook
COPY hooks.json /etc/webhook/hooks.json
COPY test.sh /usr/local/bin
WORKDIR /etc/webhook
VOLUME ["/etc/webhook"]
EXPOSE 9000
ENTRYPOINT ["/usr/local/bin/webhook"]
CMD ["-verbose", "-hooks=/etc/webhook/hooks.json", "-hotreload"]
```

* Build the webhook image

```
$ sudo podman build --squash-all -t docker.io/taiwanese/webhook:2.8.2 webhook
```

* Run the webhook container

```
$ sudo podman run -d -p 9000:9000 --name=webhook docker.io/taiwanese/webhook:2.8.2
```

* Test the connection to the webhook

```
$ curl http://10.10.7.43:9000/hooks/hello-world
```

### Configure Alertmanager

* Update the Helm values file (`prometheus.yaml` in this guide)
    - By default, every AlertmanagerConfig that is created automatically gets an extra `namespace="xxx"` matcher, because Alertmanager's `alertmanagerConfigMatcherStrategy` defaults to `OnNamespace`. For a single AlertmanagerConfig to handle alerts from all namespaces, change `alertmanagerConfigMatcherStrategy` to `None` as shown here.

```
$ nano prometheus.yaml
kubelet:
  enabled: true
  serviceMonitor:
    cAdvisorMetricRelabelings: []
alertmanager:
  alertmanagerSpec:
    alertmanagerConfigMatcherStrategy:
      type: None
    alertmanagerConfigSelector:
      matchLabels:
        resource: prometheus
```

```
$ helm upgrade monitoring prometheus-community/kube-prometheus-stack --namespace monitoring-system -f prometheus.yaml
```

* Define the PrometheusRule
    - This rule fires an alert whenever a pod has been in `CrashLoopBackOff` for more than 1 minute
    - When the alert fires, it carries the labels `podcrash: "true"` and `severity: "warning"`

```
$ nano prometheusrules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-crash-rules
  namespace: monitoring-system
  labels:
    release: monitoring
spec:
  groups:
  - name: PodCrash-test
    rules:
    - alert: PodCrashLoop-test
      expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
      for: 1m
      labels:
        severity: "warning"
        podcrash: "true"
      annotations:
        summary: "Pod is crashing repeatedly (CrashLoopBackOff)"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is in CrashLoopBackOff for more than 1 minute."
```

```
$ kubectl apply -f prometheusrules.yaml
$ kubectl -n monitoring-system get prometheusrules pod-crash-rules
NAME              AGE
pod-crash-rules   11s
```

* In the Prometheus UI, search for the alert name `PodCrashLoop-test` to find the alert we just created.

![image](https://hackmd.io/_uploads/B1i-Qbwvee.png)

* Write the AlertmanagerConfig
    - `routes` decides which alerts go to which receiver. Here, any alert carrying the label `podcrash=true` is sent to the `webhook` receiver; everything else falls through to the `null` receiver
    - `receivers` defines where matched alerts are delivered, in this case the webhook URL

```
$ nano AlertmanagerConfig.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alert-webhook-config
  labels:
    resource: prometheus
  namespace: monitoring-system
spec:
  route:
    groupBy: ["severity"]
    receiver: "null"
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    routes:
    - matchers:
      - name: "podcrash"
        value: "true"
      receiver: 'webhook'
  receivers:
  - name: "null"
  - name: "webhook"
    webhookConfigs:
    - url: "http://10.10.7.43:9000/hooks/hello-world"
```

```
$ kubectl apply -f AlertmanagerConfig.yaml
$ kubectl -n monitoring-system get AlertmanagerConfig
NAME                   AGE
alert-webhook-config   13s
```

* Check that the generated config inside the Alertmanager pod is what you expect

```
$ kubectl -n monitoring-system exec alertmanager-monitoring-kube-prometheus-alertmanager-0 -- cat /etc/alertmanager/config_out/alertmanager.env.yaml
......
receivers:
- name: "null"
- name: monitoring-system/alert-config/webhook
  webhook_configs:
  - url: http://10.10.7.43:9000/hooks/hello-world
templates:
- /etc/alertmanager/config/*.tmpl
```

## Testing

* Create a pod that keeps crashing

```
$ kubectl create ns test
$ kubectl -n test create deploy crash --image=quay.io/hahappyman/myapp
$ kubectl -n test get po
NAME                     READY   STATUS             RESTARTS       AGE
crash-669c894db5-xrrbx   0/1     CrashLoopBackOff   1 (108s ago)   <invalid>
```

* The Prometheus UI shows which pod has triggered the alert

![image](https://hackmd.io/_uploads/B1Jtmzwvxg.png)

* The Alertmanager UI shows which alerts are currently being sent to the webhook

![image](https://hackmd.io/_uploads/BkRvD4Dwex.png)

* Check inside the webhook container that the script was executed

```
$ sudo podman exec webhook cat /tmp/test.txt
test-20250730-0403
```

## Sending alert emails to Gmail from Alertmanager

### Get a Google app password for sending mail through Gmail SMTP

1. Sign in to Gmail, click your avatar in the top-right corner, then click "Manage your Google Account" (「管理你的 Google 帳戶」)
![image](https://hackmd.io/_uploads/Bk-iTNDweg.png)

2. Open "Security" (「安全性」) in the left-hand menu and confirm that "2-Step Verification" (「兩步驟驗證」) is turned on
![image](https://hackmd.io/_uploads/S1_QnzDDgg.png)

3. In the search box at the top, type "app" (「應用」) and select "App passwords" (「應用程式密碼」) from the results
![image](https://hackmd.io/_uploads/ryt82GDwxg.png)

4. After signing in, enter a name such as 「官網SMTP」 ("website SMTP") and create the password (any name works, but avoid special characters)
![image](https://hackmd.io/_uploads/HJRY2GDvlg.png)

5. Copy the password and create the `alertmanager-gmail-auth` secret

```
$ kubectl create secret generic alertmanager-gmail-auth \
  --from-literal=password='xxxx xxxx xxxx xxxx' \
  -n monitoring-system
```

6. Create the AlertmanagerConfig. It reuses the alert rule from the previous section and sends the alert email to the specified mailbox

```
$ nano AlertmanagerConfig.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alert-gmail-config
  labels:
    resource: prometheus
  namespace: monitoring-system
spec:
  route:
    groupBy: ["severity"]
    receiver: "null"
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    routes:
    - matchers:
      - name: "podcrash"
        value: "true"
      receiver: "gmail"
  receivers:
  - name: "null"
  - name: "gmail"
    emailConfigs:
    - to: "your.email@gmail.com" # destination mailbox
      from: "your.email@gmail.com" # the Gmail account that generated the app password
      smarthost: "smtp.gmail.com:587"
      authUsername: "your.email@gmail.com" # the Gmail account that generated the app password
      authPassword:
        name: alertmanager-gmail-auth
        key: password
      requireTLS: true
      headers:
      - key: Subject
        value: "Prometheus Mail Alerts"
```

```
$ kubectl apply -f AlertmanagerConfig.yaml
$ kubectl -n monitoring-system get AlertmanagerConfig
NAME                 AGE
alert-gmail-config   11s
```

* Check that the generated config inside the Alertmanager pod is what you expect

```
$ kubectl -n monitoring-system exec alertmanager-monitoring-kube-prometheus-alertmanager-0 -- cat /etc/alertmanager/config_out/alertmanager.env.yaml
```

### Verification

* Confirm that the alert email arrives and that it shows which pod crashed in which namespace
* The sender is the Gmail account that generated the app password

![image](https://hackmd.io/_uploads/rJBmjNPwex.png)

## References

https://github.com/adnanh/webhook

https://ithelp.ithome.com.tw/articles/10364253

https://www.ibest.com.tw/news-detail/gmail-smtp/
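
Both AlertmanagerConfig examples in this guide route on the `podcrash=true` label, so the route can also be exercised without waiting for a real pod to crash. The script below is only a minimal sketch, not part of the original setup: it posts one synthetic alert to Alertmanager's `/api/v2/alerts` endpoint. It assumes Alertmanager is reachable on the default port 9093 (for example through `kubectl -n monitoring-system port-forward pod/alertmanager-monitoring-kube-prometheus-alertmanager-0 9093:9093`). The alert name `ManualRouteTest`, the script name, and the function names are hypothetical.

```shell
#!/usr/bin/env sh
# Sketch: send one synthetic alert to Alertmanager to exercise the podcrash=true route.
# ASSUMPTION: Alertmanager listens on the default port 9093 and is reachable at
# ALERTMANAGER_URL (e.g. through a kubectl port-forward).
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://localhost:9093}"

# Print a one-element JSON array in the shape accepted by POST /api/v2/alerts.
# "ManualRouteTest" is a hypothetical alert name used only for this test.
# Arguments: $1 = pod name, $2 = namespace (plain strings without quotes).
build_alert_payload() {
  printf '[{"labels":{"alertname":"ManualRouteTest","podcrash":"true","severity":"warning","pod":"%s","namespace":"%s"},"annotations":{"summary":"synthetic test alert"}}]\n' "$1" "$2"
}

# Pipe the payload to Alertmanager with curl.
send_test_alert() {
  build_alert_payload "$1" "$2" |
    curl -sS -X POST -H 'Content-Type: application/json' \
      --data @- "${ALERTMANAGER_URL}/api/v2/alerts"
}

# Contact the network only when explicitly asked: ./send-test-alert.sh send <pod> <namespace>
if [ "${1:-}" = "send" ]; then
  send_test_alert "${2:-crash-demo}" "${3:-test}"
fi
```

Saved as, say, `send-test-alert.sh`, it could be run as `sh send-test-alert.sh send crash-demo test`. Because the route uses `groupWait: 30s`, the webhook call or email should follow roughly half a minute after the alert is accepted.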