---
title: Prometheus # presentation title
tags: Prometheus # presentation tags
---

# Prometheus

> [name=翁維甫]
> [time=Thur, Jul 23, 2020 4:00 PM]

---

# Agenda

* Introduction to Prometheus
* Setting up Prometheus
* Dashboard
* Alerting system

---

## Introduction to Prometheus

---

### History

![](https://i.imgur.com/5yOuP7j.png)

---

### Features

* Multi-dimensional data model
    * Time series are identified by a metric name and key-value pairs (labels)
    * Any metric can carry arbitrary multi-dimensional labels
    * The data model is flexible and does not have to be split on a specific delimiter (e.g. `,`)
    * Data can be aggregated, sliced, and diced
    * Sample values are double-precision floats; label values may be Unicode

---

* A flexible query language (PromQL) that supports arithmetic such as addition, subtraction, multiplication, and division
* No dependency on distributed storage: the Prometheus server is a single binary that runs autonomously on a single node
* Time series are collected by pulling over HTTP
* Data can also be pushed through the Pushgateway
* Works with several visualization dashboards, such as Grafana
* Monitoring targets are found through "**dynamic service discovery**" or "**static file configuration**"

---

### Architecture

![](https://i.imgur.com/xUwWk2v.png)

---

### Prometheus vs. InfluxDB

* InfluxDB: only a database. It passively accepts client writes and query requests, and is push-based
* Prometheus: a complete monitoring system that scrapes data, queries it, and raises alerts, and is pull-based
* Push and pull differ mainly in which side initiates the transfer and in the resulting logical architecture

---

### Push vs. Pull

![](https://i.imgur.com/TzAp3Ug.png)

---

### Metric Types

* Counter: a cumulative value that only increases (or resets to zero on restart)
* Gauge: a value that can go up and down
* Histogram: observations counted in configurable buckets
* Summary: quantiles calculated on the client side

---

### Monitoring system

![](https://i.imgur.com/jR6O6m6.png)

---

## Setting up Prometheus

---

### Install Docker

```linux=
sudo apt-get install docker.io
```

### Install Docker Compose

```linux=
sudo curl -L "https://github.com/docker/compose/releases/download/1.25.4/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```

---

### Write the Docker Compose YAML file

```linux=
vim xxxx.yaml # choose any file name
```

```yaml=
version: '2'
networks: # define the networks
  service_net:
    driver: bridge # bridge driver; the network name "service_net" is user-defined
    ipam: # static IP addressing
      config:
        - subnet: 172.22.238.0/24 # subnet in CIDR notation
          gateway: 172.22.238.1
services: # services to start
  prometheus: # add the Prometheus service
    image: prom/prometheus # Docker Hub image
    hostname: prometheus # hostname inside the container
    restart: always # restart the container automatically
    volumes: # host paths to mount into the container
      - /data/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /data/prometheus/rules.yml:/etc/prometheus/rules.yml
      - /data/prometheus/prometheus-data:/prometheus
    command:
      - '--web.enable-lifecycle' # enable config reload over HTTP
      - '--config.file=/etc/prometheus/prometheus.yml' # config file inside the container
    ports: # publish container ports (vm-port:container-port)
      - '9090:9090'
    networks:
      service_net:
        ipv4_address: 172.22.238.10
    user: 0:0 # run as root
  alertmanager: # add the Alertmanager service
    image: prom/alertmanager
    hostname: alertmanager
    restart: always
    volumes:
      - /data/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    ports:
      - "9093:9093"
    networks:
      service_net:
        ipv4_address: 172.22.238.11
  grafana: # add the Grafana service
    image: grafana/grafana
    hostname: grafana
    restart: always
    environment: # install plugins through an environment variable
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    ports:
      - "3000:3000"
    networks:
      service_net:
        ipv4_address: 172.22.238.12
  node-exporter: # add the Node exporter service
    image: quay.io/prometheus/node-exporter
    hostname: node-exporter
    restart: always
    ports:
      - "9100:9100"
    networks:
      service_net:
        ipv4_address: 172.22.238.13
  cadvisor: # add the cAdvisor service
    image: google/cadvisor:latest
    hostname: cadvisor
    restart: always
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    networks:
      service_net:
        ipv4_address: 172.22.238.14
  black-exporter: # add the Blackbox exporter service
    image: prom/blackbox-exporter:v0.17.0
    hostname: black-exporter
    restart: always
    volumes:
      - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
      - /data/blackbox/blackbox.yml:/config/blackbox.yml
    command:
      - '--config.file=/config/blackbox.yml'
    ports:
      - '9115:9115'
    networks:
      service_net:
        ipv4_address: 172.22.238.15
```

---

### Prometheus configuration file

Create the directory

```linux=
mkdir -p /data/prometheus
```

Edit the configuration file

```linux=
vim /data/prometheus/prometheus.yml
```

```yaml=
global:
  scrape_interval: 15s # default interval between scrapes
  evaluation_interval: 15s # interval between rule evaluations
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.X.X.X:9093
rule_files:
  - "rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          instance: cadvisor
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: node
  - job_name: 'black-exporter'
    scrape_interval: 10s
    static_configs:
      - targets: ['black-exporter:9115']
        labels:
          instance: black_box
  - job_name: 'icmp'
    scrape_interval: 5s
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets: ['10.140.20.230','10.140.20.232']
        labels:
          group: 'QC+RD'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: ping
      - target_label: __address__
        replacement: 10.X.X.X:9115
  - job_name: 'http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - kennyweng.github.io
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.X.X.X:9115
```

---

### Blackbox configuration file

Create the directory

```linux=
mkdir -p /data/blackbox
```

Edit the configuration file

```linux=
vim /data/blackbox/blackbox.yml
```

```yaml=
modules:
  http_2xx: # HTTP probe module
    prober: http
  http_post_2xx: # HTTP POST probe module
    prober: http
    http:
      method: POST
  tcp_connect: # TCP probe module
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  icmp: # ICMP probe module
    prober: icmp
```

---

### Alert_rules configuration file

Edit the configuration file

```linux=
vim /data/prometheus/rules.yml
```

```yaml=
groups:
  - name: probe_icmp_duration
    rules:
      - alert: icmp # alert rule name
        expr: probe_icmp_duration_seconds > 1 # PromQL trigger condition (the metric is in seconds, so this means a ping taking longer than 1 s)
        for: 10s # the condition must hold this long before the alert is sent; the state is pending in the meantime
        labels:
          group: 'QC+RD'
        annotations: # alert annotations
          summary: "ping is too high"
  - name: check_ssl_status
    rules:
      - alert: "ssl expiring"
        expr: ceil((probe_ssl_earliest_cert_expiry - time())/86400) < 600 # days left before the certificate expires
        for: 1m
        labels:
          severity: warn
        annotations:
          value: '{{$value}}'
          summary: 'SSL certificate is about to expire'
          description: 'The certificate for kennyweng.github.io expires in {{$value}} days; please renew it soon'
  - name: blackbox_network_stats
    rules:
      - alert: blackbox_network_stats
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'kennyweng.github.io is unreachable'
          description: 'Please investigate as soon as possible'
```

---

### Alertmanager configuration file

Create the directory

```linux=
mkdir -p /data/alertmanager
```

Edit the configuration file

```linux=
vim /data/alertmanager/alertmanager.yml
```

```yaml=
global:
  resolve_timeout: 1m # time after which an alert is declared resolved if it is no longer updated; default 5m
route:
  group_by: ['alertname']
  receiver: 'slack'
  group_wait: 5s # wait before sending the first notification for a new group
  group_interval: 10s # wait before notifying about new alerts in an already-notified group
  repeat_interval: 30m # wait before re-sending a notification for alerts that are still firing
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/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'
        channel: '#kibana' # channel
        send_resolved: true # also notify when the alert resolves
        text: 'Help!' # alert message
        title_link: 'http://10.140.20.54:9093/#/alerts' # alert link
```

---

### Start the services

Start Docker Compose

```linux=
docker-compose -f /home/08admin/dc.yaml up -d
```

![](https://i.imgur.com/p4UYbRA.png)

Check the status

```linux=
docker ps -a
```

![](https://i.imgur.com/sZcDIjI.png)

---

### Acceptance

Open the host IP on port 9090 in a browser to reach the Prometheus web UI

![](https://i.imgur.com/zWRchep.png)

---

## Dashboard

Open the host IP on port 3000 in a browser to log in to Grafana

![](https://i.imgur.com/c6yVMoz.png)

---

### Add Data Sources

![](https://i.imgur.com/tt6oN4Q.png)

---

## Alerting system

---

### Alerting overview diagram

![](https://i.imgur.com/mghYTXw.png)

---

### Alert states

* Inactive
    * The threshold has not been triggered
* Pending
    * The threshold has been triggered, but the `for` duration has not yet elapsed
* Firing
    * The threshold has been triggered and the `for` duration has elapsed

---

### Alert example

```yaml=
groups:
  - name: probe_icmp_duration
    rules:
      - alert: icmp
        expr: probe_icmp_duration_seconds > 1
        for: 10s
        labels:
          group: 'QC+RD'
        annotations:
          summary: "ping is too high"
```

![](https://i.imgur.com/Fq7DCJr.png)

---

### Alertmanager features

![](https://i.imgur.com/H6DCVX9.png)

---

### Sender and receiver flow diagram

![](https://i.imgur.com/g4Md2oE.png)

---

### Alert noise reduction flow diagram
![](https://i.imgur.com/4rUhDrx.png)

---

### Alert noise reduction

* Grouping
    * Combines alerts of the same kind, which helps the operations team troubleshoot
    * Merges alert emails and messages to reduce the number of notifications
* Inhibition
    * Eliminates redundant alerts
    * Higher-severity alerts suppress lower-severity ones
* Silences
    * Blocks alerts that are expected from being sent
    * Ensures no repeated alerts arrive while an issue is being handled

---

### Inhibition

```yaml=
inhibit_rules:
  - source_match:
      alertname: NodeDown
      severity: 'critical'
    target_match:
      severity: 'warning'
    # when the alert name is the same, a critical alert inhibits the warning alert
    equal: ['alertname']
```

---

### Silences

![](https://i.imgur.com/02PQNdg.png)

---

### Alert delay flow diagram

![](https://i.imgur.com/cVFWjSL.png)

---

### Alert delays

```yaml=
group_wait: 5s # wait before sending the first notification for a new group
group_interval: 5m # wait before notifying about new alerts in an already-notified group
repeat_interval: 60m # wait before re-sending a notification for alerts that are still firing
```

![](https://i.imgur.com/3b6dTKW.png)
![](https://i.imgur.com/CAfz7ay.png)

---

### Alert examples

```yaml=
groups:
  - name: ServiceStatus
    rules:
      - alert: prometheus down # Prometheus is unhealthy
        expr: prometheus_config_last_reload_successful != 1 # the last Prometheus config reload failed
        for: 1m
        labels:
          name: prometheus
          severity: error
        annotations:
          summary: "prometheus down (instance {{ $labels.instance }})"
          description: "prometheus instance is down"
          value: "{{ $value }}"
      - alert: alertmanager down # Alertmanager is unhealthy
        expr: alertmanager_config_last_reload_successful != 1 # the last Alertmanager config reload failed
        for: 1m
        labels:
          name: alertmanager
          severity: error
        annotations:
          summary: "alertmanager down (instance {{ $labels.instance }})"
          description: "alertmanager instance is down"
          value: "{{ $value }}"
      - alert: cpu usage load # CPU usage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 # CPU usage above 80%
        for: 1m
        labels:
          name: cpu
          severity: critical
        annotations:
          # the expression aggregates by instance, so only the instance label is available here
          summary: "{{ $labels.instance }} CPU usage is too high"
          description: "{{ $labels.instance }} CPU usage is above 80%"
          value: "{{ $value }}%"
      - alert: mem usage # memory usage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80 # memory usage above 80%
        for: 1m
        labels:
          name: memory
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is above 80%"
          value: "{{ $value }}%"
      - alert: disk usage # disk usage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80 # disk usage above 80%
        for: 1m
        labels:
          name: disk
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }} disk usage is too high"
          description: "{{ $labels.mountpoint }} disk usage is above 80%"
          value: "{{ $value }}%"
      - alert: network in # inbound network traffic
        expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100 # inbound rate above 100 MiB/s (irate over a 2m window)
        for: 1m
        labels:
          name: network
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} inbound network traffic is too high"
          description: "{{ $labels.instance }} inbound traffic is abnormal, above 100 MiB/s"
          value: "{{ $value }}"
      - alert: network out # outbound network traffic
        expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100 # outbound rate above 100 MiB/s (irate over a 2m window)
        for: 1m
        labels:
          name: network
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} outbound network traffic is too high"
          description: "{{ $labels.instance }} outbound traffic is abnormal, above 100 MiB/s"
          value: "{{ $value }}"
      - alert: probe failed # target connectivity
        expr: probe_success == 0 # the target is unreachable
        for: 1m
        labels:
          name: blackbox
          severity: critical
        annotations:
          summary: "Probe failed (instance {{ $labels.instance }})"
          description: "Probe failed LABELS: {{ $labels }}"
          value: "{{ $value }}"
      - alert: http status code # HTTP status code
        expr: probe_http_status_code <= 199 or probe_http_status_code >= 400 # status code is at most 199 or at least 400
        for: 1m
        labels:
          name: blackbox
          severity: critical
        annotations:
          summary: "Status Code (instance {{ $labels.instance }})"
          description: "HTTP status code is not 200-299 LABELS: {{ $labels }}"
          value: "{{ $value }}"
```
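
---

### Alert state sketch

The alert states listed earlier (Inactive, Pending, Firing) and the `for:` field used in the rules can be illustrated with a small simulation. This is only a minimal sketch, not Prometheus code: the `AlertRule` class, its fields, and the `evaluate` method are hypothetical names introduced here, and it assumes a simple "value greater than threshold" condition evaluated at explicit timestamps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical, simplified model of one alerting rule with a `for` duration."""

    threshold: float                      # condition is "value > threshold"
    for_seconds: float                    # counterpart of the rule's `for:` field
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, value: float, now: float) -> str:
        """Return the alert state after one evaluation at time `now` (seconds)."""
        if value <= self.threshold:
            # Condition not met: forget any pending period.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            # Condition just became true: start the pending period.
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    # Mirrors `expr: probe_icmp_duration_seconds > 1` with `for: 10s`.
    rule = AlertRule(threshold=1.0, for_seconds=10.0)
    # (evaluation time in seconds, sampled value)
    for now, value in [(0, 1.5), (5, 1.5), (10, 1.5), (15, 0.5)]:
        print(now, value, rule.evaluate(value, now))
```

In this sketch a single evaluation below the threshold resets the pending period, so the condition has to hold continuously for the whole `for` duration before the state becomes firing.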