---
title: Prometheus # presentation title
tags: Prometheus # presentation tags
---
# Prometheus
> [name=翁維甫]
> [time=Thur, Jul 23, 2020 4:00 PM]
---
# Agenda
* Introduction to Prometheus
* Setting up Prometheus
* Dashboard
* Alerting system
---
## Introduction to Prometheus
---
### History

---
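### Features
* Multi-dimensional data model
    * A time series is identified by its metric name and a set of key-value pairs (labels)
    * Any metric can carry arbitrary multi-dimensional labels
    * The data model is flexible: dimensions do not need to be packed into the metric name with a delimiter (e.g. `,`)
    * Data can be aggregated, sliced, and diced along its labels
    * Sample values are double-precision floats, and label values may contain Unicode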
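
As a rough illustration of aggregating and slicing by label, the rule file below is a minimal sketch. The metric `http_requests_total`, its `status` label, and the group and rule names are assumptions for illustration; they do not come from this deck.

```yaml=
groups:
  - name: data_model_examples   # hypothetical group name
    rules:
      # Aggregate: per-job request rate, summed across all instances
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Slice: keep only the series whose status label is "500"
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status="500"}[5m]))
```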
---
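* A flexible query language (PromQL) that supports arithmetic such as addition, subtraction, multiplication, and division
* No dependency on distributed storage: the Prometheus server is a single binary, and each server node runs autonomously
* Time series are collected by pulling over HTTP
* Data can also be pushed through the Pushgateway
* Works with several visualization and dashboard tools, such as Grafana
* Monitoring targets are found through **dynamic service discovery** or **static file configuration**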
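
The two ways of finding targets, plus the Pushgateway path, can be sketched in `prometheus.yml` roughly as follows. The job names, the `app-1.example.internal` host, the targets directory, and the `pushgateway:9091` address are all assumptions for illustration.

```yaml=
scrape_configs:
  # Static configuration: targets are listed directly in this file
  - job_name: 'static-example'                     # hypothetical job name
    static_configs:
      - targets: ['app-1.example.internal:8080']   # hypothetical target

  # File-based service discovery: Prometheus re-reads the matching files,
  # so targets can change without editing prometheus.yml
  - job_name: 'file-sd-example'                    # hypothetical job name
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json         # hypothetical path
        refresh_interval: 1m

  # Pushgateway: short-lived jobs push to the gateway, and Prometheus still pulls from it
  - job_name: 'pushgateway'
    honor_labels: true                             # keep the job/instance labels that were pushed
    static_configs:
      - targets: ['pushgateway:9091']              # hypothetical address; 9091 is the default port
```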
---
### Architecture diagram

---
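### Prometheus vs. InfluxDB
* InfluxDB: a database only; it passively accepts data and queries from clients (push-based)
* Prometheus: a complete monitoring system that scrapes data, serves queries, and raises alerts (pull-based)
* Push and pull differ mainly in which side initiates the transfer, and in the resulting architecture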
---
### Push vs. Pull

---
### Metric Type
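* Counter: a cumulative value that only increases (or resets to zero on restart)
* Gauge: a value that can go up and down
* Histogram: observations counted into configurable buckets
* Summary: observations summarized as quantiles on the client side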
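
How each type is usually queried can be sketched with a few recording rules. `node_network_receive_bytes_total` and `node_memory_MemAvailable_bytes` are node-exporter metrics used later in this deck; `http_request_duration_seconds_bucket`, `rpc_duration_seconds`, and all rule names are assumptions for illustration.

```yaml=
groups:
  - name: metric_type_examples   # hypothetical group name
    rules:
      # Counter: query the per-second rate of increase, not the raw value
      - record: instance:node_network_receive_bytes:rate5m
        expr: sum by (instance) (rate(node_network_receive_bytes_total[5m]))
      # Gauge: the current value is meaningful on its own
      - record: instance:memory_available:bytes
        expr: node_memory_MemAvailable_bytes
      # Histogram: estimate a quantile on the server from the _bucket series
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      # Summary: quantiles are precomputed by the client and exposed through the quantile label
      - record: job:rpc_duration_seconds:p99
        expr: max by (job) (rpc_duration_seconds{quantile="0.99"})
```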
---
### Monitoring system

---
## Setting up Prometheus
---
### Install Docker
```linux=
sudo apt-get install docker.io
```
### Install Docker Compose
```linux=
sudo curl -L "https://github.com/docker/compose/releases/download/1.25.4/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```
---
### Write the Docker Compose YAML file
```linux=
vim xxxx.yaml # choose any file name
```
```yaml=
version: '2'
networks: # networks to create
  service_net: # custom network name
    driver: bridge
    ipam: # static IP addressing
      config:
        - subnet: 172.22.238.0/24 # subnet in CIDR notation
          gateway: 172.22.238.1
services: # services to start
  prometheus: # Prometheus service
    image: prom/prometheus # image on Docker Hub
    hostname: prometheus # hostname inside the container
    restart: always # restart the container automatically
    volumes: # host paths mounted into the container
      - /data/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /data/prometheus/rules.yml:/etc/prometheus/rules.yml
      - /data/prometheus/prometheus-data:/prometheus
    command:
      - '--web.enable-lifecycle' # enable reload through the HTTP API
      - '--config.file=/etc/prometheus/prometheus.yml' # config file inside the container
    ports: # publish container ports (host-port:container-port)
      - '9090:9090'
    networks:
      service_net:
        ipv4_address: 172.22.238.10
    user: "0:0" # run as root
  alertmanager: # Alertmanager service
    image: prom/alertmanager
    hostname: alertmanager
    restart: always
    volumes:
      - /data/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    ports:
      - "9093:9093"
    networks:
      service_net:
        ipv4_address: 172.22.238.11
  grafana: # Grafana service
    image: grafana/grafana
    hostname: grafana
    restart: always
    environment: # install plugins through an environment variable
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    ports:
      - "3000:3000"
    networks:
      service_net:
        ipv4_address: 172.22.238.12
  node-exporter: # Node exporter service
    image: quay.io/prometheus/node-exporter
    hostname: node-exporter
    restart: always
    ports:
      - "9100:9100"
    networks:
      service_net:
        ipv4_address: 172.22.238.13
  cadvisor: # cAdvisor service
    image: google/cadvisor:latest
    hostname: cadvisor
    restart: always
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    networks:
      service_net:
        ipv4_address: 172.22.238.14
  black-exporter: # Blackbox exporter service
    image: prom/blackbox-exporter:v0.17.0
    hostname: black-exporter
    restart: always
    volumes:
      - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
      - /data/blackbox/blackbox.yml:/config/blackbox.yml
    command:
      - '--config.file=/config/blackbox.yml'
    ports:
      - '9115:9115'
    networks:
      service_net:
        ipv4_address: 172.22.238.15
```
---
### Prometheus configuration file
Create the directory
```linux=
mkdir -p /data/prometheus
```
Edit the configuration file
```linux=
vim /data/prometheus/prometheus.yml
```
```yaml=
global:
  scrape_interval: 15s # default interval between scrapes
  evaluation_interval: 15s # interval between rule evaluations
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.X.X.X:9093
rule_files:
  - "rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          instance: cadvisor
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: node
  - job_name: 'black-exporter'
    scrape_interval: 10s
    static_configs:
      - targets: ['black-exporter:9115']
        labels:
          instance: black_box
  - job_name: 'icmp'
    scrape_interval: 5s
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets: ['10.140.20.230','10.140.20.232']
        labels:
          group: 'QC+RD'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: ping
      - target_label: __address__
        replacement: 10.X.X.X:9115
  - job_name: 'http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - kennyweng.github.io
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.X.X.X:9115
```
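
The `icmp` and `http` jobs rely on the same relabeling pattern: the listed target becomes the `target` URL parameter, and the scrape itself is redirected to the Blackbox exporter. As a sketch of reusing that pattern, an extra entry under `scrape_configs` for the `tcp_connect` module might look like the following. The job name and the `example.internal:22` target are assumptions, and `10.X.X.X:9115` is the same placeholder as in the config above.

```yaml=
  - job_name: 'tcp'                       # hypothetical job name
    metrics_path: /probe
    params:
      module: [tcp_connect]               # module name as defined in blackbox.yml
    static_configs:
      - targets: ['example.internal:22']  # hypothetical host:port to probe
    relabel_configs:
      - source_labels: [__address__]      # copy the target into the ?target= parameter
        target_label: __param_target
      - source_labels: [__param_target]   # keep the probed address as the instance label
        target_label: instance
      - target_label: __address__         # scrape the Blackbox exporter instead of the target
        replacement: 10.X.X.X:9115
```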
---
### Blackbox configuration file
Create the directory
```linux=
mkdir -p /data/blackbox
```
Edit the configuration file
```linux=
vim /data/blackbox/blackbox.yml
```
```yaml=
modules:
  http_2xx: # HTTP probe module
    prober: http
  http_post_2xx: # HTTP POST probe module
    prober: http
    http:
      method: POST
  tcp_connect: # TCP probe module
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  icmp: # ICMP probe module
    prober: icmp
```
---
### Alert rules configuration file
Edit the configuration file
```linux=
vim /data/prometheus/rules.yml
```
```yaml=
groups:
  - name: probe_icmp_duration
    rules:
      - alert: icmp # alert rule name
        expr: probe_icmp_duration_seconds > 1 # PromQL trigger condition (ping takes longer than 1 s)
        for: 10s # how long the condition must hold before the alert fires; the alert is pending in the meantime
        labels:
          group: 'QC+RD'
        annotations: # alert annotations
          summary: "ping is too high"
  - name: check_ssl_status
    rules:
      - alert: "ssl expiry"
        expr: ceil((probe_ssl_earliest_cert_expiry - time())/86400) < 600 # days until the certificate expires
        for: 1m
        labels:
          severity: warn
        annotations:
          value: '{{$value}}'
          summary: 'ssl certificate is about to expire'
          description: 'The certificate for kennyweng.github.io expires in {{$value}} days; please renew it soon'
  - name: blackbox_network_stats
    rules:
      - alert: blackbox_network_stats
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'kennyweng.github.io is unreachable'
          description: 'Please investigate as soon as possible'
```
---
### Alertmanager configuration file
Create the directory
```linux=
mkdir -p /data/alertmanager
```
Edit the configuration file
```linux=
vim /data/alertmanager/alertmanager.yml
```
```yaml=
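global:
  resolve_timeout: 1m # an alert is treated as resolved if it is not updated within this time (default 5m)
route:
  group_by: ['alertname']
  receiver: 'slack'
  group_wait: 5s # wait before sending the first notification for a new group
  group_interval: 10s # wait before notifying about new alerts added to an existing group
  repeat_interval: 30m # wait before re-sending a notification for alerts that are still firing
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/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'
        channel: '#kibana' # channel
        send_resolved: true # also notify when the alert resolves
        text: 'Help!' # notification text
        title_link: 'http://10.140.20.54:9093/#/alerts' # link attached to the notification title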
```
---
### Start the services
Start Docker Compose
```linux=
docker-compose -f /home/08admin/dc.yaml up -d
```

Check the container status
```linux=
docker ps -a
```

---
### Acceptance
Enter the host IP with port 9090 in a browser to open the Prometheus web UI

---
## Dashboard
Enter the host IP with port 3000 in a browser to log in to Grafana

---
### Add a data source
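
Besides adding the data source through the web UI, Grafana can also load it from a provisioning file. The following is a minimal sketch, assuming the file is mounted into the Grafana container under `/etc/grafana/provisioning/datasources/` (the compose file in this deck does not mount it) and that Prometheus is reachable at `http://prometheus:9090` on the compose network.

```yaml=
apiVersion: 1
datasources:
  - name: Prometheus          # display name in Grafana (any name works)
    type: prometheus
    access: proxy             # the Grafana backend issues the queries
    url: http://prometheus:9090
    isDefault: true
```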

---
## Alerting system
---
### Alerting overview diagram

---
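### Alert states
* Inactive
    * The threshold has not been crossed
* Pending
    * The threshold has been crossed, but not yet for the full `for` duration
* Firing
    * The threshold has been crossed for at least the `for` duration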
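
Pending and firing alerts are also exposed by Prometheus as the synthetic `ALERTS` time series, with the state in the `alertstate` label; inactive alerts produce no series. As a small sketch, the recording rules below count alerts per state. The group and rule names are assumptions.

```yaml=
groups:
  - name: alert_state_overview   # hypothetical group name
    rules:
      # Alerts whose condition holds but whose `for` duration has not elapsed yet
      - record: alertname:alerts_pending:count
        expr: count by (alertname) (ALERTS{alertstate="pending"})
      # Alerts that have been active for at least their `for` duration
      - record: alertname:alerts_firing:count
        expr: count by (alertname) (ALERTS{alertstate="firing"})
```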
---
### Alert example
```yaml=
groups:
  - name: probe_icmp_duration
    rules:
      - alert: icmp
        expr: probe_icmp_duration_seconds > 1
        for: 10s
        labels:
          group: 'QC+RD'
        annotations:
          summary: "ping is too high"
```

---
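### Alertmanager features

---
### Sender and receiver flowchart

---
### Alert consolidation flowchart

---
### Alert consolidation
* Grouping
    * Combines alerts of the same kind, which helps the operations team troubleshoot
    * Merges alert emails and messages, reducing the number of notifications
* Inhibition
    * Eliminates redundant alerts
    * Higher-severity alerts suppress lower-severity ones
* Silences
    * Blocks notifications for alerts that are expected
    * Ensures no repeated alerts arrive while an issue is being handled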
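
Grouping is configured on the Alertmanager route tree. As a sketch, the fragment below batches alerts by name and instance, and sends critical alerts through a child route with its own grouping. It reuses the `slack` receiver from the Alertmanager configuration in this deck; the label choices and the `severity` value are assumptions.

```yaml=
route:
  receiver: 'slack'
  group_by: ['alertname', 'instance']   # alerts sharing these labels go out as one notification
  routes:
    - match:
        severity: critical              # child route for critical alerts only
      receiver: 'slack'
      group_by: ['alertname']           # one notification per alert name
```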
---
### Inhibition
```yaml=
inhibit_rules:
  - source_match:
      alertname: NodeDown
      severity: 'critical'
    target_match:
      severity: 'warning'
    # a critical NodeDown alert suppresses warning alerts that share its alertname
    equal: ['alertname']
```
---
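### Silences

---
### Alert timing flowchart

---
### Alert timing
```yaml=
group_wait: 5s # wait before the first notification for a new group
group_interval: 5m # wait before notifying about new alerts added to the group
repeat_interval: 60m # wait before re-sending a notification for alerts that are still firing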
```


---
### Alert examples
```yaml=
groups:
  - name: ServiceStatus
    rules:
      - alert: prometheus down # Prometheus is down
        expr: prometheus_config_last_reload_successful != 1 # the last Prometheus config reload failed
        for: 1m
        labels:
          name: prometheus
          severity: error
        annotations:
          summary: "prometheus down (instance {{ $labels.instance }})"
          description: "prometheus instance is down"
          value: "{{ $value }}"
      - alert: alertmanager down # Alertmanager is down
        expr: alertmanager_config_last_reload_successful != 1 # the last Alertmanager config reload failed
        for: 1m
        labels:
          name: alertmanager
          severity: error
        annotations:
          summary: "alertmanager down (instance {{ $labels.instance }})"
          description: "alertmanager instance is down"
          value: "{{ $value }}"
      - alert: cpu usage load # CPU usage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 # CPU usage above 80%
        for: 1m
        labels:
          name: cpu
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} CPU usage is too high"
          description: "{{ $labels.instance }} CPU usage is above 80%"
          value: "{{ $value }}%"
      - alert: mem usage # memory usage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80 # memory usage above 80%
        for: 1m
        labels:
          name: memory
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is above 80%"
          value: "{{ $value }}%"
      - alert: disk usage # disk usage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80 # disk usage above 80%
        for: 1m
        labels:
          name: disk
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }} disk usage is too high"
          description: "{{ $labels.mountpoint }} disk usage is above 80%"
          value: "{{ $value }}%"
      - alert: network in # inbound network traffic
        expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100 # inbound traffic above 100 MiB/s (irate over a 2m window)
        for: 1m
        labels:
          name: network
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} inbound network traffic is too high"
          description: "{{ $labels.instance }} inbound network traffic is abnormal, above 100 MiB/s"
          value: "{{ $value }}"
      - alert: network out # outbound network traffic
        expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100 # outbound traffic above 100 MiB/s (irate over a 2m window)
        for: 1m
        labels:
          name: network
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} outbound network traffic is too high"
          description: "{{ $labels.instance }} outbound network traffic is abnormal, above 100 MiB/s"
          value: "{{ $value }}"
      - alert: probe failed # target reachability
        expr: probe_success == 0 # the target cannot be reached
        for: 1m
        labels:
          name: blackbox
          severity: critical
        annotations:
          summary: "Probe failed (instance {{ $labels.instance }})"
          description: "Probe failed LABELS: {{ $labels }}"
          value: "{{ $value }}"
      - alert: http status code # HTTP status code
        expr: probe_http_status_code <= 199 or probe_http_status_code >= 400 # status code is 199 or lower, or 400 or higher
        for: 1m
        labels:
          name: blackbox
          severity: critical
        annotations:
          summary: "Status Code (instance {{ $labels.instance }})"
          description: "HTTP status code is not in 200-399 LABELS: {{ $labels }}"
          value: "{{ $value }}"
```