# Monitoring Hardware (Prometheus + Exporter + Grafana)
###### tags: `Skill`
> [name=echo][time=Tue, Mar 29, 2022 5:30 PM]
> [TOC]
---
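### Purpose
- Monitor hardware utilization and send an alert by email when utilization exceeds the expected level
- Collect device metrics with Prometheus + Node Exporter
- Grafana shows the metrics on a dashboard and defines the alert conditions and notification channel (e.g. email)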
### Architecture and Guidelines
- Monitor hardware and set up an email alert mechanism
    - Monitored machines (12 in total)
        - KS (1 machine)
            - AA VM (10.128.128.177)
        - TPE (4 machines)
            - 10.109.6.10, 10.109.6.12, 10.109.6.13, 10.109.6.14
        - CQ (7 machines)
            - 10.142.3.58, 10.142.3.59, 10.142.3.60, 10.142.3.61, 10.142.3.62, 10.142.3.63, 10.142.3.64
    - Every monitored machine has Prometheus + Node Exporter installed

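- Dashboard and alert mechanism
    - Grafana is installed only on TPE02 in Taipei (10.109.6.13). Every monitored machine sends its metrics to TPE02, which displays them and holds the alert settings in one place
        - http://10.109.6.13:4000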

- Alert criteria
    - OS disk usage > 80%
    - Data disk usage > 80%
    - Memory usage > 80%
    - TODO: GPU memory usage > 90%
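
The memory criterion above can be checked by hand on any machine. The following is a minimal sketch, not the rule configured in Grafana: it assumes usage is defined as `(MemTotal - MemAvailable) / MemTotal`, and the function names `mem_usage_pct` and `exceeds_threshold` are made up for this example.

```shell
# Print memory usage as an integer percentage, reading /proc/meminfo-style
# text on stdin. Assumption: usage = (MemTotal - MemAvailable) / MemTotal.
mem_usage_pct() {
  awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
       END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}

# Succeed (exit 0) when the percentage in $1 is strictly above the
# threshold in $2 (default 80, the value used in the alert criteria).
exceeds_threshold() {
  [ "$1" -gt "${2:-80}" ]
}

# Demo on inline sample data so the sketch runs on any machine.
sample=$(printf 'MemTotal: 1000 kB\nMemAvailable: 150 kB\n' | mem_usage_pct)
if exceeds_threshold "$sample" 80; then
  echo "memory usage ${sample}% exceeds 80%"
fi
```

On a monitored Linux host, `mem_usage_pct < /proc/meminfo` gives the live value.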
### Installing Prometheus
- Port: 7070
- ex: http://10.109.6.10:7070
- Installation
- Download the release archive
```script
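cd /mnt/hdd1/prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.27.1/prometheus-2.27.1.linux-amd64.tar.gz
tar xvfz prometheus-2.27.1.linux-amd64.tar.gz
# Name the directory explicitly: the glob prometheus-* would also match the tarball
cd prometheus-2.27.1.linux-amd64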
```
- Add the prometheus user and install the binaries and config
```script
# Create a service user with no home directory and no login shell
sudo useradd --no-create-home --shell /bin/false prometheus
# Create the config and data directories
sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
# Install the binaries
sudo cp /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/prometheus /usr/local/bin/
sudo cp /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
# Install the console templates and libraries
sudo cp -r /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/consoles /etc/prometheus
sudo cp -r /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/console_libraries /etc/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus/consoles
sudo chown -R prometheus:prometheus /etc/prometheus/console_libraries
# Install the configuration file
sudo cp /mnt/hdd1/prometheus/prometheus.yml /etc/prometheus
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
```
- Set up the service
- Create the unit file with `sudo vim /etc/systemd/system/prometheus.service`
```script
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address="0.0.0.0:7070"
[Install]
WantedBy=multi-user.target
```
- Start and enable the service
```script
sudo systemctl daemon-reload
sudo systemctl restart prometheus
sudo systemctl enable prometheus
sudo systemctl status prometheus
```
### Installing Node Exporter
- Port: 9100
- ex: http://10.109.6.10:9100
- The official documentation recommends installing it directly on the host
- Download the release archive
```script
cd /mnt/hdd1/prometheus
wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
```
- Extract the archive
```script
tar xvfz node_exporter-*.*-amd64.tar.gz
mkdir node_exporter
mv node_exporter-1.1.2.linux-amd64 node_exporter
cd node_exporter/node_exporter-*.*-amd64
```
- Add node_exporter user
```script
sudo useradd -rs /bin/false node_exporter
```
- Set up the service
```script
sudo nano /etc/systemd/system/node_exporter.service
```
```script
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/mnt/hdd1/prometheus/node_exporter/node_exporter-1.1.2.linux-amd64/node_exporter
[Install]
WantedBy=multi-user.target
```
- Start and enable the service
```script
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
sudo systemctl status node_exporter
```
- Verify that Node Exporter was installed successfully (the metrics page should load)
```Script
http://10.109.6.13:9100/metrics
```
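The Prometheus steps above copy an existing `prometheus.yml` into `/etc/prometheus` without showing its content. As a rough sketch of what a scrape job for Node Exporter could look like, the function below prints a minimal config. The job name `node`, the 15s interval, and the function name `node_scrape_config` are assumptions for illustration, not the settings actually deployed.

```shell
# Print a minimal prometheus.yml that scrapes one Node Exporter target.
# The target defaults to localhost:9100 (the port used in this note).
# Job name and scrape interval are illustrative assumptions.
node_scrape_config() {
  target=${1:-localhost:9100}
  cat <<EOF
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['${target}']
EOF
}

# Demo: print the config for the local Node Exporter.
node_scrape_config
```

The output can be redirected to a file and validated with `promtool check config <file>` before it replaces `/etc/prometheus/prometheus.yml`.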
### Installing DCGM Exporter
* According to the official documentation, the following must be in place before installing DCGM Exporter
* Golang >= 1.14 installed
* DCGM installed
* Installation
* Install DCGM
* Register on the NVIDIA developer site and download DCGM (https://developer.nvidia.com/dcgm)
```Script
sudo dpkg -i datacenter-gpu-manager_2.1.4_amd64.deb
# Check that the installation succeeded
dcgmi --version
```
* Set up the Go environment
```Script
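# Download Go 1.17.2 (any version >= 1.14 meets the requirement above)
wget https://dl.google.com/go/go1.17.2.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.17.2.linux-amd64.tar.gz
# Set the Go environment variables
vi ~/.profile
# Add the following lines to ~/.profile
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
# After saving, reload the profile so the settings take effect
source ~/.profile
# Check that the installation succeeded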
go version
```
* Install DCGM Exporter
* Source: https://github.com/NVIDIA/dcgm-exporter
```Script
unzip dcgm-exporter-main.zip
cd dcgm-exporter-main/
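# Before building, make sure the machine can reach the internet and that the external route takes priority over the internal network
# `make install` must run as root, but it needs the user's environment (PATH) to find Go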
make binary
sudo env PATH="$PATH" make install
```
* Setting Service
```Script
sudo vim /etc/systemd/system/dcgm-exporter.service
# Write the following content
[Unit]
Description=dcgm-exporter service
[Service]
User=root
ExecStart=/usr/bin/dcgm-exporter
TimeoutStopSec=10
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
```
* Start the service
```Script
# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable dcgm-exporter
sudo systemctl start dcgm-exporter
# Check the running status
sudo systemctl status dcgm-exporter
```
* Verify that the installation succeeded (the metrics page should load)
```Script
http://10.109.6.10:9400/metrics
```
* Update the Prometheus config
```Script
# Only root has write permission
sudo vi /etc/prometheus/prometheus.yml
# Add the following job under scrape_configs (YAML indentation matters)
  - job_name: 'dcgm'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9400']
# Restart Prometheus
sudo systemctl restart prometheus
sudo systemctl enable prometheus
```
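For the GPU memory criterion still marked TODO, the usage percentage can be derived from the exporter's text output. This is a sketch under stated assumptions: it assumes the exporter exposes `DCGM_FI_DEV_FB_USED` and `DCGM_FI_DEV_FB_FREE` with a `gpu` label, and that used + free approximates total framebuffer memory. Both should be checked against the real `/metrics` page. The function name `gpu_mem_usage_pct` is made up for this example.

```shell
# Read Prometheus text-format metrics on stdin and print one line per GPU:
#   gpu <id> <used-percent>
# Assumptions: the metrics DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE exist,
# carry a gpu="<id>" label, and used + free approximates total memory.
gpu_mem_usage_pct() {
  awk '
    /^DCGM_FI_DEV_FB_(USED|FREE)[{]/ {
      if (match($0, /gpu="[^"]*"/)) {
        # Strip the leading gpu=" (5 chars) and the closing quote
        id = substr($0, RSTART + 5, RLENGTH - 6)
        # The sample value is the last field; label values may contain spaces
        if ($0 ~ /^DCGM_FI_DEV_FB_USED/) used[id] = $NF; else free[id] = $NF
      }
    }
    END {
      for (id in used) {
        total = used[id] + free[id]
        if (total > 0) printf "gpu %s %d\n", id, used[id] * 100 / total
      }
    }' | sort
}

# Demo on inline sample data (values are invented for illustration).
printf '%s\n' \
  'DCGM_FI_DEV_FB_USED{gpu="0",modelName="Example GPU"} 900' \
  'DCGM_FI_DEV_FB_FREE{gpu="0",modelName="Example GPU"} 100' |
  gpu_mem_usage_pct
```

Against a live exporter the input would come from `curl -s http://10.109.6.10:9400/metrics | gpu_mem_usage_pct`.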
### Installing Grafana
- Port: 4000
- Only on TPE02 (10.109.6.13): http://10.109.6.13:4000
- Installation
- Download and install the package
```script
sudo apt-get install -y adduser libfontconfig1
wget https://dl.grafana.com/oss/release/grafana_7.5.7_amd64.deb
sudo dpkg -i grafana_7.5.7_amd64.deb
```
- Grafana SMTP Configuration
- Edit the config with `sudo vi /etc/grafana/grafana.ini`
```script
[server]
http_port = 4000
[smtp]
enabled = true
host = 10.110.15.79:25
user =
password =
cert_file =
key_file =
skip_verify = true
from_address = AA@compal.com
from_name = Grafana
;ehlo_identity = dashboard.example.com
;startTLS_policy = NoStartTLS
```
- After editing the config, restart the Grafana service
```script
sudo systemctl restart grafana-server
```
- Start service after reboot
```script
sudo systemctl enable grafana-server.service
```
- Log in with Grafana's default credentials (Grafana asks for a new password on first login)
```script
user:admin
password:admin
```
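Grafana on TPE02 needs a Prometheus data source for each monitored machine before dashboards and alerts can use the metrics. This note does not say how the data sources were created. One option is Grafana's file-based provisioning, sketched below on the assumption that Grafana queries each machine's Prometheus on port 7070. The data source names and the function name `grafana_datasources` are hypothetical.

```shell
# Print a Grafana data source provisioning file with one Prometheus
# entry per host passed as an argument.
# Assumption: each host runs Prometheus on port 7070, as set up above.
grafana_datasources() {
  echo "apiVersion: 1"
  echo "datasources:"
  for host in "$@"; do
    cat <<EOF
  - name: prometheus-${host}
    type: prometheus
    access: proxy
    url: http://${host}:7070
EOF
  done
}

# Demo: the four TPE machines listed in this note.
grafana_datasources 10.109.6.10 10.109.6.12 10.109.6.13 10.109.6.14
```

With a default package install, saving the output as a `.yaml` file under `/etc/grafana/provisioning/datasources/` and restarting `grafana-server` should make the data sources appear.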