# Monitoring Hardware Devices (Prometheus + Exporter + Grafana)
###### tags: `Skill`

> [name=echo][time=Tue, Mar 29, 2022 5:30 PM]
> [TOC]

---

### Purpose
- Monitor hardware utilization; when utilization exceeds the expected level, send an Alert by Email
- Collect device metrics with Prometheus + Node Exporter
- Grafana displays the metrics on a Dashboard and defines the Alert conditions and notification method (ex: Email)

### Architecture and Guidelines
- Monitor hardware devices and set up the Alert mail mechanism
    - Monitored machines (12)
        - KS (1)
            - AA VM (10.128.128.177)
        - TPE (4)
            - 10.109.6.10, 10.109.6.12, 10.109.6.13, 10.109.6.14
        - CQ (7)
            - 10.142.3.58, 10.142.3.59, 10.142.3.60, 10.142.3.61, 10.142.3.62, 10.142.3.63, 10.142.3.64
    - Every monitored machine has Prometheus + Node Exporter installed

    ![](https://i.imgur.com/YczKqZU.png)
- Dashboard and Alert mechanism
    - Grafana is installed only on TPE02 in Taipei (10.109.6.13). Every monitored machine sends its metrics to TPE02, which displays them and holds the Alert settings in one place
        - http://10.109.6.13:4000

    ![](https://i.imgur.com/JCX6vU2.png)
- Alert criteria
    - OS Disk usage > 80%
    - Data Disk usage > 80%
    - Memory usage > 80%
    - TODO: GPU Memory usage > 90%

### Install Prometheus
- Port: 7070
    - ex: http://10.109.6.10:7070
- Installation
    - Download and extract the release archive
    ```script
    wget https://github.com/prometheus/prometheus/releases/download/v2.27.1/prometheus-2.27.1.linux-amd64.tar.gz
    tar xvfz prometheus-*.tar.gz
    cd prometheus-*
    ```
    - Add prometheus user
    ```script
    # The paths below assume the archive was extracted under /mnt/hdd1/prometheus
    # Create a system user without a home directory or login shell
    sudo useradd --no-create-home --shell /bin/false prometheus

    # Create the config and data directories
    sudo mkdir /etc/prometheus
    sudo mkdir /var/lib/prometheus
    sudo chown prometheus:prometheus /etc/prometheus
    sudo chown prometheus:prometheus /var/lib/prometheus

    # Install the binaries
    sudo cp /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/prometheus /usr/local/bin/
    sudo cp /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/promtool /usr/local/bin/
    sudo chown prometheus:prometheus /usr/local/bin/prometheus
    sudo chown prometheus:prometheus /usr/local/bin/promtool

    # Install the console templates and libraries
    sudo cp -r /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/consoles /etc/prometheus
    sudo cp -r /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/console_libraries /etc/prometheus
    sudo chown -R prometheus:prometheus /etc/prometheus/consoles
    sudo chown -R prometheus:prometheus /etc/prometheus/console_libraries

    # Install the config file
    sudo cp /mnt/hdd1/prometheus/prometheus.yml /etc/prometheus
    sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
    ```
- Setting service
    - sudo vim /etc/systemd/system/prometheus.service
    ```script
    [Unit]
    Description=Prometheus
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/usr/local/bin/prometheus \
        --config.file /etc/prometheus/prometheus.yml \
        --storage.tsdb.path /var/lib/prometheus/ \
        --web.console.templates=/etc/prometheus/consoles \
        --web.console.libraries=/etc/prometheus/console_libraries \
        --web.listen-address="0.0.0.0:7070"

    [Install]
    WantedBy=multi-user.target
    ```
- Enable service
    ```script
    sudo systemctl daemon-reload
    sudo systemctl restart prometheus
    sudo systemctl enable prometheus
    sudo systemctl status prometheus
    ```

### Install Node Exporter
- Port: 9100
    - ex: http://10.109.6.10:9100
- The official recommendation is to install it directly on the host
- Download the release archive
    ```script
    cd /mnt/hdd1/prometheus
    wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
    ```
- Extract
    ```script
    tar xvfz node_exporter-*.*-amd64.tar.gz
    mkdir node_exporter
    mv node_exporter-1.1.2.linux-amd64 node_exporter
    cd node_exporter/node_exporter-*.*-amd64
    ```
- Add node_exporter user
    ```script
    sudo useradd -rs /bin/false node_exporter
    ```
- Setting service
    ```script
    sudo nano /etc/systemd/system/node_exporter.service
    ```
    ```script
    [Unit]
    Description=Node Exporter
    After=network.target

    [Service]
    User=node_exporter
    Group=node_exporter
    Type=simple
    ExecStart=/mnt/hdd1/prometheus/node_exporter/node_exporter-1.1.2.linux-amd64/node_exporter

    [Install]
    WantedBy=multi-user.target
    ```
- Start the service
    ```script
    sudo systemctl daemon-reload
    sudo systemctl start node_exporter
    sudo systemctl enable node_exporter
    sudo systemctl status node_exporter
    ```
- Confirm that Node Exporter is installed and serving metrics by opening this URL
    ```script
    http://10.109.6.13:9100/metrics
    ```

### Install DCGM Exporter
* According to the official documentation, the following must be in place before installing DCGM Exporter
    * Golang >= 1.14 installed
    * DCGM installed
* Installation
    * Install DCGM
        * Register on the official site and download DCGM (https://developer.nvidia.com/dcgm)
    ```script
    sudo dpkg -i datacenter-gpu-manager_2.1.4_amd64.deb
    # Check that the installation succeeded
    dcgmi --version
    ```
    * Set up the Golang environment
    ```script
    # Download Golang (the latest version at the time of writing)
    wget https://dl.google.com/go/go1.17.2.linux-amd64.tar.gz
    sudo tar -C /usr/local -xzf go1.17.2.linux-amd64.tar.gz

    # Set the Golang environment variables
    vi ~/.profile
    # Add the following lines
    export GOROOT=/usr/local/go
    export GOPATH=$HOME/go
    export PATH=$GOPATH/bin:$GOROOT/bin:$PATH

    # After saving, run the following so the settings take effect
    source ~/.profile

    # Check that the installation succeeded
    go version
    ```
    * Install DCGM Exporter
        * (https://github.com/NVIDIA/dcgm-exporter)
    ```script
    unzip dcgm-exporter-main.zip
    cd dcgm-exporter-main/
    # Before running, make sure the host can reach the internet and that the
    # external route takes priority over the internal network
    # "make install" must run as root while keeping the user's environment (PATH)
    make binary
    sudo env PATH="$PATH" make install
    ```
* Setting Service
    ```script
    sudo vim /etc/systemd/system/dcgm-exporter.service
    # Write the following content
    [Unit]
    Description=dcgm-exporter service

    [Service]
    User=root
    ExecStart=/usr/bin/dcgm-exporter
    TimeoutStopSec=10
    Restart=on-failure
    RestartSec=5

    [Install]
    WantedBy=multi-user.target
    ```
* Start
    ```script
    # Start
    sudo systemctl daemon-reload
    sudo systemctl enable dcgm-exporter
    sudo systemctl start dcgm-exporter
    # Check the running status
    sudo systemctl status dcgm-exporter
    ```
* Confirm that the installation succeeded by opening this URL
    ```script
    http://10.109.6.10:9400/metrics
    ```
* Update the Prometheus config
    ```script
    # Only root has write permission
    sudo vi /etc/prometheus/prometheus.yml
    # Add the following job under scrape_configs
      - job_name: 'dcgm'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:9400']

    # Restart prometheus
    sudo systemctl restart prometheus
    sudo systemctl enable prometheus
    ```

### Install Grafana
- Port: 4000
    - Only on TPE02 (10.109.6.13): http://10.109.6.13:4000
- Installation
    - Download and install the package
    ```script
    sudo apt-get install -y adduser libfontconfig1
    wget https://dl.grafana.com/oss/release/grafana_7.5.7_amd64.deb
    sudo dpkg -i grafana_7.5.7_amd64.deb
    ```
- Grafana SMTP Configuration
    - sudo vi /etc/grafana/grafana.ini
    ```script
    [server]
    http_port = 4000

    [smtp]
    enabled = true
    host = 10.110.15.79:25
    user =
    password =
    cert_file =
    key_file =
    skip_verify = true
    from_address = AA@compal.com
    from_name = Grafana
    ;ehlo_identity = dashboard.example.com
    ;startTLS_policy = NoStartTLS
    ```
    - After editing, restart the Grafana service
    ```script
    sudo /etc/init.d/grafana-server restart
    ```
- Start service after reboot
    ```script
    sudo systemctl enable grafana-server.service
    ```
- Login
    ```script
    user:admin
    password:admin
    ```
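
### Checking the Memory threshold by hand

The Memory criterion above (usage > 80%) can be spot-checked from the command line without Grafana. The sketch below is only an illustration: it assumes memory usage is defined as `1 - MemAvailable / MemTotal`, which is a common convention and not necessarily the query used in the Dashboard. The function names `mem_usage_pct` and `over_threshold` are hypothetical helpers; the two metric names are the ones Node Exporter exposes on Linux.

```shell
#!/bin/sh
# Hypothetical helper: read Node Exporter text-format metrics on stdin and
# print memory usage in percent, assuming usage = 1 - MemAvailable / MemTotal.
mem_usage_pct() {
  awk '
    $1 == "node_memory_MemTotal_bytes"     { total = $2 }
    $1 == "node_memory_MemAvailable_bytes" { avail = $2 }
    END {
      if (total > 0) printf "%.1f\n", (1 - avail / total) * 100
      else exit 1   # MemTotal not found in the input
    }'
}

# Hypothetical helper: succeed (exit 0) when value $1 is greater than limit $2.
over_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}
```

A possible use against one monitored machine, with the 80% limit from the Alert criteria:

`pct=$(curl -s http://10.109.6.13:9100/metrics | mem_usage_pct) && over_threshold "$pct" 80 && echo "Memory usage ${pct}% exceeds 80%"`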