# Monitoring Hardware Devices (Prometheus + Exporter + Grafana)

###### tags: `Skill`

> [name=echo][time=Tue, Mar 29, 2022 5:30 PM]
> [TOC]

---

### Purpose

- Monitor hardware utilization and send an Alert by Email when utilization exceeds the expected level
- Collect device metrics with Prometheus + Node Exporter
- Show the metrics on a Grafana Dashboard and configure the Alert conditions and delivery method (ex: Email)

### Architecture and Guidelines

- Monitor hardware devices and set up the Alert mail mechanism
    - Monitored machines (12)
        - KS (1)
            - AA VM (10.128.128.177)
        - TPE (4)
            - 10.109.6.10, 10.109.6.12, 10.109.6.13, 10.109.6.14
        - CQ (7)
            - 10.142.3.58, 10.142.3.59, 10.142.3.60, 10.142.3.61, 10.142.3.62, 10.142.3.63, 10.142.3.64
    - Every monitored machine has Prometheus + Node Exporter installed
- Dashboard and Alert mechanism
    - Grafana is installed only on TPE02 in Taipei (10.109.6.13). Metrics from every monitored machine are collected on TPE02, which presents them and holds the Alert settings in one place
    - http://10.109.6.13:4000
- Alert rules
    - OS Disk usage > 80%
    - Data Disk usage > 80%
    - Memory usage > 80%
    - TODO: GPU Memory usage > 90%

### Install Prometheus

- Port: 7070
    - ex: http://10.109.6.10:7070
- Installation
    - Download and extract the release
        ```script
        # The cp commands in the next step expect the archive to be extracted under /mnt/hdd1/prometheus
        cd /mnt/hdd1/prometheus
        wget https://github.com/prometheus/prometheus/releases/download/v2.27.1/prometheus-2.27.1.linux-amd64.tar.gz
        tar xvfz prometheus-*.tar.gz
        cd prometheus-*
        ```
    - Add the prometheus user, then install the binaries and config
        ```script
        # Service account without a home directory or login shell
        sudo useradd --no-create-home --shell /bin/false prometheus

        # Config and data directories
        sudo mkdir /etc/prometheus
        sudo mkdir /var/lib/prometheus
        sudo chown prometheus:prometheus /etc/prometheus
        sudo chown prometheus:prometheus /var/lib/prometheus

        # Binaries
        sudo cp /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/prometheus /usr/local/bin/
        sudo cp /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/promtool /usr/local/bin/
        sudo chown prometheus:prometheus /usr/local/bin/prometheus
        sudo chown prometheus:prometheus /usr/local/bin/promtool

        # Console templates and libraries
        sudo cp -r /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/consoles /etc/prometheus
        sudo cp -r /mnt/hdd1/prometheus/prometheus-2.27.1.linux-amd64/console_libraries /etc/prometheus
        sudo chown -R prometheus:prometheus /etc/prometheus/consoles
        sudo chown -R prometheus:prometheus /etc/prometheus/console_libraries

        # Config file
        sudo cp /mnt/hdd1/prometheus/prometheus.yml /etc/prometheus
        sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
        ```
    - Set up the service
        - sudo vim /etc/systemd/system/prometheus.service
        ```script
        [Unit]
        Description=Prometheus
        Wants=network-online.target
        After=network-online.target

        [Service]
        User=prometheus
        Group=prometheus
        Type=simple
        ExecStart=/usr/local/bin/prometheus \
            --config.file /etc/prometheus/prometheus.yml \
            --storage.tsdb.path /var/lib/prometheus/ \
            --web.console.templates=/etc/prometheus/consoles \
            --web.console.libraries=/etc/prometheus/console_libraries \
            --web.listen-address="0.0.0.0:7070"

        [Install]
        WantedBy=multi-user.target
        ```
    - Enable the service
        ```script
        sudo systemctl daemon-reload
        sudo systemctl restart prometheus
        sudo systemctl enable prometheus
        sudo systemctl status prometheus
        ```

### Install Node Exporter

- Port: 9100
    - ex: http://10.109.6.10:9100
- The official recommendation is to install it directly on the host
- Download the release
    ```script
    cd /mnt/hdd1/prometheus
    wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
    ```
- Extract
    ```script
    tar xvfz node_exporter-*.*-amd64.tar.gz
    mkdir node_exporter
    mv node_exporter-1.1.2.linux-amd64 node_exporter
    cd node_exporter/node_exporter-*.*-amd64
    ```
- Add the node_exporter user
    ```script
    sudo useradd -rs /bin/false node_exporter
    ```
- Set up the service
    ```script
    sudo nano /etc/systemd/system/node_exporter.service
    ```
    ```script
    [Unit]
    Description=Node Exporter
    After=network.target

    [Service]
    User=node_exporter
    Group=node_exporter
    Type=simple
    ExecStart=/mnt/hdd1/prometheus/node_exporter/node_exporter-1.1.2.linux-amd64/node_exporter

    [Install]
    WantedBy=multi-user.target
    ```
- Start the service
    ```script
    sudo systemctl daemon-reload
    sudo systemctl start node_exporter
    sudo systemctl enable node_exporter
    sudo systemctl status node_exporter
    ```
- Confirm that Node Exporter is installed and serving metrics
    ```Script
    http://10.109.6.13:9100/metrics
    ```

### Install DCGM Exporter

* According to the official documentation, the following must be in place before installing DCGM Exporter
    * Golang >= 1.14 installed
    * DCGM installed
* Installation
    * Install DCGM
        * Register on the official site to download DCGM (https://developer.nvidia.com/dcgm)
        ```Script
        sudo dpkg -i datacenter-gpu-manager_2.1.4_amd64.deb
        # Check that the installation succeeded
        dcgmi --version
        ```
    * Set up the Golang environment
        ```Script
        # Download Golang (1.17.2 was the latest version at the time of writing)
        wget https://dl.google.com/go/go1.17.2.linux-amd64.tar.gz
        sudo tar -C /usr/local -xzf go1.17.2.linux-amd64.tar.gz
        # Set the Golang environment variables
        vi ~/.profile
        # Add the following lines to ~/.profile
        export GOROOT=/usr/local/go
        export GOPATH=$HOME/go
        export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
        # After saving, reload the profile so the settings take effect
        source ~/.profile
        # Check that the installation succeeded
        go version
        ```
    * Install DCGM Exporter
        * (https://github.com/NVIDIA/dcgm-exporter)
        ```Script
        unzip dcgm-exporter-main.zip
        cd dcgm-exporter-main/
        # Before running, make sure the machine can reach the external network
        # and that the external route has higher priority than the internal network
        # make install must run as root while keeping the user's environment (PATH)
        make binary
        sudo env PATH="$PATH" make install
        ```
    * Set up the service
        ```Script
        sudo vim /etc/systemd/system/dcgm-exporter.service
        # Write the following content:
        [Unit]
        Description=dcgm-exporter service

        [Service]
        User=root
        ExecStart=/usr/bin/dcgm-exporter
        TimeoutStopSec=10
        Restart=on-failure
        RestartSec=5

        [Install]
        WantedBy=multi-user.target
        ```
    * Start
        ```Script
        # Start
        sudo systemctl daemon-reload
        sudo systemctl enable dcgm-exporter
        sudo systemctl start dcgm-exporter
        # Check the running status
        sudo systemctl status dcgm-exporter
        ```
    * Confirm that the installation succeeded
        ```Script
        http://10.109.6.10:9400/metrics
        ```
    * Modify the Prometheus config
        ```Script
        # Only root has write permission
        sudo vi /etc/prometheus/prometheus.yml
        # Add the following job under scrape_configs:
          - job_name: 'dcgm'
            scrape_interval: 5s
            static_configs:
              - targets: ['localhost:9400']
        # restart prometheus
        sudo systemctl restart prometheus
        sudo systemctl enable prometheus
        ```

### Install Grafana

- Port: 4000
    - Only on TPE02 (10.109.6.13): http://10.109.6.13:4000
- Installation
    - Download and install the package
        ```script
        sudo apt-get install -y adduser libfontconfig1
        wget https://dl.grafana.com/oss/release/grafana_7.5.7_amd64.deb
        sudo dpkg -i grafana_7.5.7_amd64.deb
        ```
    - Grafana SMTP Configuration
        - sudo vi /etc/grafana/grafana.ini
        ```script
        [server]
        http_port = 4000

        [smtp]
        enabled = true
        host = 10.110.15.79:25
        user =
        password =
        cert_file =
        key_file =
        skip_verify = true
        from_address = AA@compal.com
        from_name = Grafana
        ;ehlo_identity = dashboard.example.com
        ;startTLS_policy = NoStartTLS
        ```
    - After editing the config, restart the Grafana service
        ```script
        sudo /etc/init.d/grafana-server restart
        ```
    - Start the service after reboot
        ```script
        sudo systemctl enable grafana-server.service
        ```
    - Login (default credentials)
        ```script
        user:admin
        password:admin
        ```
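
The alert thresholds stated earlier (disk and memory usage above 80%) come down to simple arithmetic on Node Exporter metrics. The sketch below is not the Grafana alert rule itself, only a hedged illustration of that arithmetic in Python. It assumes the standard Node Exporter metric names `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`, `node_filesystem_size_bytes` and `node_filesystem_avail_bytes`. The sample values, the `/data` mountpoint, and all function names are hypothetical, and the label parser is deliberately simplified (it does not handle escaped quotes).

```python
import re

# Threshold from the alert rules (80% for disk and memory).
THRESHOLD_PERCENT = 80.0

# Hypothetical sample of the text served at http://<host>:9100/metrics.
# The numbers and the /data mountpoint are made up for illustration.
SAMPLE_METRICS = """
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
node_memory_MemTotal_bytes 1.6e+10
node_memory_MemAvailable_bytes 2e+09
node_filesystem_size_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 1e+11
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 5e+10
node_filesystem_size_bytes{device="/dev/sdb1",fstype="ext4",mountpoint="/data"} 2e+12
node_filesystem_avail_bytes{device="/dev/sdb1",fstype="ext4",mountpoint="/data"} 2e+11
"""

_SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse exposition text into {(metric_name, sorted_label_tuple): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(_LABEL.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


def memory_usage_percent(samples):
    """Used memory as a percentage of total, based on MemAvailable."""
    total = samples[("node_memory_MemTotal_bytes", ())]
    available = samples[("node_memory_MemAvailable_bytes", ())]
    return 100.0 * (1.0 - available / total)


def disk_usage_percent(samples, mountpoint):
    """Used space on one mountpoint as a percentage of its size."""
    def pick(metric):
        for (name, labels), value in samples.items():
            if name == metric and dict(labels).get("mountpoint") == mountpoint:
                return value
        raise KeyError(f"{metric} not found for mountpoint {mountpoint}")

    size = pick("node_filesystem_size_bytes")
    available = pick("node_filesystem_avail_bytes")
    return 100.0 * (1.0 - available / size)


def find_breaches(samples, mountpoints, threshold=THRESHOLD_PERCENT):
    """Return (name, percent) pairs for every reading above the threshold."""
    readings = [("memory", memory_usage_percent(samples))]
    readings += [(f"disk {mp}", disk_usage_percent(samples, mp)) for mp in mountpoints]
    return [(name, pct) for name, pct in readings if pct > threshold]
```

Calling `find_breaches(parse_metrics(SAMPLE_METRICS), ["/", "/data"])` on the made-up sample reports memory and `/data` as above the threshold and leaves `/` out. Which mountpoints count as the OS disk and the data disk on each machine is an assumption to adjust per host.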