---
tags: prometheus, WAMS
---
# Kubernetes Monitoring Solution
[TOC]

## Main Components
### Prometheus-operator
> github: https://github.com/prometheus-operator/prometheus-operator
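**Prometheus Operator** is a controller, open-sourced by **CoreOS**, for managing Prometheus on Kubernetes. It relies on custom resources and aims to <font color='red'>**simplify deploying and maintaining Prometheus**</font>.
:::spoiler Prometheus-operator has the following features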
1. **Kubernetes Custom Resources**: Use Kubernetes <font color='red'>custom resources</font> to deploy and manage Prometheus, Alertmanager, and related components.
2. **Simplified Deployment Configuration**: Configure the fundamentals of Prometheus like versions, persistence, retention policies, and replicas from a native Kubernetes resource.
3. **Prometheus Target Configuration**: Automatically generate monitoring target configurations based on familiar Kubernetes label queries; no need to learn a Prometheus specific configuration language.
:::
:::spoiler Prometheus-operator includes the following main components:
* `Prometheus`: which defines a desired Prometheus deployment.
* `Alertmanager`: which defines a desired Alertmanager deployment.
* `ThanosRuler`: which defines a desired Thanos Ruler deployment.
* `ServiceMonitor`: which declaratively specifies how groups of Kubernetes **services** should be monitored. The Operator <font color='red'>**automatically generates Prometheus scrape configuration**</font> based on the current state of the objects in the API server.
* `PodMonitor`: which declaratively specifies how groups of **pods** should be monitored. The Operator <font color='red'>**automatically generates Prometheus scrape configuration**</font> based on the current state of the objects in the API server.
* `Probe`: which declaratively specifies how groups of ingresses or static targets should be monitored. The Operator automatically generates Prometheus scrape configuration based on the definition.
* `PrometheusRule`: which defines a desired set of Prometheus alerting and/or recording rules. The Operator generates a rule file, which can be used by Prometheus instances.
* `AlertmanagerConfig`: which declaratively specifies subsections of the Alertmanager configuration, allowing routing of alerts to custom receivers, and setting inhibit rules.
:::
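As a rough illustration of the `ServiceMonitor` resource described above, the sketch below selects Services by label and scrapes a named port. The resource name `example-app`, the `app: example-app` label, the `default` namespace, the `web` port name, and the 30s interval are all assumptions for illustration, not values from this setup. Whether a Prometheus instance actually picks it up depends on the `serviceMonitorSelector` of the `Prometheus` resource.

```yaml=
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app          # hypothetical name
  namespace: monitoring
  labels:
    app: example-app
spec:
  selector:
    matchLabels:
      app: example-app       # hypothetical label on the target Service
  namespaceSelector:
    matchNames:
      - default              # namespace of the target Service (assumed)
  endpoints:
    - port: web              # name of the Service port to scrape (assumed)
      path: /metrics
      interval: 30s
```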
---
### Prometheus
> github: https://github.com/prometheus/prometheus

:::spoiler How Prometheus differs from other monitoring systems
1. A multi-dimensional data model (time series defined by metric name and set of key/value dimensions)
2. <font color='red'>PromQL</font>, a powerful and flexible query language to leverage this dimensionality
3. No dependency on distributed storage; single server nodes are autonomous
4. An <font color='red'>HTTP pull model</font> for time series collection
5. Pushing time series is supported via an intermediary gateway (<font color='red'>push gateway</font>) for batch jobs
6. Targets are discovered via service discovery (e.g. **consul**) or static configuration (**prometheus.yml**)
7. Multiple modes of graphing and dashboarding support
8. Support for hierarchical and horizontal <font color='red'>federation</font>
:::
---
### Node-exporter
> github: https://github.com/prometheus/node_exporter
Prometheus exporter for **hardware and OS metrics exposed** by *NIX kernels, written in Go with pluggable metric collectors.
---
### Kube-state-metrics
> github: https://github.com/kubernetes/kube-state-metrics
> more exposed metrics: https://github.com/kubernetes/kube-state-metrics/tree/master/docs
kube-state-metrics is a simple service that **listens to the Kubernetes API server** and generates metrics (e.g. `pod state`, `container state`, `endpoints`, `service`) about the state of the objects.
---
### Prometheus-Adapter
> github: https://github.com/kubernetes-sigs/prometheus-adapter
This repository contains an implementation of the Kubernetes [resource metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/resource-metrics-api.md), [custom metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md), and [external metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/external-metrics-api.md) APIs.
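
To make the custom metrics API more concrete, here is a minimal sketch of a HorizontalPodAutoscaler that scales on a per-pod metric served through the adapter. The Deployment name `example-app`, the metric name `http_requests_per_second`, and the target value are hypothetical, and the metric is only available if a matching adapter rule exposes it. `autoscaling/v2beta2` is used here to match the Kubernetes versions covered in this note; newer clusters use `autoscaling/v2`.

```yaml=
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app                      # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app                    # hypothetical Deployment to scale
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods                         # per-pod metric from the custom metrics API
      pods:
        metric:
          name: http_requests_per_second # hypothetical metric exposed by the adapter
        target:
          type: AverageValue
          averageValue: "10"
```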
---
### Ingress
#### Ingress resource
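Defines the routing rules. Most of the configuration lives in `spec`; additional settings go into annotations, which differ depending on the underlying controller implementation.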
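
A minimal sketch of such an Ingress resource, assuming an nginx-based controller (ingress-nginx): the routing rule sits under `spec`, while the annotations are controller-specific. The host `prometheus.example.com` and the Secret name `basic-auth` are hypothetical; the backend Service `prometheus-k8s` on port 9090 in the `monitoring` namespace is the one used in the port-forward step of this note. `networking.k8s.io/v1` requires Kubernetes 1.19 or later.

```yaml=
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus
  namespace: monitoring
  annotations:
    # Controller-specific settings; these are understood by ingress-nginx
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth   # hypothetical Secret name
spec:
  ingressClassName: nginx
  rules:
    - host: prometheus.example.com                        # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-k8s
                port:
                  number: 9090
```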
#### Ingress controller
> Ingress controller種類: https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/#additional-controllers
Implements the underlying service and dynamically updates the configuration inside the nginx pod according to the Ingress resource settings.
---
## Installation
### Kube-prometheus
:::spoiler Components included in this package:
* Prometheus Operator
* Highly available Prometheus
* Highly available Alertmanager (deployed locally; not deployed on GKE)
* Prometheus node-exporter
* Prometheus Adapter for Kubernetes Metrics APIs
* Kube-state-metrics
* Grafana (deployed locally; not deployed on GKE)
:::
<br>
1. Check Prerequisites
kubelet configuration must contain these flags:
* `--authentication-token-webhook=true`
* `--authorization-mode=Webhook`
2. Check Compatibility
| kube-prometheus stack | Kubernetes 1.16 | Kubernetes 1.17 | Kubernetes 1.18 | Kubernetes 1.19 | Kubernetes 1.20 |
| --------------------- | --------------- | --------------- | --------------- | --------------- | --------------- |
| `release-0.4` | ✔ (v1.16.5+) | ✔ | ✗ | ✗ | ✗ |
| `release-0.5` | ✗ | ✗ | ✔ | ✗ | ✗ |
| `release-0.6` | ✗ | ✗ | ✗ | ✔ | ✗ |
| `release-0.7` | ✗ | ✗ | ✗ | ✔ | ✔ |
| `HEAD` | ✗ | ✗ | ✗ | ✔ | ✔ |
3. Clone project kube-prometheus
```bash=
$ git clone https://github.com/prometheus-operator/kube-prometheus.git
```
4. Check out the project version that matches your cluster according to the compatibility table
```bash=
$ cd kube-prometheus
$ git checkout <tag or commit SHA>
```
5. Remove the Grafana and Alertmanager manifests
```bash=
$ rm -f manifests/alertmanager-*
$ rm -f manifests/grafana-*
```
6. Create the monitoring stack using the config in the manifests directory
```bash= !
# Create the namespace and CRDs, and then wait for them to be available before creating the remaining resources
$ kubectl create -f manifests/setup
$ until kubectl get servicemonitors --all-namespaces ;do date; sleep 1; echo ""; done
$ kubectl create -f manifests/
```
To tear down the stack:
```bash=
$ kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
```
7. (Optional test) Access the dashboards by port-forwarding
* Prometheus
```bash !
$ kubectl --namespace monitoring port-forward svc/prometheus-k8s 9090
```
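Then access via http://localhost:9090. `kubectl port-forward` binds to localhost by default; add `--address 0.0.0.0` to the command to reach it via http://{HOST_IP}:9090.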
* Grafana (only if the Grafana manifests were kept in step 5)
```bash !
$ kubectl --namespace monitoring port-forward svc/grafana 3000
```
Then access via http://localhost:3000 and use the default grafana `user:password` of `admin:admin`.
### Consul
> https://hackmd.io/Uew-BqRWQz6qxSix47SMkg
### Prometheus
> prometheus stacks: http://ibdo.efoxconn.com:5000/QAD/prometheus-stack
docker-compose.yml
```yaml= !
version: "3.1"
services:
  grafana:
    image: grafana/grafana:7.3.5
    container_name: grafana
    ports:
      - '3000:3000'
    environment:
      - GF_SECURITY_ADMIN_USER=ibdo # grafana user
      - GF_SECURITY_ADMIN_PASSWORD=ibdo2018 # grafana password
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/ # Grafana dashboard JSON files go here
    restart: always
    depends_on:
      - prometheus
  prometheus:
    image: prom/prometheus:v2.23.0
    container_name: prometheus
    ports:
      - '9090:9090'
    command:
      - '--config.file=/etc/prometheus/prometheus.yml' # path to the Prometheus config file
      - '--storage.tsdb.retention.time=2h' # keep only two hours of data
      - '--web.enable-lifecycle' # allow reloading the config via an HTTP POST
    volumes:
      - ./prometheus/config/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    restart: always
    depends_on:
      - influxdb
  influxdb:
    image: influxdb:1.8.3
    container_name: influxdb
    ports:
      - '8086:8086'
    environment:
      - INFLUXDB_DB=prometheus
      - INFLUXDB_ADMIN_USER=ibdo
      - INFLUXDB_ADMIN_PASSWORD=ibdo2018
      - INFLUXDB_DATA_MAX_SERIES_PER_DATABASE=0 # The maximum number of series allowed per database before writes are dropped.
    restart: always
    volumes:
      - influxdb_data:/var/lib/influxdb
volumes:
  prometheus_data: {}
  grafana_data: {}
  influxdb_data: {}
```
prometheus.yml
```yaml=
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  # GKE Kube-prometheus
  - job_name: 'federate'
    scrape_interval: 5m
    scrape_timeout: 3m
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # kube_prometheus_metrics
        - '{__name__="cass_jvm_heap"}'
        - '{__name__="cass_jvm_heap_max"}'
        - '{__name__="cass_jvm_noheap"}'
        - '{__name__="cass_jvm_noheap_max"}'
        - '{__name__="container_cpu_usage_seconds_total"}'
        - '{__name__="container_fs_limit_bytes"}'
        - '{__name__="container_fs_usage_bytes"}'
        - '{__name__="container_memory_rss"}'
        - '{__name__="container_memory_working_set_bytes"}'
        - '{__name__="container_network_receive_bytes_total"}'
        - '{__name__="container_network_transmit_bytes_total"}'
        - '{__name__="kube_configmap_info"}'
        - '{__name__="kube_namespace_labels"}'
        - '{__name__="kube_node_info"}'
        - '{__name__="kube_node_status_allocatable_cpu_cores"}'
        - '{__name__="kube_node_status_capacity_cpu_cores"}'
        - '{__name__="kube_node_status_capacity_memory_bytes"}'
        - '{__name__="kube_node_status_capacity_pods"}'
        - '{__name__="kube_node_status_condition"}'
        - '{__name__="kube_pod_container_info"}'
        - '{__name__="kube_pod_container_resource_limits_cpu_cores"}'
        - '{__name__="kube_pod_container_resource_limits_memory_bytes"}'
        - '{__name__="kube_pod_container_resource_requests_cpu_cores"}'
        - '{__name__="kube_pod_container_resource_requests_memory_bytes"}'
        - '{__name__="kube_pod_container_status_restarts_total"}'
        - '{__name__="kube_pod_info"}'
        - '{__name__="kube_pod_status_phase"}'
        - '{__name__="kube_secret_info"}'
        - '{__name__="kube_service_info"}'
        - '{__name__="machine_cpu_cores"}'
        - '{__name__="machine_memory_bytes"}'
        - '{__name__="origin_prometheus"}'
        # node_exporter_full_metrics
        - '{__name__="node_arp_entries"}'
        - '{__name__="node_context_switches_total"}'
        - '{__name__="node_cooling_device_cur_state"}'
        - '{__name__="node_cooling_device_max_state"}'
        - '{__name__="node_cpu_seconds_total"}'
        - '{__name__="node_disk_discard_time_seconds_total"}'
        - '{__name__="node_disk_discards_completed_total"}'
        - '{__name__="node_disk_discards_merged_total"}'
        - '{__name__="node_disk_io_now"}'
        - '{__name__="node_disk_io_time_seconds_total"}'
        - '{__name__="node_disk_io_time_weighted_seconds_total"}'
        - '{__name__="node_disk_read_bytes_total"}'
        - '{__name__="node_disk_read_time_seconds_total"}'
        - '{__name__="node_disk_reads_completed_total"}'
        - '{__name__="node_disk_reads_merged_total"}'
        - '{__name__="node_disk_write_time_seconds_total"}'
        - '{__name__="node_disk_writes_completed_total"}'
        - '{__name__="node_disk_writes_merged_total"}'
        - '{__name__="node_disk_written_bytes_total"}'
        - '{__name__="node_entropy_available_bits"}'
        - '{__name__="node_filefd_allocated"}'
        - '{__name__="node_filefd_maximum"}'
        - '{__name__="node_filesystem_avail_bytes"}'
        - '{__name__="node_filesystem_device_error"}'
        - '{__name__="node_filesystem_files"}'
        - '{__name__="node_filesystem_files_free"}'
        - '{__name__="node_filesystem_free_bytes"}'
        - '{__name__="node_filesystem_readonly"}'
        - '{__name__="node_filesystem_size_bytes"}'
        - '{__name__="node_forks_total"}'
        - '{__name__="node_hwmon_temp_celsius"}'
        - '{__name__="node_hwmon_temp_crit_alarm_celsius"}'
        - '{__name__="node_hwmon_temp_crit_celsius"}'
        - '{__name__="node_hwmon_temp_crit_hyst_celsius"}'
        - '{__name__="node_hwmon_temp_max_celsius"}'
        - '{__name__="node_interrupts_total"}'
        - '{__name__="node_intr_total"}'
        - '{__name__="node_load1"}'
        - '{__name__="node_load15"}'
        - '{__name__="node_load5"}'
        - '{__name__="node_memory_Active_anon_bytes"}'
        - '{__name__="node_memory_Active_bytes"}'
        - '{__name__="node_memory_Active_file_bytes"}'
        - '{__name__="node_memory_AnonHugePages_bytes"}'
        - '{__name__="node_memory_AnonPages_bytes"}'
        - '{__name__="node_memory_Bounce_bytes"}'
        - '{__name__="node_memory_Buffers_bytes"}'
        - '{__name__="node_memory_Cached_bytes"}'
        - '{__name__="node_memory_CommitLimit_bytes"}'
        - '{__name__="node_memory_Committed_AS_bytes"}'
        - '{__name__="node_memory_DirectMap1G_bytes"}'
        - '{__name__="node_memory_DirectMap2M_bytes"}'
        - '{__name__="node_memory_DirectMap4k_bytes"}'
        - '{__name__="node_memory_Dirty_bytes"}'
        - '{__name__="node_memory_HardwareCorrupted_bytes"}'
        - '{__name__="node_memory_HugePages_Free"}'
        - '{__name__="node_memory_HugePages_Rsvd"}'
        - '{__name__="node_memory_HugePages_Surp"}'
        - '{__name__="node_memory_HugePages_Total"}'
        - '{__name__="node_memory_Hugepagesize_bytes"}'
        - '{__name__="node_memory_Inactive_anon_bytes"}'
        - '{__name__="node_memory_Inactive_bytes"}'
        - '{__name__="node_memory_Inactive_file_bytes"}'
        - '{__name__="node_memory_KernelStack_bytes"}'
        - '{__name__="node_memory_Mapped_bytes"}'
        - '{__name__="node_memory_MemFree_bytes"}'
        - '{__name__="node_memory_MemTotal_bytes"}'
        - '{__name__="node_memory_Mlocked_bytes"}'
        - '{__name__="node_memory_NFS_Unstable_bytes"}'
        - '{__name__="node_memory_PageTables_bytes"}'
        - '{__name__="node_memory_Percpu_bytes"}'
        - '{__name__="node_memory_SReclaimable_bytes"}'
        - '{__name__="node_memory_SUnreclaim_bytes"}'
        - '{__name__="node_memory_ShmemHugePages_bytes"}'
        - '{__name__="node_memory_ShmemPmdMapped_bytes"}'
        - '{__name__="node_memory_Shmem_bytes"}'
        - '{__name__="node_memory_Slab_bytes"}'
        - '{__name__="node_memory_SwapCached_bytes"}'
        - '{__name__="node_memory_SwapTotal_bytes"}'
        - '{__name__="node_memory_Unevictable_bytes"}'
        - '{__name__="node_memory_VmallocChunk_bytes"}'
        - '{__name__="node_memory_VmallocTotal_bytes"}'
        - '{__name__="node_memory_VmallocUsed_bytes"}'
        - '{__name__="node_memory_WritebackTmp_bytes"}'
        - '{__name__="node_memory_Writeback_bytes"}'
        - '{__name__="node_netstat_Icmp_InErrors"}'
        - '{__name__="node_netstat_Icmp_InMsgs"}'
        - '{__name__="node_netstat_Icmp_OutMsgs"}'
        - '{__name__="node_netstat_IpExt_InOctets"}'
        - '{__name__="node_netstat_IpExt_OutOctets"}'
        - '{__name__="node_netstat_Ip_Forwarding"}'
        - '{__name__="node_netstat_TcpExt_ListenDrops"}'
        - '{__name__="node_netstat_TcpExt_ListenOverflows"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesFailed"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesRecv"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesSent"}'
        - '{__name__="node_netstat_TcpExt_TCPSynRetrans"}'
        - '{__name__="node_netstat_Tcp_ActiveOpens"}'
        - '{__name__="node_netstat_Tcp_CurrEstab"}'
        - '{__name__="node_netstat_Tcp_InErrs"}'
        - '{__name__="node_netstat_Tcp_InSegs"}'
        - '{__name__="node_netstat_Tcp_MaxConn"}'
        - '{__name__="node_netstat_Tcp_OutSegs"}'
        - '{__name__="node_netstat_Tcp_PassiveOpens"}'
        - '{__name__="node_netstat_Tcp_RetransSegs"}'
        - '{__name__="node_netstat_UdpLite_InErrors"}'
        - '{__name__="node_netstat_Udp_InDatagrams"}'
        - '{__name__="node_netstat_Udp_InErrors"}'
        - '{__name__="node_netstat_Udp_NoPorts"}'
        - '{__name__="node_netstat_Udp_OutDatagrams"}'
        - '{__name__="node_netstat_Udp_RcvbufErrors"}'
        - '{__name__="node_netstat_Udp_SndbufErrors"}'
        - '{__name__="node_network_carrier"}'
        - '{__name__="node_network_mtu_bytes"}'
        - '{__name__="node_network_receive_bytes_total"}'
        - '{__name__="node_network_receive_compressed_total"}'
        - '{__name__="node_network_receive_drop_total"}'
        - '{__name__="node_network_receive_errs_total"}'
        - '{__name__="node_network_receive_fifo_total"}'
        - '{__name__="node_network_receive_frame_total"}'
        - '{__name__="node_network_receive_multicast_total"}'
        - '{__name__="node_network_receive_packets_total"}'
        - '{__name__="node_network_speed_bytes"}'
        - '{__name__="node_network_transmit_bytes_total"}'
        - '{__name__="node_network_transmit_carrier_total"}'
        - '{__name__="node_network_transmit_colls_total"}'
        - '{__name__="node_network_transmit_compressed_total"}'
        - '{__name__="node_network_transmit_drop_total"}'
        - '{__name__="node_network_transmit_errs_total"}'
        - '{__name__="node_network_transmit_fifo_total"}'
        - '{__name__="node_network_transmit_packets_total"}'
        - '{__name__="node_network_transmit_queue_length"}'
        - '{__name__="node_network_up"}'
        - '{__name__="node_nf_conntrack_entries"}'
        - '{__name__="node_nf_conntrack_entries_limit"}'
        - '{__name__="node_power_supply_online"}'
        - '{__name__="node_processes_max_processes"}'
        - '{__name__="node_processes_max_threads"}'
        - '{__name__="node_processes_pids"}'
        - '{__name__="node_processes_state"}'
        - '{__name__="node_processes_threads"}'
        - '{__name__="node_procs_blocked"}'
        - '{__name__="node_procs_running"}'
        - '{__name__="node_schedstat_running_seconds_total"}'
        - '{__name__="node_schedstat_timeslices_total"}'
        - '{__name__="node_schedstat_waiting_seconds_total"}'
        - '{__name__="node_scrape_collector_duration_seconds"}'
        - '{__name__="node_scrape_collector_success"}'
        - '{__name__="node_sockstat_FRAG_inuse"}'
        - '{__name__="node_sockstat_FRAG_memory"}'
        - '{__name__="node_sockstat_RAW_inuse"}'
        - '{__name__="node_sockstat_TCP_alloc"}'
        - '{__name__="node_sockstat_TCP_inuse"}'
        - '{__name__="node_sockstat_TCP_mem"}'
        - '{__name__="node_sockstat_TCP_mem_bytes"}'
        - '{__name__="node_sockstat_TCP_orphan"}'
        - '{__name__="node_sockstat_TCP_tw"}'
        - '{__name__="node_sockstat_UDPLITE_inuse"}'
        - '{__name__="node_sockstat_UDP_inuse"}'
        - '{__name__="node_sockstat_UDP_mem"}'
        - '{__name__="node_sockstat_UDP_mem_bytes"}'
        - '{__name__="node_sockstat_sockets_used"}'
        - '{__name__="node_softnet_dropped_total"}'
        - '{__name__="node_softnet_processed_total"}'
        - '{__name__="node_softnet_times_squeezed_total"}'
        - '{__name__="node_systemd_socket_accepted_connections_total"}'
        - '{__name__="node_systemd_units"}'
        - '{__name__="node_textfile_scrape_error"}'
        - '{__name__="node_time_seconds"}'
        - '{__name__="node_timex_estimated_error_seconds"}'
        - '{__name__="node_timex_frequency_adjustment_ratio"}'
        - '{__name__="node_timex_loop_time_constant"}'
        - '{__name__="node_timex_maxerror_seconds"}'
        - '{__name__="node_timex_offset_seconds"}'
        - '{__name__="node_timex_sync_status"}'
        - '{__name__="node_timex_tai_offset_seconds"}'
        - '{__name__="node_timex_tick_seconds"}'
        - '{__name__="node_uname_info"}'
        - '{__name__="node_vmstat_oom_kill"}'
        - '{__name__="node_vmstat_pgfault"}'
        - '{__name__="node_vmstat_pgmajfault"}'
        - '{__name__="node_vmstat_pgpgin"}'
        - '{__name__="node_vmstat_pgpgout"}'
        - '{__name__="node_vmstat_pswpin"}'
        - '{__name__="node_vmstat_pswpout"}'
        - '{__name__="process_cpu_seconds_total"}'
        - '{__name__="process_max_fds"}'
        - '{__name__="process_open_fds"}'
        - '{__name__="process_resident_memory_max_bytes"}'
        - '{__name__="process_virtual_memory_bytes"}'
        - '{__name__="process_virtual_memory_max_bytes"}'
    static_configs:
      - targets:
          - 'nginx.wordwisdom.tw'
    basic_auth:
      username: "user"
      password: "password"
  # Consul for proxy nodes
  - job_name: 'proxy_node_exporter'
    scrape_interval: 1m
    metrics_path: /system_info/hardware_metrics
    consul_sd_configs:
      - server: 'consul.service.tw'
        datacenter: 'dc1'
        token: '675bb27b-3142-6308-6146-969a14fb7dd4'
        services:
          - proxy-node-exporter
    basic_auth:
      username: "user"
      password: "password"
    relabel_configs:
      - source_labels: [ '__meta_consul_service_id' ]
        replacement: '$1'
        target_label: hostname
remote_write:
  - url: 'http://influxdb:8086/api/v1/prom/write?db=prometheus&u=ibdo&p=ibdo2018'
remote_read:
  - url: 'http://influxdb:8086/api/v1/prom/read?db=prometheus&u=ibdo&p=ibdo2018'
```
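The `match[]` list above names every federated metric exactly. As a hedged alternative sketch, Prometheus selectors also accept regular-expression matchers on `__name__`, so part of the list could be written more compactly. This is not equivalent to the config above: each pattern pulls every series whose name matches the prefix, which is a broader set than the explicit list, and the `cass_jvm_*`, `machine_*`, `process_*`, and `origin_prometheus` entries are left out here for brevity. The target and credentials are copied from the config above.

```yaml=
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 5m
    scrape_timeout: 3m
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Broader than the explicit list: matches every metric with these prefixes
        - '{__name__=~"kube_.*"}'
        - '{__name__=~"node_.*"}'
        - '{__name__=~"container_(cpu|fs|memory|network)_.*"}'
    static_configs:
      - targets:
          - 'nginx.wordwisdom.tw'
    basic_auth:
      username: "user"
      password: "password"
```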
### Ingress-controller
> https://hackmd.io/@willy83310/rkcaMtDwd
## Reference
[**kube-prometheus**](https://github.com/prometheus-operator/kube-prometheus)