---
# System prepended metadata

title: Kubernetes Monitoring Solution
tags: [prometheus, WAMS]

---


# Kubernetes Monitoring Solution

[TOC]

![](https://i.imgur.com/1yWytDd.jpg)

## Main Components
### Prometheus-operator
> github: https://github.com/prometheus-operator/prometheus-operator

**Prometheus Operator** is a controller open-sourced by **CoreOS** for managing Prometheus on Kubernetes. It relies on custom resources, and its goal is to <font color='red'>**simplify deploying and maintaining Prometheus**</font>.

:::spoiler Prometheus Operator has the following features
1. **Kubernetes Custom Resources**: Use Kubernetes <font color='red'>custom resources</font> to deploy and manage Prometheus, Alertmanager, and related components.

2. **Simplified Deployment Configuration**: Configure the fundamentals of Prometheus like versions, persistence, retention policies, and replicas from a native Kubernetes resource.

3. **Prometheus Target Configuration**: Automatically generate monitoring target configurations based on familiar Kubernetes label queries; no need to learn a Prometheus specific configuration language.
:::


:::spoiler Prometheus Operator provides the following main custom resources:
* `Prometheus`: which defines a desired Prometheus deployment.

* `Alertmanager`: which defines a desired Alertmanager deployment.

* `ThanosRuler`: which defines a desired Thanos Ruler deployment.

* `ServiceMonitor`: which declaratively specifies how groups of Kubernetes **services** should be monitored. The Operator <font color='red'>**automatically generates Prometheus scrape configuration**</font> based on the current state of the objects in the API server.

* `PodMonitor`: which declaratively specifies how groups of **pods** should be monitored. The Operator <font color='red'>**automatically generates Prometheus scrape configuration**</font> based on the current state of the objects in the API server.

* `Probe`: which declaratively specifies how groups of ingresses or static targets should be monitored. The Operator automatically generates Prometheus scrape configuration based on the definition.

* `PrometheusRule`: which defines a desired set of Prometheus alerting and/or recording rules. The Operator generates a rule file, which can be used by Prometheus instances.

* `AlertmanagerConfig`: which declaratively specifies subsections of the Alertmanager configuration, allowing routing of alerts to custom receivers, and setting inhibit rules.
:::
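
To make the `ServiceMonitor` resource more concrete, here is a minimal sketch. The resource name, namespaces, labels, and port name are hypothetical, and whether a given Prometheus instance actually picks it up depends on that instance's `serviceMonitorSelector`.

```yaml
# Hypothetical example: names, namespaces, labels and the port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    team: example
spec:
  # Select Services that carry this label.
  selector:
    matchLabels:
      app: example-app
  # Look for matching Services in this namespace.
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: web        # name of the Service port to scrape
      path: /metrics
      interval: 30s
```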
---
### Prometheus
> github: https://github.com/prometheus/prometheus

![Prometheus architecture](https://camo.githubusercontent.com/f14ac82eda765733a5f2b5200d78b4ca84b62559d17c9835068423b223588939/68747470733a2f2f63646e2e6a7364656c6976722e6e65742f67682f70726f6d6574686575732f70726f6d65746865757340633334323537643036396336333036383564613335626365663038343633326666643564363230392f646f63756d656e746174696f6e2f696d616765732f6172636869746563747572652e737667)

:::spoiler How Prometheus differs from other monitoring systems
1. A multi-dimensional data model (time series defined by metric name and set of key/value dimensions)
2. <font color='red'>PromQL</font>, a powerful and flexible query language to leverage this dimensionality
3. No dependency on distributed storage; single server nodes are autonomous
4. An <font color='red'>HTTP pull model</font> for time series collection
5. Pushing time series is supported via an intermediary gateway (<font color='red'>push gateway</font>) for batch jobs
6. Targets are discovered via service discovery (e.g. **consul**) or static configuration (**prometheus.yml**)
7. Multiple modes of graphing and dashboarding support
8. Support for hierarchical and horizontal <font color='red'>federation</font>

:::
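
As a small illustration of the multi-dimensional data model and PromQL, the following rule-file sketch aggregates the per-CPU, per-mode series of one metric into a single utilisation value per instance. The group and rule names are hypothetical; `node_cpu_seconds_total` is the CPU counter exposed by node-exporter.

```yaml
# Hypothetical recording rule; group name and rule name are assumptions.
groups:
  - name: example-node-rules
    rules:
      # Fraction of CPU time not spent idle, averaged over all CPUs of an instance.
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```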



---
### Node-exporter
> github: https://github.com/prometheus/node_exporter

Prometheus exporter for **hardware and OS metrics** exposed by \*NIX kernels, written in Go with pluggable metric collectors.

---
### Kube-state-metrics
> github: https://github.com/kubernetes/kube-state-metrics
> more exposed metrics: https://github.com/kubernetes/kube-state-metrics/tree/master/docs

kube-state-metrics is a simple service that **listens to the Kubernetes API server** and generates metrics about the state of the objects (e.g. `pod state`, `container state`, `endpoints`, `service`).


---
### Prometheus-Adapter
> github: https://github.com/kubernetes-sigs/prometheus-adapter

Prometheus Adapter implements the Kubernetes [resource metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/resource-metrics-api.md), [custom metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md), and [external metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/external-metrics-api.md) APIs on top of Prometheus.
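
The adapter is driven by a rules file that maps Prometheus series to Kubernetes metrics. Below is a rough sketch of one custom-metrics rule, following the format described in the adapter's documentation. The `http_requests_total` series and the 2m rate window are assumptions for illustration, not something this setup necessarily exposes.

```yaml
# Hypothetical adapter rule; the series name and rate window are assumptions.
rules:
  # Which Prometheus series to consider.
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    # Map Prometheus labels to Kubernetes resources.
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    # Expose the counter under a rate-style metric name.
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    # Query template the adapter fills in for each request.
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```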

---
### Ingress
#### Ingress resource
Defines the routing rules. Most of the configuration lives in `spec`; settings that depend on the underlying implementation are added as `annotations`.
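
A minimal sketch of such a resource, assuming an ingress-nginx controller and Kubernetes 1.19 or later (`networking.k8s.io/v1`). The host name and the `basic-auth` Secret are hypothetical; `prometheus-k8s` on port 9090 is the Service created by kube-prometheus.

```yaml
# Hypothetical example: host, Secret name and ingress class are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus
  namespace: monitoring
  annotations:
    # Controller-specific settings; these keys apply to ingress-nginx only.
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth
spec:
  ingressClassName: nginx
  # Routing rules: send all paths on this host to the Prometheus Service.
  rules:
    - host: prometheus.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-k8s
                port:
                  number: 9090
```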

#### Ingress controller
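> Types of Ingress controllers: https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/#additional-controllers

Implements the underlying service, then dynamically updates the configuration inside the nginx pod according to the settings in the Ingress resources.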

---
## Installation Steps
### Kube-prometheus
:::spoiler Components included in this package:
* Prometheus Operator
* Highly available Prometheus
* Highly available Alertmanager (deployed locally; not created on GKE)
* Prometheus node-exporter
* Prometheus Adapter for Kubernetes Metrics APIs
* Kube-state-metrics
* Grafana (deployed locally; not created on GKE)
:::

<br>

1. Check Prerequisites: the kubelet configuration must contain these flags:
* `--authentication-token-webhook=true`
* `--authorization-mode=Webhook`
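
When the kubelet is configured through a configuration file rather than command-line flags, the same two settings would look roughly like this. This is a sketch only; how the file reaches the nodes depends on the cluster.

```yaml
# Sketch of the equivalent kubelet configuration file settings.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  webhook:
    enabled: true      # equivalent to --authentication-token-webhook=true
authorization:
  mode: Webhook        # equivalent to --authorization-mode=Webhook
```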

2. Check Compatibility

| kube-prometheus stack | Kubernetes 1.16 | Kubernetes 1.17 | Kubernetes 1.18 | Kubernetes 1.19 | Kubernetes 1.20 |
| --------------------- | --------------- | --------------- | --------------- | --------------- | --------------- |
| `release-0.4`         | ✔ (v1.16.5+)    | ✔               | ✗               | ✗               | ✗               |
| `release-0.5`         | ✗               | ✗               | ✔               | ✗               | ✗               |
| `release-0.6`         | ✗               | ✗               | ✗               | ✔               | ✗               |
| `release-0.7`         | ✗               | ✗               | ✗               | ✔               | ✔               |
| `HEAD`                | ✗               | ✗               | ✗               | ✔               | ✔               |

3. Clone project kube-prometheus
```bash=
$ git clone https://github.com/prometheus-operator/kube-prometheus.git
```

4. Checkout project version according to Compatibility
```bash=
$ git -C kube-prometheus checkout <tag or commit SHA>
```

5. Remove the Grafana and Alertmanager manifests
```bash=
$ cd kube-prometheus
$ rm -f manifests/alertmanager-*
$ rm -f manifests/grafana-*
```

6. Create the monitoring stack using the config in the manifests directory
```bash= !
# Create the namespace and CRDs, and then wait for them to be available before creating the remaining resources
$ kubectl create -f manifests/setup
$ until kubectl get servicemonitors --all-namespaces ;do date; sleep 1; echo ""; done
$ kubectl create -f manifests/
```

To tear down the stack:
```bash=
$ kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
```

7. (Optional, for testing) Access the dashboards by port-forwarding

* Prometheus

    ```bash !
    $ kubectl --namespace monitoring port-forward svc/prometheus-k8s 9090
    ```
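    Then access via http://localhost:9090. By default `port-forward` listens only on localhost; add `--address 0.0.0.0` to the command to reach it via http://{HOST_IP}:9090 from another machine.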

* Grafana (only if the Grafana manifests were not removed in step 5)

    ```bash !
    $ kubectl --namespace monitoring port-forward svc/grafana 3000
    ```
    Then access via http://localhost:3000 and log in with the default Grafana `user:password` of `admin:admin`.


### Consul
> https://hackmd.io/Uew-BqRWQz6qxSix47SMkg

### Prometheus
> prometheus stacks: http://ibdo.efoxconn.com:5000/QAD/prometheus-stack

docker-compose.yml
```yaml= !
version: "3.1"
services:
  grafana:
    image: grafana/grafana:7.3.5
    container_name: grafana
    ports:
      - '3000:3000'
    environment:
      - GF_SECURITY_ADMIN_USER=ibdo # grafana user
      - GF_SECURITY_ADMIN_PASSWORD=ibdo2018 # grafana password
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/ # Grafana provisioning files (dashboard JSON)
    restart: always
    depends_on:
      - prometheus
  prometheus:
    image: prom/prometheus:v2.23.0
    container_name: prometheus
    ports:
      - '9090:9090'
    command:
      - '--config.file=/etc/prometheus/prometheus.yml' # path to the Prometheus config file
      - '--storage.tsdb.retention.time=2h' # keep only two hours of data locally
      - '--web.enable-lifecycle' # allow reloading the config via an HTTP POST to /-/reload
    volumes:
      - ./prometheus/config/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    restart: always
    depends_on:
      - influxdb
  influxdb:
    image: influxdb:1.8.3
    container_name: influxdb
    ports:
      - '8086:8086'
    environment:
      - INFLUXDB_DB=prometheus
      - INFLUXDB_ADMIN_USER=ibdo
      - INFLUXDB_ADMIN_PASSWORD=ibdo2018
      - INFLUXDB_DATA_MAX_SERIES_PER_DATABASE=0 # The maximum number of series allowed per database before writes are dropped.
    restart: always
    volumes:
      - influxdb_data:/var/lib/influxdb

volumes:
  prometheus_data: {}
  grafana_data: {}
  influxdb_data: {}
```

prometheus.yml
```yaml= 
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  # GKE Kube-prometheus
  - job_name: 'federate'
    scrape_interval: 5m
    scrape_timeout: 3m
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # kube_prometheus_metrics
        - '{__name__="cass_jvm_heap"}'
        - '{__name__="cass_jvm_heap_max"}'
        - '{__name__="cass_jvm_noheap"}'
        - '{__name__="cass_jvm_noheap_max"}'
        - '{__name__="container_cpu_usage_seconds_total"}'
        - '{__name__="container_fs_limit_bytes"}'
        - '{__name__="container_fs_usage_bytes"}'
        - '{__name__="container_memory_rss"}'
        - '{__name__="container_memory_working_set_bytes"}'
        - '{__name__="container_network_receive_bytes_total"}'
        - '{__name__="container_network_transmit_bytes_total"}'
        - '{__name__="kube_configmap_info"}'
        - '{__name__="kube_namespace_labels"}'
        - '{__name__="kube_node_info"}'
        - '{__name__="kube_node_status_allocatable_cpu_cores"}'
        - '{__name__="kube_node_status_capacity_cpu_cores"}'
        - '{__name__="kube_node_status_capacity_memory_bytes"}'
        - '{__name__="kube_node_status_capacity_pods"}'
        - '{__name__="kube_node_status_condition"}'
        - '{__name__="kube_pod_container_info"}'
        - '{__name__="kube_pod_container_resource_limits_cpu_cores"}'
        - '{__name__="kube_pod_container_resource_limits_memory_bytes"}'
        - '{__name__="kube_pod_container_resource_requests_cpu_cores"}'
        - '{__name__="kube_pod_container_resource_requests_memory_bytes"}'
        - '{__name__="kube_pod_container_status_restarts_total"}'
        - '{__name__="kube_pod_info"}'
        - '{__name__="kube_pod_status_phase"}'
        - '{__name__="kube_secret_info"}'
        - '{__name__="kube_service_info"}'
        - '{__name__="machine_cpu_cores"}'
        - '{__name__="machine_memory_bytes"}'
        - '{__name__="origin_prometheus"}'
        # node_exporter_full_metrics
        - '{__name__="node_arp_entries"}'
        - '{__name__="node_context_switches_total"}'
        - '{__name__="node_cooling_device_cur_state"}'
        - '{__name__="node_cooling_device_max_state"}'
        - '{__name__="node_cpu_seconds_total"}'
        - '{__name__="node_disk_discard_time_seconds_total"}'
        - '{__name__="node_disk_discards_completed_total"}'
        - '{__name__="node_disk_discards_merged_total"}'
        - '{__name__="node_disk_io_now"}'
        - '{__name__="node_disk_io_time_seconds_total"}'
        - '{__name__="node_disk_io_time_weighted_seconds_total"}'
        - '{__name__="node_disk_read_bytes_total"}'
        - '{__name__="node_disk_read_time_seconds_total"}'
        - '{__name__="node_disk_reads_completed_total"}'
        - '{__name__="node_disk_reads_merged_total"}'
        - '{__name__="node_disk_write_time_seconds_total"}'
        - '{__name__="node_disk_writes_completed_total"}'
        - '{__name__="node_disk_writes_merged_total"}'
        - '{__name__="node_disk_written_bytes_total"}'
        - '{__name__="node_entropy_available_bits"}'
        - '{__name__="node_filefd_allocated"}'
        - '{__name__="node_filefd_maximum"}'
        - '{__name__="node_filesystem_avail_bytes"}'
        - '{__name__="node_filesystem_device_error"}'
        - '{__name__="node_filesystem_files"}'
        - '{__name__="node_filesystem_files_free"}'
        - '{__name__="node_filesystem_free_bytes"}'
        - '{__name__="node_filesystem_readonly"}'
        - '{__name__="node_filesystem_size_bytes"}'
        - '{__name__="node_forks_total"}'
        - '{__name__="node_hwmon_temp_celsius"}'
        - '{__name__="node_hwmon_temp_crit_alarm_celsius"}'
        - '{__name__="node_hwmon_temp_crit_celsius"}'
        - '{__name__="node_hwmon_temp_crit_hyst_celsius"}'
        - '{__name__="node_hwmon_temp_max_celsius"}'
        - '{__name__="node_interrupts_total"}'
        - '{__name__="node_intr_total"}'
        - '{__name__="node_load1"}'
        - '{__name__="node_load15"}'
        - '{__name__="node_load5"}'
        - '{__name__="node_memory_Active_anon_bytes"}'
        - '{__name__="node_memory_Active_bytes"}'
        - '{__name__="node_memory_Active_file_bytes"}'
        - '{__name__="node_memory_AnonHugePages_bytes"}'
        - '{__name__="node_memory_AnonPages_bytes"}'
        - '{__name__="node_memory_Bounce_bytes"}'
        - '{__name__="node_memory_Buffers_bytes"}'
        - '{__name__="node_memory_Cached_bytes"}'
        - '{__name__="node_memory_CommitLimit_bytes"}'
        - '{__name__="node_memory_Committed_AS_bytes"}'
        - '{__name__="node_memory_DirectMap1G_bytes"}'
        - '{__name__="node_memory_DirectMap2M_bytes"}'
        - '{__name__="node_memory_DirectMap4k_bytes"}'
        - '{__name__="node_memory_Dirty_bytes"}'
        - '{__name__="node_memory_HardwareCorrupted_bytes"}'
        - '{__name__="node_memory_HugePages_Free"}'
        - '{__name__="node_memory_HugePages_Rsvd"}'
        - '{__name__="node_memory_HugePages_Surp"}'
        - '{__name__="node_memory_HugePages_Total"}'
        - '{__name__="node_memory_Hugepagesize_bytes"}'
        - '{__name__="node_memory_Inactive_anon_bytes"}'
        - '{__name__="node_memory_Inactive_bytes"}'
        - '{__name__="node_memory_Inactive_file_bytes"}'
        - '{__name__="node_memory_KernelStack_bytes"}'
        - '{__name__="node_memory_Mapped_bytes"}'
        - '{__name__="node_memory_MemFree_bytes"}'
        - '{__name__="node_memory_MemTotal_bytes"}'
        - '{__name__="node_memory_Mlocked_bytes"}'
        - '{__name__="node_memory_NFS_Unstable_bytes"}'
        - '{__name__="node_memory_PageTables_bytes"}'
        - '{__name__="node_memory_Percpu_bytes"}'
        - '{__name__="node_memory_SReclaimable_bytes"}'
        - '{__name__="node_memory_SUnreclaim_bytes"}'
        - '{__name__="node_memory_ShmemHugePages_bytes"}'
        - '{__name__="node_memory_ShmemPmdMapped_bytes"}'
        - '{__name__="node_memory_Shmem_bytes"}'
        - '{__name__="node_memory_Slab_bytes"}'
        - '{__name__="node_memory_SwapCached_bytes"}'
        - '{__name__="node_memory_SwapTotal_bytes"}'
        - '{__name__="node_memory_Unevictable_bytes"}'
        - '{__name__="node_memory_VmallocChunk_bytes"}'
        - '{__name__="node_memory_VmallocTotal_bytes"}'
        - '{__name__="node_memory_VmallocUsed_bytes"}'
        - '{__name__="node_memory_WritebackTmp_bytes"}'
        - '{__name__="node_memory_Writeback_bytes"}'
        - '{__name__="node_netstat_Icmp_InErrors"}'
        - '{__name__="node_netstat_Icmp_InMsgs"}'
        - '{__name__="node_netstat_Icmp_OutMsgs"}'
        - '{__name__="node_netstat_IpExt_InOctets"}'
        - '{__name__="node_netstat_IpExt_OutOctets"}'
        - '{__name__="node_netstat_Ip_Forwarding"}'
        - '{__name__="node_netstat_TcpExt_ListenDrops"}'
        - '{__name__="node_netstat_TcpExt_ListenOverflows"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesFailed"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesRecv"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesSent"}'
        - '{__name__="node_netstat_TcpExt_TCPSynRetrans"}'
        - '{__name__="node_netstat_Tcp_ActiveOpens"}'
        - '{__name__="node_netstat_Tcp_CurrEstab"}'
        - '{__name__="node_netstat_Tcp_InErrs"}'
        - '{__name__="node_netstat_Tcp_InSegs"}'
        - '{__name__="node_netstat_Tcp_MaxConn"}'
        - '{__name__="node_netstat_Tcp_OutSegs"}'
        - '{__name__="node_netstat_Tcp_PassiveOpens"}'
        - '{__name__="node_netstat_Tcp_RetransSegs"}'
        - '{__name__="node_netstat_UdpLite_InErrors"}'
        - '{__name__="node_netstat_Udp_InDatagrams"}'
        - '{__name__="node_netstat_Udp_InErrors"}'
        - '{__name__="node_netstat_Udp_NoPorts"}'
        - '{__name__="node_netstat_Udp_OutDatagrams"}'
        - '{__name__="node_netstat_Udp_RcvbufErrors"}'
        - '{__name__="node_netstat_Udp_SndbufErrors"}'
        - '{__name__="node_network_carrier"}'
        - '{__name__="node_network_mtu_bytes"}'
        - '{__name__="node_network_receive_bytes_total"}'
        - '{__name__="node_network_receive_compressed_total"}'
        - '{__name__="node_network_receive_drop_total"}'
        - '{__name__="node_network_receive_errs_total"}'
        - '{__name__="node_network_receive_fifo_total"}'
        - '{__name__="node_network_receive_frame_total"}'
        - '{__name__="node_network_receive_multicast_total"}'
        - '{__name__="node_network_receive_packets_total"}'
        - '{__name__="node_network_speed_bytes"}'
        - '{__name__="node_network_transmit_bytes_total"}'
        - '{__name__="node_network_transmit_carrier_total"}'
        - '{__name__="node_network_transmit_colls_total"}'
        - '{__name__="node_network_transmit_compressed_total"}'
        - '{__name__="node_network_transmit_drop_total"}'
        - '{__name__="node_network_transmit_errs_total"}'
        - '{__name__="node_network_transmit_fifo_total"}'
        - '{__name__="node_network_transmit_packets_total"}'
        - '{__name__="node_network_transmit_queue_length"}'
        - '{__name__="node_network_up"}'
        - '{__name__="node_nf_conntrack_entries"}'
        - '{__name__="node_nf_conntrack_entries_limit"}'
        - '{__name__="node_power_supply_online"}'
        - '{__name__="node_processes_max_processes"}'
        - '{__name__="node_processes_max_threads"}'
        - '{__name__="node_processes_pids"}'
        - '{__name__="node_processes_state"}'
        - '{__name__="node_processes_threads"}'
        - '{__name__="node_procs_blocked"}'
        - '{__name__="node_procs_running"}'
        - '{__name__="node_schedstat_running_seconds_total"}'
        - '{__name__="node_schedstat_timeslices_total"}'
        - '{__name__="node_schedstat_waiting_seconds_total"}'
        - '{__name__="node_scrape_collector_duration_seconds"}'
        - '{__name__="node_scrape_collector_success"}'
        - '{__name__="node_sockstat_FRAG_inuse"}'
        - '{__name__="node_sockstat_FRAG_memory"}'
        - '{__name__="node_sockstat_RAW_inuse"}'
        - '{__name__="node_sockstat_TCP_alloc"}'
        - '{__name__="node_sockstat_TCP_inuse"}'
        - '{__name__="node_sockstat_TCP_mem"}'
        - '{__name__="node_sockstat_TCP_mem_bytes"}'
        - '{__name__="node_sockstat_TCP_orphan"}'
        - '{__name__="node_sockstat_TCP_tw"}'
        - '{__name__="node_sockstat_UDPLITE_inuse"}'
        - '{__name__="node_sockstat_UDP_inuse"}'
        - '{__name__="node_sockstat_UDP_mem"}'
        - '{__name__="node_sockstat_UDP_mem_bytes"}'
        - '{__name__="node_sockstat_sockets_used"}'
        - '{__name__="node_softnet_dropped_total"}'
        - '{__name__="node_softnet_processed_total"}'
        - '{__name__="node_softnet_times_squeezed_total"}'
        - '{__name__="node_systemd_socket_accepted_connections_total"}'
        - '{__name__="node_systemd_units"}'
        - '{__name__="node_textfile_scrape_error"}'
        - '{__name__="node_time_seconds"}'
        - '{__name__="node_timex_estimated_error_seconds"}'
        - '{__name__="node_timex_frequency_adjustment_ratio"}'
        - '{__name__="node_timex_loop_time_constant"}'
        - '{__name__="node_timex_maxerror_seconds"}'
        - '{__name__="node_timex_offset_seconds"}'
        - '{__name__="node_timex_sync_status"}'
        - '{__name__="node_timex_tai_offset_seconds"}'
        - '{__name__="node_timex_tick_seconds"}'
        - '{__name__="node_uname_info"}'
        - '{__name__="node_vmstat_oom_kill"}'
        - '{__name__="node_vmstat_pgfault"}'
        - '{__name__="node_vmstat_pgmajfault"}'
        - '{__name__="node_vmstat_pgpgin"}'
        - '{__name__="node_vmstat_pgpgout"}'
        - '{__name__="node_vmstat_pswpin"}'
        - '{__name__="node_vmstat_pswpout"}'
        - '{__name__="process_cpu_seconds_total"}'
        - '{__name__="process_max_fds"}'
        - '{__name__="process_open_fds"}'
        - '{__name__="process_resident_memory_max_bytes"}'
        - '{__name__="process_virtual_memory_bytes"}'
        - '{__name__="process_virtual_memory_max_bytes"}'
    static_configs:
      - targets:
        - 'nginx.wordwisdom.tw'
    basic_auth:
      username: "user"
      password: "password"

  # Consul for proxy nodes
  - job_name: 'proxy_node_exporter'
    scrape_interval: 1m
    metrics_path: /system_info/hardware_metrics
    consul_sd_configs:
    - server: 'consul.service.tw'
      datacenter: 'dc1'
      token: '675bb27b-3142-6308-6146-969a14fb7dd4'
      services:
        - proxy-node-exporter
    basic_auth:
      username: "user"
      password: "password"
    relabel_configs:
        - source_labels: [ '__meta_consul_service_id' ]
          replacement: '$1'
          target_label: hostname

remote_write:
  - url: 'http://influxdb:8086/api/v1/prom/write?db=prometheus&u=ibdo&p=ibdo2018'
remote_read:
  - url: 'http://influxdb:8086/api/v1/prom/read?db=prometheus&u=ibdo&p=ibdo2018'
```
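
The `match[]` parameters accept any instant vector selector, so a long list of exact names could likely be shortened with regular-expression matchers. The fragment below sketches the idea. The patterns are illustrative assumptions and select a superset of the metrics listed above, which increases the volume transferred on each federation scrape.

```yaml
# Fragment of the 'federate' job only; the patterns are illustrative assumptions.
params:
  'match[]':
    # Every series whose name starts with kube_ (kube-state-metrics).
    - '{__name__=~"kube_.+"}'
    # A few node-exporter metric families instead of one entry per metric.
    - '{__name__=~"node_(cpu|disk|filesystem|memory|network)_.+"}'
```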

### Ingress-controller
> https://hackmd.io/@willy83310/rkcaMtDwd

## Reference
[**kube-prometheus**](https://github.com/prometheus-operator/kube-prometheus)