---
tags: prometheus, WAMS
---

# k8s Monitoring Solution

[TOC]

![](https://i.imgur.com/1yWytDd.jpg)

## Main Components

### Prometheus-operator

> github: https://github.com/prometheus-operator/prometheus-operator

**Prometheus Operator** is a controller open-sourced by **CoreOS** for managing Prometheus on Kubernetes. It builds on Kubernetes custom resources, and its goal is to <font color='red'>**simplify deploying and maintaining Prometheus**</font>.

:::spoiler Prometheus-operator has the following features
1. **Kubernetes Custom Resources**: Use Kubernetes <font color='red'>custom resources</font> to deploy and manage Prometheus, Alertmanager, and related components.
2. **Simplified Deployment Configuration**: Configure the fundamentals of Prometheus like versions, persistence, retention policies, and replicas from a native Kubernetes resource.
3. **Prometheus Target Configuration**: Automatically generate monitoring target configurations based on familiar Kubernetes label queries; no need to learn a Prometheus specific configuration language.
:::

:::spoiler Prometheus-operator includes the following main components:
* `Prometheus`: which defines a desired Prometheus deployment.
* `Alertmanager`: which defines a desired Alertmanager deployment.
* `ThanosRuler`: which defines a desired Thanos Ruler deployment.
* `ServiceMonitor`: which declaratively specifies how groups of Kubernetes **services** should be monitored. The Operator <font color='red'>**automatically generates Prometheus scrape configuration**</font> based on the current state of the objects in the API server (see the sample manifest after this list).
* `PodMonitor`: which declaratively specifies how groups of **pods** should be monitored. The Operator <font color='red'>**automatically generates Prometheus scrape configuration**</font> based on the current state of the objects in the API server.
* `Probe`: which declaratively specifies how groups of ingresses or static targets should be monitored. The Operator automatically generates Prometheus scrape configuration based on the definition.
* `PrometheusRule`: which defines a desired set of Prometheus alerting and/or recording rules. The Operator generates a rule file, which can be used by Prometheus instances.
* `AlertmanagerConfig`: which declaratively specifies subsections of the Alertmanager configuration, allowing routing of alerts to custom receivers, and setting inhibit rules.
:::
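A minimal `ServiceMonitor` sketch to illustrate the declarative style; the `example-app` name, its `app` label, and the `web` port name are hypothetical placeholders, not part of this stack:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app          # hypothetical name, for illustration only
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-app       # scrape every matching Service carrying this label
  endpoints:
    - port: web              # must match a named port on the Service
      interval: 30s          # scrape every 30 seconds
```

The Operator watches objects like this and regenerates the Prometheus scrape configuration on the fly, so the in-cluster `prometheus.yml` never needs to be edited by hand.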
---

### Prometheus

> github: https://github.com/prometheus/prometheus

![image alt](https://camo.githubusercontent.com/f14ac82eda765733a5f2b5200d78b4ca84b62559d17c9835068423b223588939/68747470733a2f2f63646e2e6a7364656c6976722e6e65742f67682f70726f6d6574686575732f70726f6d65746865757340633334323537643036396336333036383564613335626365663038343633326666643564363230392f646f63756d656e746174696f6e2f696d616765732f6172636869746563747572652e737667)

:::spoiler How Prometheus differs from other monitoring systems
1. A multi-dimensional data model (time series defined by metric name and set of key/value dimensions)
2. <font color='red'>PromQL</font>, a powerful and flexible query language to leverage this dimensionality
3. No dependency on distributed storage; single server nodes are autonomous
4. An <font color='red'>HTTP pull model</font> for time series collection
5. Pushing time series is supported via an intermediary gateway (<font color='red'>push gateway</font>) for batch jobs
6. Targets are discovered via service discovery (e.g. **consul**) or static configuration (**prometheus.yml**)
7. Multiple modes of graphing and dashboarding support
8. Support for hierarchical and horizontal <font color='red'>federation</font>
:::

---

### Node-exporter

> github: https://github.com/prometheus/node_exporter

Prometheus exporter for **hardware and OS metrics** exposed by \*NIX kernels, written in Go with pluggable metric collectors.

---

### Kube-state-metrics

> github: https://github.com/kubernetes/kube-state-metrics
> more exposed metrics: https://github.com/kubernetes/kube-state-metrics/tree/master/docs

kube-state-metrics is a simple service that **listens to the Kubernetes API server** and generates metrics (e.g. `pod state`, `container state`, `endpoints`, `service`) about the state of the objects.

---

### Prometheus-Adapter

> github: https://github.com/kubernetes-sigs/prometheus-adapter

This repository contains an implementation of the Kubernetes [resource metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/resource-metrics-api.md), [custom metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md), and [external metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/external-metrics-api.md) APIs. It answers those APIs by translating each request into a PromQL query, which is what lets components such as the Horizontal Pod Autoscaler consume metrics that Prometheus already scrapes.
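As a rough sketch of how the adapter maps Prometheus series onto the custom metrics API, its config file takes rules like the following; the `http_requests_total` series is a hypothetical example, not something this stack necessarily exposes:

```yaml
# prometheus-adapter rule: expose a Prometheus counter as a custom metric
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # which series to consider
    resources:
      overrides:
        namespace: {resource: "namespace"}   # map series labels back to k8s objects
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"                  # exposed as http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

With a rule like this, an HPA could target `http_requests_per_second` on pods through the custom metrics API.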
---

### Ingress

#### Ingress resource

Defines the routing rules. Most of the configuration lives in `spec`, plus extra annotations whose keys differ depending on the underlying controller implementation (a minimal example follows at the end of this section).

#### Ingress controller

> Ingress controller types: https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/#additional-controllers

Responsible for implementing the underlying service; it dynamically rewrites the configuration inside the nginx pod according to the Ingress resource settings.
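A minimal Ingress sketch, assuming an nginx-class controller; the host `example.local`, the backend `example-service`, and the rewrite annotation are hypothetical placeholders:

```yaml
apiVersion: networking.k8s.io/v1        # requires Kubernetes v1.19+
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    # controller-specific behaviour is tuned through annotations
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - host: example.local               # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service   # hypothetical backend Service
                port:
                  number: 80
```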
---

## Installation

### Kube-prometheus

:::spoiler This package includes the following projects:
* Prometheus Operator
* Highly available Prometheus
* Highly available Alertmanager (deployed locally only; this service is not created on GKE)
* Prometheus node-exporter
* Prometheus Adapter for Kubernetes Metrics APIs
* Kube-state-metrics
* Grafana (deployed locally only; this service is not created on GKE)
:::

<br>

1. Check prerequisites

    The kubelet configuration must contain these flags:
    * `--authentication-token-webhook=true`
    * `--authorization-mode=Webhook`

2. Check compatibility

    | kube-prometheus stack | Kubernetes 1.16 | Kubernetes 1.17 | Kubernetes 1.18 | Kubernetes 1.19 | Kubernetes 1.20 |
    | --------------------- | --------------- | --------------- | --------------- | --------------- | --------------- |
    | `release-0.4` | ✔ (v1.16.5+) | ✔ | ✗ | ✗ | ✗ |
    | `release-0.5` | ✗ | ✗ | ✔ | ✗ | ✗ |
    | `release-0.6` | ✗ | ✗ | ✗ | ✔ | ✗ |
    | `release-0.7` | ✗ | ✗ | ✗ | ✔ | ✔ |
    | `HEAD` | ✗ | ✗ | ✗ | ✔ | ✔ |

3. Clone the kube-prometheus project
    ```bash=
    $ git clone https://github.com/prometheus-operator/kube-prometheus.git
    ```

4. Check out the project version that matches the compatibility matrix
    ```bash=
    $ git checkout <tag or commit SHA>
    ```

5. Remove the Grafana and Alertmanager YAML files
    ```bash=
    $ cd kube-prometheus
    $ rm -f manifests/alertmanager-*
    $ rm -f manifests/grafana-*
    ```

6. Create the monitoring stack using the config in the manifests directory
    ```bash= !
    # Create the namespace and CRDs, and then wait for them to be available before creating the remaining resources
    $ kubectl create -f manifests/setup
    $ until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
    $ kubectl create -f manifests/
    ```
    To tear down the stack:
    ```bash=
    $ kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
    ```

7. (TEST, optional) Access the dashboards by port-forwarding
    * Prometheus
        ```bash !
        $ kubectl --namespace monitoring port-forward svc/prometheus-k8s 9090
        ```
        Then access via http://{HOST_IP}:9090
    * Grafana
        ```bash !
        $ kubectl --namespace monitoring port-forward svc/grafana 3000
        ```
        Then access via http://localhost:3000 and use the default grafana `user:password` of `admin:admin`.
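Once the stack is running, alerting and recording rules are managed declaratively through the Operator's `PrometheusRule` objects instead of rule files on disk. A minimal sketch, assuming the default kube-prometheus rule selector labels (`prometheus: k8s`, `role: alert-rules`); the alert itself is a hypothetical example:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert-rules    # hypothetical name, for illustration only
  namespace: monitoring
  labels:
    prometheus: k8s            # kube-prometheus Prometheus instances select
    role: alert-rules          # rules by these labels
spec:
  groups:
    - name: node.rules
      rules:
        - alert: HostHighLoad  # hypothetical alert
          expr: node_load5 > 10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High 5m load average on {{ $labels.instance }}"
```

The Operator renders objects like this into a rule file and wires it into the running Prometheus, so `kubectl apply` is the whole deployment step.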
### Consul

> https://hackmd.io/Uew-BqRWQz6qxSix47SMkg

### Prometheus

> prometheus stacks: http://ibdo.efoxconn.com:5000/QAD/prometheus-stack

docker-compose.yml
```yaml= !
version: "3.1"

services:
  grafana:
    image: grafana/grafana:7.3.5
    container_name: grafana
    ports:
      - '3000:3000'
    environment:
      - GF_SECURITY_ADMIN_USER=ibdo           # grafana user
      - GF_SECURITY_ADMIN_PASSWORD=ibdo2018   # grafana password
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/   # holds Grafana dashboard JSON files
    restart: always
    depends_on:
      - prometheus

  prometheus:
    image: prom/prometheus:v2.23.0
    container_name: prometheus
    ports:
      - '9090:9090'
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'   # path to the Prometheus config
      - '--storage.tsdb.retention.time=2h'               # keep only two hours of local data
      - '--web.enable-lifecycle'                         # allow config reload via HTTP POST
    volumes:
      - ./prometheus/config/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    restart: always
    depends_on:
      - influxdb

  influxdb:
    image: influxdb:1.8.3
    container_name: influxdb
    ports:
      - '8086:8086'
    environment:
      - INFLUXDB_DB=prometheus
      - INFLUXDB_ADMIN_USER=ibdo
      - INFLUXDB_ADMIN_PASSWORD=ibdo2018
      - INFLUXDB_DATA_MAX_SERIES_PER_DATABASE=0   # max series allowed per database before writes are dropped (0 = unlimited)
    restart: always
    volumes:
      - influxdb_data:/var/lib/influxdb

volumes:
  prometheus_data: {}
  grafana_data: {}
  influxdb_data: {}
```
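The compose file mounts `./grafana/provisioning/` into the Grafana container, which is where datasources and dashboards can be pre-registered. A minimal datasource file, as a sketch; the path and the `prometheus` hostname assume the default compose network and the layout above:

```yaml
# ./grafana/provisioning/datasources/prometheus.yml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana proxies queries server-side
    url: http://prometheus:9090   # compose service name resolves on the shared network
    isDefault: true
```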
prometheus.yml
```yaml=
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  # GKE Kube-prometheus
  - job_name: 'federate'
    scrape_interval: 5m
    scrape_timeout: 3m
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # kube_prometheus_metrics
        - '{__name__="cass_jvm_heap"}'
        - '{__name__="cass_jvm_heap_max"}'
        - '{__name__="cass_jvm_noheap"}'
        - '{__name__="cass_jvm_noheap_max"}'
        - '{__name__="container_cpu_usage_seconds_total"}'
        - '{__name__="container_fs_limit_bytes"}'
        - '{__name__="container_fs_usage_bytes"}'
        - '{__name__="container_memory_rss"}'
        - '{__name__="container_memory_working_set_bytes"}'
        - '{__name__="container_network_receive_bytes_total"}'
        - '{__name__="container_network_transmit_bytes_total"}'
        - '{__name__="kube_configmap_info"}'
        - '{__name__="kube_namespace_labels"}'
        - '{__name__="kube_node_info"}'
        - '{__name__="kube_node_status_allocatable_cpu_cores"}'
        - '{__name__="kube_node_status_capacity_cpu_cores"}'
        - '{__name__="kube_node_status_capacity_memory_bytes"}'
        - '{__name__="kube_node_status_capacity_pods"}'
        - '{__name__="kube_node_status_condition"}'
        - '{__name__="kube_pod_container_info"}'
        - '{__name__="kube_pod_container_resource_limits_cpu_cores"}'
        - '{__name__="kube_pod_container_resource_limits_memory_bytes"}'
        - '{__name__="kube_pod_container_resource_requests_cpu_cores"}'
        - '{__name__="kube_pod_container_resource_requests_memory_bytes"}'
        - '{__name__="kube_pod_container_status_restarts_total"}'
        - '{__name__="kube_pod_info"}'
        - '{__name__="kube_pod_status_phase"}'
        - '{__name__="kube_secret_info"}'
        - '{__name__="kube_service_info"}'
        - '{__name__="machine_cpu_cores"}'
        - '{__name__="machine_memory_bytes"}'
        - '{__name__="origin_prometheus"}'
        # node_exporter_full_metrics
        - '{__name__="node_arp_entries"}'
        - '{__name__="node_context_switches_total"}'
        - '{__name__="node_cooling_device_cur_state"}'
        - '{__name__="node_cooling_device_max_state"}'
        - '{__name__="node_cpu_seconds_total"}'
        - '{__name__="node_disk_discard_time_seconds_total"}'
        - '{__name__="node_disk_discards_completed_total"}'
        - '{__name__="node_disk_discards_merged_total"}'
        - '{__name__="node_disk_io_now"}'
        - '{__name__="node_disk_io_time_seconds_total"}'
        - '{__name__="node_disk_io_time_weighted_seconds_total"}'
        - '{__name__="node_disk_read_bytes_total"}'
        - '{__name__="node_disk_read_time_seconds_total"}'
        - '{__name__="node_disk_reads_completed_total"}'
        - '{__name__="node_disk_reads_merged_total"}'
        - '{__name__="node_disk_write_time_seconds_total"}'
        - '{__name__="node_disk_writes_completed_total"}'
        - '{__name__="node_disk_writes_merged_total"}'
        - '{__name__="node_disk_written_bytes_total"}'
        - '{__name__="node_entropy_available_bits"}'
        - '{__name__="node_filefd_allocated"}'
        - '{__name__="node_filefd_maximum"}'
        - '{__name__="node_filesystem_avail_bytes"}'
        - '{__name__="node_filesystem_device_error"}'
        - '{__name__="node_filesystem_files"}'
        - '{__name__="node_filesystem_files_free"}'
        - '{__name__="node_filesystem_free_bytes"}'
        - '{__name__="node_filesystem_readonly"}'
        - '{__name__="node_filesystem_size_bytes"}'
        - '{__name__="node_forks_total"}'
        - '{__name__="node_hwmon_temp_celsius"}'
        - '{__name__="node_hwmon_temp_crit_alarm_celsius"}'
        - '{__name__="node_hwmon_temp_crit_celsius"}'
        - '{__name__="node_hwmon_temp_crit_hyst_celsius"}'
        - '{__name__="node_hwmon_temp_max_celsius"}'
        - '{__name__="node_interrupts_total"}'
        - '{__name__="node_intr_total"}'
        - '{__name__="node_load1"}'
        - '{__name__="node_load15"}'
        - '{__name__="node_load5"}'
        - '{__name__="node_memory_Active_anon_bytes"}'
        - '{__name__="node_memory_Active_bytes"}'
        - '{__name__="node_memory_Active_file_bytes"}'
        - '{__name__="node_memory_AnonHugePages_bytes"}'
        - '{__name__="node_memory_AnonPages_bytes"}'
        - '{__name__="node_memory_Bounce_bytes"}'
        - '{__name__="node_memory_Buffers_bytes"}'
        - '{__name__="node_memory_Cached_bytes"}'
        - '{__name__="node_memory_CommitLimit_bytes"}'
        - '{__name__="node_memory_Committed_AS_bytes"}'
        - '{__name__="node_memory_DirectMap1G_bytes"}'
        - '{__name__="node_memory_DirectMap2M_bytes"}'
        - '{__name__="node_memory_DirectMap4k_bytes"}'
        - '{__name__="node_memory_Dirty_bytes"}'
        - '{__name__="node_memory_HardwareCorrupted_bytes"}'
        - '{__name__="node_memory_HugePages_Free"}'
        - '{__name__="node_memory_HugePages_Rsvd"}'
        - '{__name__="node_memory_HugePages_Surp"}'
        - '{__name__="node_memory_HugePages_Total"}'
        - '{__name__="node_memory_Hugepagesize_bytes"}'
        - '{__name__="node_memory_Inactive_anon_bytes"}'
        - '{__name__="node_memory_Inactive_bytes"}'
        - '{__name__="node_memory_Inactive_file_bytes"}'
        - '{__name__="node_memory_KernelStack_bytes"}'
        - '{__name__="node_memory_Mapped_bytes"}'
        - '{__name__="node_memory_MemFree_bytes"}'
        - '{__name__="node_memory_MemTotal_bytes"}'
        - '{__name__="node_memory_Mlocked_bytes"}'
        - '{__name__="node_memory_NFS_Unstable_bytes"}'
        - '{__name__="node_memory_PageTables_bytes"}'
        - '{__name__="node_memory_Percpu_bytes"}'
        - '{__name__="node_memory_SReclaimable_bytes"}'
        - '{__name__="node_memory_SUnreclaim_bytes"}'
        - '{__name__="node_memory_ShmemHugePages_bytes"}'
        - '{__name__="node_memory_ShmemPmdMapped_bytes"}'
        - '{__name__="node_memory_Shmem_bytes"}'
        - '{__name__="node_memory_Slab_bytes"}'
        - '{__name__="node_memory_SwapCached_bytes"}'
        - '{__name__="node_memory_SwapTotal_bytes"}'
        - '{__name__="node_memory_Unevictable_bytes"}'
        - '{__name__="node_memory_VmallocChunk_bytes"}'
        - '{__name__="node_memory_VmallocTotal_bytes"}'
        - '{__name__="node_memory_VmallocUsed_bytes"}'
        - '{__name__="node_memory_WritebackTmp_bytes"}'
        - '{__name__="node_memory_Writeback_bytes"}'
        - '{__name__="node_netstat_Icmp_InErrors"}'
        - '{__name__="node_netstat_Icmp_InMsgs"}'
        - '{__name__="node_netstat_Icmp_OutMsgs"}'
        - '{__name__="node_netstat_IpExt_InOctets"}'
        - '{__name__="node_netstat_IpExt_OutOctets"}'
        - '{__name__="node_netstat_Ip_Forwarding"}'
        - '{__name__="node_netstat_TcpExt_ListenDrops"}'
        - '{__name__="node_netstat_TcpExt_ListenOverflows"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesFailed"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesRecv"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesSent"}'
        - '{__name__="node_netstat_TcpExt_TCPSynRetrans"}'
        - '{__name__="node_netstat_Tcp_ActiveOpens"}'
        - '{__name__="node_netstat_Tcp_CurrEstab"}'
        - '{__name__="node_netstat_Tcp_InErrs"}'
        - '{__name__="node_netstat_Tcp_InSegs"}'
        - '{__name__="node_netstat_Tcp_MaxConn"}'
        - '{__name__="node_netstat_Tcp_OutSegs"}'
        - '{__name__="node_netstat_Tcp_PassiveOpens"}'
        - '{__name__="node_netstat_Tcp_RetransSegs"}'
        - '{__name__="node_netstat_UdpLite_InErrors"}'
        - '{__name__="node_netstat_Udp_InDatagrams"}'
        - '{__name__="node_netstat_Udp_InErrors"}'
        - '{__name__="node_netstat_Udp_NoPorts"}'
        - '{__name__="node_netstat_Udp_OutDatagrams"}'
        - '{__name__="node_netstat_Udp_RcvbufErrors"}'
        - '{__name__="node_netstat_Udp_SndbufErrors"}'
        - '{__name__="node_network_carrier"}'
        - '{__name__="node_network_mtu_bytes"}'
        - '{__name__="node_network_receive_bytes_total"}'
        - '{__name__="node_network_receive_compressed_total"}'
        - '{__name__="node_network_receive_drop_total"}'
        - '{__name__="node_network_receive_errs_total"}'
        - '{__name__="node_network_receive_fifo_total"}'
        - '{__name__="node_network_receive_frame_total"}'
        - '{__name__="node_network_receive_multicast_total"}'
        - '{__name__="node_network_receive_packets_total"}'
        - '{__name__="node_network_speed_bytes"}'
        - '{__name__="node_network_transmit_bytes_total"}'
        - '{__name__="node_network_transmit_carrier_total"}'
        - '{__name__="node_network_transmit_colls_total"}'
        - '{__name__="node_network_transmit_compressed_total"}'
        - '{__name__="node_network_transmit_drop_total"}'
        - '{__name__="node_network_transmit_errs_total"}'
        - '{__name__="node_network_transmit_fifo_total"}'
        - '{__name__="node_network_transmit_packets_total"}'
        - '{__name__="node_network_transmit_queue_length"}'
        - '{__name__="node_network_up"}'
        - '{__name__="node_nf_conntrack_entries"}'
        - '{__name__="node_nf_conntrack_entries_limit"}'
        - '{__name__="node_power_supply_online"}'
        - '{__name__="node_processes_max_processes"}'
        - '{__name__="node_processes_max_threads"}'
        - '{__name__="node_processes_pids"}'
        - '{__name__="node_processes_state"}'
        - '{__name__="node_processes_threads"}'
        - '{__name__="node_procs_blocked"}'
        - '{__name__="node_procs_running"}'
        - '{__name__="node_schedstat_running_seconds_total"}'
        - '{__name__="node_schedstat_timeslices_total"}'
        - '{__name__="node_schedstat_waiting_seconds_total"}'
        - '{__name__="node_scrape_collector_duration_seconds"}'
        - '{__name__="node_scrape_collector_success"}'
        - '{__name__="node_sockstat_FRAG_inuse"}'
        - '{__name__="node_sockstat_FRAG_memory"}'
        - '{__name__="node_sockstat_RAW_inuse"}'
        - '{__name__="node_sockstat_TCP_alloc"}'
        - '{__name__="node_sockstat_TCP_inuse"}'
        - '{__name__="node_sockstat_TCP_mem"}'
        - '{__name__="node_sockstat_TCP_mem_bytes"}'
        - '{__name__="node_sockstat_TCP_orphan"}'
        - '{__name__="node_sockstat_TCP_tw"}'
        - '{__name__="node_sockstat_UDPLITE_inuse"}'
        - '{__name__="node_sockstat_UDP_inuse"}'
        - '{__name__="node_sockstat_UDP_mem"}'
        - '{__name__="node_sockstat_UDP_mem_bytes"}'
        - '{__name__="node_sockstat_sockets_used"}'
        - '{__name__="node_softnet_dropped_total"}'
        - '{__name__="node_softnet_processed_total"}'
        - '{__name__="node_softnet_times_squeezed_total"}'
        - '{__name__="node_systemd_socket_accepted_connections_total"}'
        - '{__name__="node_systemd_units"}'
        - '{__name__="node_textfile_scrape_error"}'
        - '{__name__="node_time_seconds"}'
        - '{__name__="node_timex_estimated_error_seconds"}'
        - '{__name__="node_timex_frequency_adjustment_ratio"}'
        - '{__name__="node_timex_loop_time_constant"}'
        - '{__name__="node_timex_maxerror_seconds"}'
        - '{__name__="node_timex_offset_seconds"}'
        - '{__name__="node_timex_sync_status"}'
        - '{__name__="node_timex_tai_offset_seconds"}'
        - '{__name__="node_timex_tick_seconds"}'
        - '{__name__="node_uname_info"}'
        - '{__name__="node_vmstat_oom_kill"}'
        - '{__name__="node_vmstat_pgfault"}'
        - '{__name__="node_vmstat_pgmajfault"}'
        - '{__name__="node_vmstat_pgpgin"}'
        - '{__name__="node_vmstat_pgpgout"}'
        - '{__name__="node_vmstat_pswpin"}'
        - '{__name__="node_vmstat_pswpout"}'
        - '{__name__="process_cpu_seconds_total"}'
        - '{__name__="process_max_fds"}'
        - '{__name__="process_open_fds"}'
        - '{__name__="process_resident_memory_max_bytes"}'
        - '{__name__="process_virtual_memory_bytes"}'
        - '{__name__="process_virtual_memory_max_bytes"}'
    static_configs:
      - targets:
          - 'nginx.wordwisdom.tw'
    basic_auth:
      username: "user"
      password: "password"

  # Consul for proxy nodes
  - job_name: 'proxy_node_exporter'
    scrape_interval: 1m
    metrics_path: /system_info/hardware_metrics
    consul_sd_configs:
      - server: 'consul.service.tw'
        datacenter: 'dc1'
        token: '675bb27b-3142-6308-6146-969a14fb7dd4'
        services:
          - proxy-node-exporter
    basic_auth:
      username: "user"
      password: "password"
    relabel_configs:
      - source_labels: [ '__meta_consul_service_id' ]
        replacement: '$1'
        target_label: hostname

remote_write:
  - url: 'http://influxdb:8086/api/v1/prom/write?db=prometheus&u=ibdo&p=ibdo2018'

remote_read:
  - url: 'http://influxdb:8086/api/v1/prom/read?db=prometheus&u=ibdo&p=ibdo2018'
```

### Ingress-controller

> https://hackmd.io/@willy83310/rkcaMtDwd

## Reference

[**kube-prometheus**](https://github.com/prometheus-operator/kube-prometheus)