---
tags: prometheus, WAMS
---

# k8s Monitoring Solution

[TOC]

![](https://i.imgur.com/1yWytDd.jpg)

## Main Components

### Prometheus-operator

> github: https://github.com/prometheus-operator/prometheus-operator

**Prometheus Operator** is a controller open-sourced by **CoreOS** for managing Prometheus on Kubernetes. It builds on Kubernetes custom resources, and its goal is to <font color='red'>**simplify deploying and maintaining Prometheus**</font>.

:::spoiler Prometheus-operator has the following features
1. **Kubernetes Custom Resources**: Use Kubernetes <font color='red'>custom resources</font> to deploy and manage Prometheus, Alertmanager, and related components.
2. **Simplified Deployment Configuration**: Configure the fundamentals of Prometheus like versions, persistence, retention policies, and replicas from a native Kubernetes resource.
3. **Prometheus Target Configuration**: Automatically generate monitoring target configurations based on familiar Kubernetes label queries; no need to learn a Prometheus specific configuration language.
:::

:::spoiler Prometheus-operator includes the following main components:
* `Prometheus`: which defines a desired Prometheus deployment.
* `Alertmanager`: which defines a desired Alertmanager deployment.
* `ThanosRuler`: which defines a desired Thanos Ruler deployment.
* `ServiceMonitor`: which declaratively specifies how groups of Kubernetes **services** should be monitored. The Operator <font color='red'>**automatically generates Prometheus scrape configuration**</font> based on the current state of the objects in the API server (see the sample manifest after this list).
* `PodMonitor`: which declaratively specifies how groups of **pods** should be monitored. The Operator <font color='red'>**automatically generates Prometheus scrape configuration**</font> based on the current state of the objects in the API server.
* `Probe`: which declaratively specifies how groups of ingresses or static targets should be monitored. The Operator automatically generates Prometheus scrape configuration based on the definition.
* `PrometheusRule`: which defines a desired set of Prometheus alerting and/or recording rules. The Operator generates a rule file, which can be used by Prometheus instances.
* `AlertmanagerConfig`: which declaratively specifies subsections of the Alertmanager configuration, allowing routing of alerts to custom receivers, and setting inhibit rules.
:::
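A minimal `ServiceMonitor` sketch to illustrate the declarative style; the `example-app` name, its `app` label, and the `web` port name are hypothetical placeholders, not part of this stack:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app          # hypothetical name, for illustration only
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-app       # scrape every matching Service carrying this label
  endpoints:
    - port: web              # must match a named port on the Service
      interval: 30s          # scrape every 30 seconds
```

The Operator watches objects like this and regenerates the Prometheus scrape configuration on the fly, so the in-cluster `prometheus.yml` never needs to be edited by hand.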
---

### Prometheus

> github: https://github.com/prometheus/prometheus

![image alt](https://camo.githubusercontent.com/f14ac82eda765733a5f2b5200d78b4ca84b62559d17c9835068423b223588939/68747470733a2f2f63646e2e6a7364656c6976722e6e65742f67682f70726f6d6574686575732f70726f6d65746865757340633334323537643036396336333036383564613335626365663038343633326666643564363230392f646f63756d656e746174696f6e2f696d616765732f6172636869746563747572652e737667)

:::spoiler How Prometheus differs from other monitoring systems
1. A multi-dimensional data model (time series defined by metric name and set of key/value dimensions)
2. <font color='red'>PromQL</font>, a powerful and flexible query language to leverage this dimensionality
3. No dependency on distributed storage; single server nodes are autonomous
4. An <font color='red'>HTTP pull model</font> for time series collection
5. Pushing time series is supported via an intermediary gateway (<font color='red'>push gateway</font>) for batch jobs
6. Targets are discovered via service discovery (e.g. **consul**) or static configuration (**prometheus.yml**)
7. Multiple modes of graphing and dashboarding support
8. Support for hierarchical and horizontal <font color='red'>federation</font>
:::

---

### Node-exporter

> github: https://github.com/prometheus/node_exporter

Prometheus exporter for **hardware and OS metrics** exposed by \*NIX kernels, written in Go with pluggable metric collectors.

---

### Kube-state-metrics

> github: https://github.com/kubernetes/kube-state-metrics
> more exposed metrics: https://github.com/kubernetes/kube-state-metrics/tree/master/docs

kube-state-metrics is a simple service that **listens to the Kubernetes API server** and generates metrics (e.g. `pod state`, `container state`, `endpoints`, `service`) about the state of the objects.

---

### Prometheus-Adapter

> github: https://github.com/kubernetes-sigs/prometheus-adapter

This repository contains an implementation of the Kubernetes [resource metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/resource-metrics-api.md), [custom metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md), and [external metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/external-metrics-api.md) APIs. It answers those APIs by translating each request into a PromQL query, which is what lets components such as the Horizontal Pod Autoscaler consume metrics that Prometheus already scrapes.
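As a rough sketch of how the adapter maps Prometheus series onto the custom metrics API, its config file takes rules like the following; the `http_requests_total` series is a hypothetical example, not something this stack necessarily exposes:

```yaml
# prometheus-adapter rule: expose a Prometheus counter as a custom metric
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # which series to consider
    resources:
      overrides:
        namespace: {resource: "namespace"}   # map series labels back to k8s objects
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"                  # exposed as http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

With a rule like this, an HPA could target `http_requests_per_second` on pods through the custom metrics API.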
---

### Ingress

#### Ingress resource

Defines the routing rules. Most of the configuration lives in `spec`, plus extra annotations whose keys differ depending on the underlying controller implementation (a minimal example follows at the end of this section).

#### Ingress controller

> Ingress controller types: https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/#additional-controllers

Responsible for implementing the underlying service; it dynamically rewrites the configuration inside the nginx pod according to the Ingress resource settings.
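A minimal Ingress sketch, assuming an nginx-class controller; the host `example.local`, the backend `example-service`, and the rewrite annotation are hypothetical placeholders:

```yaml
apiVersion: networking.k8s.io/v1        # requires Kubernetes v1.19+
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    # controller-specific behaviour is tuned through annotations
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - host: example.local               # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service   # hypothetical backend Service
                port:
                  number: 80
```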
---

## Installation

### Kube-prometheus

:::spoiler This package includes the following projects:
* Prometheus Operator
* Highly available Prometheus
* Highly available Alertmanager (deployed locally only; this service is not created on GKE)
* Prometheus node-exporter
* Prometheus Adapter for Kubernetes Metrics APIs
* Kube-state-metrics
* Grafana (deployed locally only; this service is not created on GKE)
:::

<br>

1. Check prerequisites

    The kubelet configuration must contain these flags:
    * `--authentication-token-webhook=true`
    * `--authorization-mode=Webhook`

2. Check compatibility

    | kube-prometheus stack | Kubernetes 1.16 | Kubernetes 1.17 | Kubernetes 1.18 | Kubernetes 1.19 | Kubernetes 1.20 |
    | --------------------- | --------------- | --------------- | --------------- | --------------- | --------------- |
    | `release-0.4` | ✔ (v1.16.5+) | ✔ | ✗ | ✗ | ✗ |
    | `release-0.5` | ✗ | ✗ | ✔ | ✗ | ✗ |
    | `release-0.6` | ✗ | ✗ | ✗ | ✔ | ✗ |
    | `release-0.7` | ✗ | ✗ | ✗ | ✔ | ✔ |
    | `HEAD` | ✗ | ✗ | ✗ | ✔ | ✔ |

3. Clone the kube-prometheus project
    ```bash=
    $ git clone https://github.com/prometheus-operator/kube-prometheus.git
    ```

4. Check out the project version that matches the compatibility matrix
    ```bash=
    $ git checkout <tag or commit SHA>
    ```

5. Remove the Grafana and Alertmanager YAML files
    ```bash=
    $ cd kube-prometheus
    $ rm -f manifests/alertmanager-*
    $ rm -f manifests/grafana-*
    ```

6. Create the monitoring stack using the config in the manifests directory
    ```bash= !
    # Create the namespace and CRDs, and then wait for them to be available before creating the remaining resources
    $ kubectl create -f manifests/setup
    $ until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
    $ kubectl create -f manifests/
    ```
    To tear down the stack:
    ```bash=
    $ kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
    ```

7. (TEST, optional) Access the dashboards by port-forwarding
    * Prometheus
        ```bash !
        $ kubectl --namespace monitoring port-forward svc/prometheus-k8s 9090
        ```
        Then access via http://{HOST_IP}:9090
    * Grafana
        ```bash !
        $ kubectl --namespace monitoring port-forward svc/grafana 3000
        ```
        Then access via http://localhost:3000 and use the default grafana `user:password` of `admin:admin`.
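Once the stack is running, alerting and recording rules are managed declaratively through the Operator's `PrometheusRule` objects instead of rule files on disk. A minimal sketch, assuming the default kube-prometheus rule selector labels (`prometheus: k8s`, `role: alert-rules`); the alert itself is a hypothetical example:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert-rules    # hypothetical name, for illustration only
  namespace: monitoring
  labels:
    prometheus: k8s            # kube-prometheus Prometheus instances select
    role: alert-rules          # rules by these labels
spec:
  groups:
    - name: node.rules
      rules:
        - alert: HostHighLoad  # hypothetical alert
          expr: node_load5 > 10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High 5m load average on {{ $labels.instance }}"
```

The Operator renders objects like this into a rule file and wires it into the running Prometheus, so `kubectl apply` is the whole deployment step.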
### Consul

> https://hackmd.io/Uew-BqRWQz6qxSix47SMkg

### Prometheus

> prometheus stacks: http://ibdo.efoxconn.com:5000/QAD/prometheus-stack

docker-compose.yml
```yaml= !
version: "3.1"

services:
  grafana:
    image: grafana/grafana:7.3.5
    container_name: grafana
    ports:
      - '3000:3000'
    environment:
      - GF_SECURITY_ADMIN_USER=ibdo           # grafana user
      - GF_SECURITY_ADMIN_PASSWORD=ibdo2018   # grafana password
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/   # holds Grafana dashboard JSON files
    restart: always
    depends_on:
      - prometheus

  prometheus:
    image: prom/prometheus:v2.23.0
    container_name: prometheus
    ports:
      - '9090:9090'
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'   # path to the Prometheus config
      - '--storage.tsdb.retention.time=2h'               # keep only two hours of local data
      - '--web.enable-lifecycle'                         # allow config reload via HTTP POST
    volumes:
      - ./prometheus/config/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    restart: always
    depends_on:
      - influxdb

  influxdb:
    image: influxdb:1.8.3
    container_name: influxdb
    ports:
      - '8086:8086'
    environment:
      - INFLUXDB_DB=prometheus
      - INFLUXDB_ADMIN_USER=ibdo
      - INFLUXDB_ADMIN_PASSWORD=ibdo2018
      - INFLUXDB_DATA_MAX_SERIES_PER_DATABASE=0   # max series allowed per database before writes are dropped (0 = unlimited)
    restart: always
    volumes:
      - influxdb_data:/var/lib/influxdb

volumes:
  prometheus_data: {}
  grafana_data: {}
  influxdb_data: {}
```
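The compose file mounts `./grafana/provisioning/` into the Grafana container, which is where datasources and dashboards can be pre-registered. A minimal datasource file, as a sketch; the path and the `prometheus` hostname assume the default compose network and the layout above:

```yaml
# ./grafana/provisioning/datasources/prometheus.yml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana proxies queries server-side
    url: http://prometheus:9090   # compose service name resolves on the shared network
    isDefault: true
```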
prometheus.yml
```yaml=
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  # GKE Kube-prometheus
  - job_name: 'federate'
    scrape_interval: 5m
    scrape_timeout: 3m
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # kube_prometheus_metrics
        - '{__name__="cass_jvm_heap"}'
        - '{__name__="cass_jvm_heap_max"}'
        - '{__name__="cass_jvm_noheap"}'
        - '{__name__="cass_jvm_noheap_max"}'
        - '{__name__="container_cpu_usage_seconds_total"}'
        - '{__name__="container_fs_limit_bytes"}'
        - '{__name__="container_fs_usage_bytes"}'
        - '{__name__="container_memory_rss"}'
        - '{__name__="container_memory_working_set_bytes"}'
        - '{__name__="container_network_receive_bytes_total"}'
        - '{__name__="container_network_transmit_bytes_total"}'
        - '{__name__="kube_configmap_info"}'
        - '{__name__="kube_namespace_labels"}'
        - '{__name__="kube_node_info"}'
        - '{__name__="kube_node_status_allocatable_cpu_cores"}'
        - '{__name__="kube_node_status_capacity_cpu_cores"}'
        - '{__name__="kube_node_status_capacity_memory_bytes"}'
        - '{__name__="kube_node_status_capacity_pods"}'
        - '{__name__="kube_node_status_condition"}'
        - '{__name__="kube_pod_container_info"}'
        - '{__name__="kube_pod_container_resource_limits_cpu_cores"}'
        - '{__name__="kube_pod_container_resource_limits_memory_bytes"}'
        - '{__name__="kube_pod_container_resource_requests_cpu_cores"}'
        - '{__name__="kube_pod_container_resource_requests_memory_bytes"}'
        - '{__name__="kube_pod_container_status_restarts_total"}'
        - '{__name__="kube_pod_info"}'
        - '{__name__="kube_pod_status_phase"}'
        - '{__name__="kube_secret_info"}'
        - '{__name__="kube_service_info"}'
        - '{__name__="machine_cpu_cores"}'
        - '{__name__="machine_memory_bytes"}'
        - '{__name__="origin_prometheus"}'
        # node_exporter_full_metrics
        - '{__name__="node_arp_entries"}'
        - '{__name__="node_context_switches_total"}'
        - '{__name__="node_cooling_device_cur_state"}'
        - '{__name__="node_cooling_device_max_state"}'
        - '{__name__="node_cpu_seconds_total"}'
        - '{__name__="node_disk_discard_time_seconds_total"}'
        - '{__name__="node_disk_discards_completed_total"}'
        - '{__name__="node_disk_discards_merged_total"}'
        - '{__name__="node_disk_io_now"}'
        - '{__name__="node_disk_io_time_seconds_total"}'
        - '{__name__="node_disk_io_time_weighted_seconds_total"}'
        - '{__name__="node_disk_read_bytes_total"}'
        - '{__name__="node_disk_read_time_seconds_total"}'
        - '{__name__="node_disk_reads_completed_total"}'
        - '{__name__="node_disk_reads_merged_total"}'
        - '{__name__="node_disk_write_time_seconds_total"}'
        - '{__name__="node_disk_writes_completed_total"}'
        - '{__name__="node_disk_writes_merged_total"}'
        - '{__name__="node_disk_written_bytes_total"}'
        - '{__name__="node_entropy_available_bits"}'
        - '{__name__="node_filefd_allocated"}'
        - '{__name__="node_filefd_maximum"}'
        - '{__name__="node_filesystem_avail_bytes"}'
        - '{__name__="node_filesystem_device_error"}'
        - '{__name__="node_filesystem_files"}'
        - '{__name__="node_filesystem_files_free"}'
        - '{__name__="node_filesystem_free_bytes"}'
        - '{__name__="node_filesystem_readonly"}'
        - '{__name__="node_filesystem_size_bytes"}'
        - '{__name__="node_forks_total"}'
        - '{__name__="node_hwmon_temp_celsius"}'
        - '{__name__="node_hwmon_temp_crit_alarm_celsius"}'
        - '{__name__="node_hwmon_temp_crit_celsius"}'
        - '{__name__="node_hwmon_temp_crit_hyst_celsius"}'
        - '{__name__="node_hwmon_temp_max_celsius"}'
        - '{__name__="node_interrupts_total"}'
        - '{__name__="node_intr_total"}'
        - '{__name__="node_load1"}'
        - '{__name__="node_load15"}'
        - '{__name__="node_load5"}'
        - '{__name__="node_memory_Active_anon_bytes"}'
        - '{__name__="node_memory_Active_bytes"}'
        - '{__name__="node_memory_Active_file_bytes"}'
        - '{__name__="node_memory_AnonHugePages_bytes"}'
        - '{__name__="node_memory_AnonPages_bytes"}'
        - '{__name__="node_memory_Bounce_bytes"}'
        - '{__name__="node_memory_Buffers_bytes"}'
        - '{__name__="node_memory_Cached_bytes"}'
        - '{__name__="node_memory_CommitLimit_bytes"}'
        - '{__name__="node_memory_Committed_AS_bytes"}'
        - '{__name__="node_memory_DirectMap1G_bytes"}'
        - '{__name__="node_memory_DirectMap2M_bytes"}'
        - '{__name__="node_memory_DirectMap4k_bytes"}'
        - '{__name__="node_memory_Dirty_bytes"}'
        - '{__name__="node_memory_HardwareCorrupted_bytes"}'
        - '{__name__="node_memory_HugePages_Free"}'
        - '{__name__="node_memory_HugePages_Rsvd"}'
        - '{__name__="node_memory_HugePages_Surp"}'
        - '{__name__="node_memory_HugePages_Total"}'
        - '{__name__="node_memory_Hugepagesize_bytes"}'
        - '{__name__="node_memory_Inactive_anon_bytes"}'
        - '{__name__="node_memory_Inactive_bytes"}'
        - '{__name__="node_memory_Inactive_file_bytes"}'
        - '{__name__="node_memory_KernelStack_bytes"}'
        - '{__name__="node_memory_Mapped_bytes"}'
        - '{__name__="node_memory_MemFree_bytes"}'
        - '{__name__="node_memory_MemTotal_bytes"}'
        - '{__name__="node_memory_Mlocked_bytes"}'
        - '{__name__="node_memory_NFS_Unstable_bytes"}'
        - '{__name__="node_memory_PageTables_bytes"}'
        - '{__name__="node_memory_Percpu_bytes"}'
        - '{__name__="node_memory_SReclaimable_bytes"}'
        - '{__name__="node_memory_SUnreclaim_bytes"}'
        - '{__name__="node_memory_ShmemHugePages_bytes"}'
        - '{__name__="node_memory_ShmemPmdMapped_bytes"}'
        - '{__name__="node_memory_Shmem_bytes"}'
        - '{__name__="node_memory_Slab_bytes"}'
        - '{__name__="node_memory_SwapCached_bytes"}'
        - '{__name__="node_memory_SwapTotal_bytes"}'
        - '{__name__="node_memory_Unevictable_bytes"}'
        - '{__name__="node_memory_VmallocChunk_bytes"}'
        - '{__name__="node_memory_VmallocTotal_bytes"}'
        - '{__name__="node_memory_VmallocUsed_bytes"}'
        - '{__name__="node_memory_WritebackTmp_bytes"}'
        - '{__name__="node_memory_Writeback_bytes"}'
        - '{__name__="node_netstat_Icmp_InErrors"}'
        - '{__name__="node_netstat_Icmp_InMsgs"}'
        - '{__name__="node_netstat_Icmp_OutMsgs"}'
        - '{__name__="node_netstat_IpExt_InOctets"}'
        - '{__name__="node_netstat_IpExt_OutOctets"}'
        - '{__name__="node_netstat_Ip_Forwarding"}'
        - '{__name__="node_netstat_TcpExt_ListenDrops"}'
        - '{__name__="node_netstat_TcpExt_ListenOverflows"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesFailed"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesRecv"}'
        - '{__name__="node_netstat_TcpExt_SyncookiesSent"}'
        - '{__name__="node_netstat_TcpExt_TCPSynRetrans"}'
        - '{__name__="node_netstat_Tcp_ActiveOpens"}'
        - '{__name__="node_netstat_Tcp_CurrEstab"}'
        - '{__name__="node_netstat_Tcp_InErrs"}'
        - '{__name__="node_netstat_Tcp_InSegs"}'
        - '{__name__="node_netstat_Tcp_MaxConn"}'
        - '{__name__="node_netstat_Tcp_OutSegs"}'
        - '{__name__="node_netstat_Tcp_PassiveOpens"}'
        - '{__name__="node_netstat_Tcp_RetransSegs"}'
        - '{__name__="node_netstat_UdpLite_InErrors"}'
        - '{__name__="node_netstat_Udp_InDatagrams"}'
        - '{__name__="node_netstat_Udp_InErrors"}'
        - '{__name__="node_netstat_Udp_NoPorts"}'
        - '{__name__="node_netstat_Udp_OutDatagrams"}'
        - '{__name__="node_netstat_Udp_RcvbufErrors"}'
        - '{__name__="node_netstat_Udp_SndbufErrors"}'
        - '{__name__="node_network_carrier"}'
        - '{__name__="node_network_mtu_bytes"}'
        - '{__name__="node_network_receive_bytes_total"}'
        - '{__name__="node_network_receive_compressed_total"}'
        - '{__name__="node_network_receive_drop_total"}'
        - '{__name__="node_network_receive_errs_total"}'
        - '{__name__="node_network_receive_fifo_total"}'
        - '{__name__="node_network_receive_frame_total"}'
        - '{__name__="node_network_receive_multicast_total"}'
        - '{__name__="node_network_receive_packets_total"}'
        - '{__name__="node_network_speed_bytes"}'
        - '{__name__="node_network_transmit_bytes_total"}'
        - '{__name__="node_network_transmit_carrier_total"}'
        - '{__name__="node_network_transmit_colls_total"}'
        - '{__name__="node_network_transmit_compressed_total"}'
        - '{__name__="node_network_transmit_drop_total"}'
        - '{__name__="node_network_transmit_errs_total"}'
        - '{__name__="node_network_transmit_fifo_total"}'
        - '{__name__="node_network_transmit_packets_total"}'
        - '{__name__="node_network_transmit_queue_length"}'
        - '{__name__="node_network_up"}'
        - '{__name__="node_nf_conntrack_entries"}'
        - '{__name__="node_nf_conntrack_entries_limit"}'
        - '{__name__="node_power_supply_online"}'
        - '{__name__="node_processes_max_processes"}'
        - '{__name__="node_processes_max_threads"}'
        - '{__name__="node_processes_pids"}'
        - '{__name__="node_processes_state"}'
        - '{__name__="node_processes_threads"}'
        - '{__name__="node_procs_blocked"}'
        - '{__name__="node_procs_running"}'
        - '{__name__="node_schedstat_running_seconds_total"}'
        - '{__name__="node_schedstat_timeslices_total"}'
        - '{__name__="node_schedstat_waiting_seconds_total"}'
        - '{__name__="node_scrape_collector_duration_seconds"}'
        - '{__name__="node_scrape_collector_success"}'
        - '{__name__="node_sockstat_FRAG_inuse"}'
        - '{__name__="node_sockstat_FRAG_memory"}'
        - '{__name__="node_sockstat_RAW_inuse"}'
        - '{__name__="node_sockstat_TCP_alloc"}'
        - '{__name__="node_sockstat_TCP_inuse"}'
        - '{__name__="node_sockstat_TCP_mem"}'
        - '{__name__="node_sockstat_TCP_mem_bytes"}'
        - '{__name__="node_sockstat_TCP_orphan"}'
        - '{__name__="node_sockstat_TCP_tw"}'
        - '{__name__="node_sockstat_UDPLITE_inuse"}'
        - '{__name__="node_sockstat_UDP_inuse"}'
        - '{__name__="node_sockstat_UDP_mem"}'
        - '{__name__="node_sockstat_UDP_mem_bytes"}'
        - '{__name__="node_sockstat_sockets_used"}'
        - '{__name__="node_softnet_dropped_total"}'
        - '{__name__="node_softnet_processed_total"}'
        - '{__name__="node_softnet_times_squeezed_total"}'
        - '{__name__="node_systemd_socket_accepted_connections_total"}'
        - '{__name__="node_systemd_units"}'
        - '{__name__="node_textfile_scrape_error"}'
        - '{__name__="node_time_seconds"}'
        - '{__name__="node_timex_estimated_error_seconds"}'
        - '{__name__="node_timex_frequency_adjustment_ratio"}'
        - '{__name__="node_timex_loop_time_constant"}'
        - '{__name__="node_timex_maxerror_seconds"}'
        - '{__name__="node_timex_offset_seconds"}'
        - '{__name__="node_timex_sync_status"}'
        - '{__name__="node_timex_tai_offset_seconds"}'
        - '{__name__="node_timex_tick_seconds"}'
        - '{__name__="node_uname_info"}'
        - '{__name__="node_vmstat_oom_kill"}'
        - '{__name__="node_vmstat_pgfault"}'
        - '{__name__="node_vmstat_pgmajfault"}'
        - '{__name__="node_vmstat_pgpgin"}'
        - '{__name__="node_vmstat_pgpgout"}'
        - '{__name__="node_vmstat_pswpin"}'
        - '{__name__="node_vmstat_pswpout"}'
        - '{__name__="process_cpu_seconds_total"}'
        - '{__name__="process_max_fds"}'
        - '{__name__="process_open_fds"}'
        - '{__name__="process_resident_memory_max_bytes"}'
        - '{__name__="process_virtual_memory_bytes"}'
        - '{__name__="process_virtual_memory_max_bytes"}'
    static_configs:
      - targets:
          - 'nginx.wordwisdom.tw'
    basic_auth:
      username: "user"
      password: "password"

  # Consul for proxy nodes
  - job_name: 'proxy_node_exporter'
    scrape_interval: 1m
    metrics_path: /system_info/hardware_metrics
    consul_sd_configs:
      - server: 'consul.service.tw'
        datacenter: 'dc1'
        token: '675bb27b-3142-6308-6146-969a14fb7dd4'
        services:
          - proxy-node-exporter
    basic_auth:
      username: "user"
      password: "password"
    relabel_configs:
      - source_labels: [ '__meta_consul_service_id' ]
        replacement: '$1'
        target_label: hostname

remote_write:
  - url: 'http://influxdb:8086/api/v1/prom/write?db=prometheus&u=ibdo&p=ibdo2018'

remote_read:
  - url: 'http://influxdb:8086/api/v1/prom/read?db=prometheus&u=ibdo&p=ibdo2018'
```

### Ingress-controller

> https://hackmd.io/@willy83310/rkcaMtDwd

## Reference

[**kube-prometheus**](https://github.com/prometheus-operator/kube-prometheus)