
DevOps Training Session 16: Setup Observability for Kubernetes

tags: devops research tutorials


Monitoring and Observability for the System

Note

To monitor and observe a Kubernetes cluster, you have many options nowadays to handle the configuration, such as

  • Grafana
  • Prometheus
  • Loki
  • Tempo
  • Pyroscope
  • Promtail

Each tool has a different purpose:

  • Aggregate logs and detect errors coming from the system and applications with Loki
  • Debug memory leaks and performance issues with Pyroscope
  • Trace application requests with Tempo
  • Collect metrics about the system, workloads, and the CPU/memory of the cluster with Prometheus
  • Grafana can be used to visualize all of the above information as dashboards and panels, and to create alerts based on your rules that announce problems in the system

Install and set up the MnO system


Most components of the MnO system are open-source projects, and they provide Helm charts that make it easy to set them up in your cluster. You can check each of them out on their project pages.

Tip

You have multiple ways to apply these charts to your cluster. Ideally, you can drive Helm from Terraform through the helm_release resource of the Helm provider, which works remarkably well.
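A minimal sketch of wiring the Helm provider to an AKS cluster; the azurerm_kubernetes_cluster data source named "aks" is a placeholder, not part of this setup:

```hcl
# Hypothetical provider wiring; the "aks" data source name is an assumption.
provider "helm" {
  kubernetes {
    host                   = data.azurerm_kubernetes_cluster.aks.kube_config[0].host
    client_certificate     = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].client_certificate)
    client_key             = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].client_key)
    cluster_ca_certificate = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].cluster_ca_certificate)
  }
}
```

With the provider in place, each helm_release below can be applied with a plain terraform apply.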

For example, you can configure Loki like this:

loki/main.tf

resource "kubernetes_secret" "loki_storage_account" {
  metadata {
    name      = "storage-account-key"
    namespace = "monitoring"
  }

  data = {
    STORAGE_ACCOUNT_KEY  = var.remote_state.loki_storage_account_key
    STORAGE_ACCOUNT_NAME = var.remote_state.loki_storage_account_name
  }
}

resource "helm_release" "loki" {
  name             = "loki"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "loki"
  version          = "2.13.3"
  create_namespace = true
  namespace        = "monitoring"

  values = [
    <<EOF
extraArgs:
  config.expand-env: true
extraEnvFrom:
  - secretRef:
      name: storage-account-key
rbac:
  pspEnabled: false
nodeSelector:
  pool: infrapool
config:
  server:
    grpc_server_max_recv_msg_size: ${var.max_recv_sent_msg_size_server}
    grpc_server_max_send_msg_size: ${var.max_recv_sent_msg_size_server}
  schema_config:
    configs:
      - from: 2022-07-18
        store: boltdb-shipper
        object_store: azure
        schema: v11
        index:
          prefix: loki_index_
          period: 24h
  querier:
    query_timeout: 5m
    engine:
      timeout: 8m
  storage_config:
    boltdb_shipper:
      shared_store: azure
      active_index_directory: /data/loki/boltdb-shipper-active
      cache_location: /data/loki/boltdb-shipper-cache
      cache_ttl: 24h
    filesystem:
      directory: /data/loki/chunks
    # To configure the long-term storage to be Azure Blob Storage
    azure:
      # Name of container under the storage account
      container_name: loki
      # Name of storage account
      account_name: $${STORAGE_ACCOUNT_NAME}
      # Storage account key as an env from secret
      account_key: $${STORAGE_ACCOUNT_KEY}
      request_timeout: 0
  # Needed for Alerting: https://grafana.com/docs/loki/latest/rules/
  # This is just a simple example, for more details: https://grafana.com/docs/loki/latest/configuration/#ruler_config
  ruler:
    storage:
      type: local
      local:
        directory: /tmp/rules
    rule_path: /tmp/scratch
    alertmanager_url: http://kube-prometheus-stack-alertmanager.monitoring.svc:9093
    enable_alertmanager_v2: true
    ring:
      kvstore:
        store: inmemory
    enable_api: true
alerting_groups:
  - name: example
    rules:
      - alert: HighThroughputLogStreams
        expr: sum by(container) (rate({job=~"loki-dev/.*"}[1m])) > 1000
        for: 2m
EOF
  ]

  depends_on = [
    kubernetes_secret.loki_storage_account
  ]
}

After applying the Terraform, your MnO components will be released into the monitoring namespace.

You can apply the same idea to Grafana, Prometheus, Promtail, Tempo, and Pyroscope:

kube_stack_prometheus/main.tf

resource "helm_release" "kube-prometheus-stack" {
  name             = "kube-prometheus-stack"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  version          = "41.1.0"
  create_namespace = true
  namespace        = "monitoring"

  values = [
    <<EOF
# Since AKS is a managed Kubernetes service, it doesn't allow you to see internal components such as the etcd store, the controller manager, the scheduler, etc.
# See https://techcommunity.microsoft.com/t5/apps-on-azure-blog/using-azure-kubernetes-service-with-grafana-and-prometheus/ba-p/3020459
kubeEtcd:
  enabled: false
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
# State metrics
kube-state-metrics:
  image:
    repository: registry.k8s.io/kube-state-metrics/kube-state-metrics
    tag: v2.10.0
  nodeSelector:
    pool: infrapool
# Prometheus operator
prometheusOperator:
  nodeSelector:
    pool: infrapool
# Alertmanager
alertmanager:
  enabled: false
  # alertmanagerSpec:
  #   nodeSelector:
  #     pool: infrapool
  #   storage:
  #     volumeClaimTemplate:
  #       spec:
  #         accessModes: ["ReadWriteOnce"]
  #         resources:
  #           requests:
  #             storage: ${var.alertmanager_storage_size}
# Prometheus
prometheus:
  prometheusSpec:
    nodeSelector:
      pool: infrapool
    storageSpec:
      volumeClaimTemplate:
        spec:
          # Use Azure Disk or Azure files (NFS) for Prometheus
          storageClassName: default
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: ${var.prometheus_storage_size}
    additionalScrapeConfigs:
      - job_name: 'mongodb-exporter'
        static_configs:
          - targets: ['mongo-agent-prometheus-mongodb-exporter.monitoring.svc:9216']
      - job_name: 'nginx-ingress-metrics'
        static_configs:
          - targets: ['ingress-nginx-controller-metrics.nginx-ingress.svc:10254']
      - job_name: 'rabbitmq-exporter'
        static_configs:
          - targets: ['rabbitmq.infrastructure.svc:15692']
      - job_name: 'es-exporter'
        static_configs:
          - targets: ['es-agent-prometheus-elasticsearch-exporter.monitoring.svc:9108']
      - job_name: 'redis-exporter'
        static_configs:
          - targets: ['redis-agent-prometheus-redis-exporter.monitoring.svc:9121']
    enableFeatures:
      - remote-write-receiver
# Grafana
grafana:
  image:
    tag: "10.0.0"
  adminPassword: ${random_string.grafana_admin_password.result}
  rbac:
    pspEnabled: false
  persistence:
    enabled: true
    size: ${var.grafana_storage_size}
  nodeSelector:
    pool: infrapool
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations: {}
    labels: {}
    path: /
    pathType: Prefix
    hosts:
      - ${var.monitoring_domain_config}
    extraPaths: []
    tls:
      - secretName: https-certificate-monitoring
        hosts:
          - ${var.monitoring_domain_config}
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://kube-prometheus-stack-prometheus:9090/
          access: proxy
          isDefault: false
          editable: true
        # - name: AlertManager
        #   type: alertmanager
        #   uid: alertmanager
        #   url: http://kube-stack-prometheus-kube-alertmanager.monitoring:9093/
        #   access: proxy
        #   isDefault: false
        #   editable: true
        #   jsonData:
        #     implementation: prometheus
        #     exemplarTraceIdDestinations:
        #       - name: trace_id
        #         datasourceUid: tempo
        #         urlDisplayLabel: View in tempo
        - name: loki
          type: loki
          uid: loki_svc
          url: http://loki.monitoring.svc:3100
          access: proxy
          isDefault: false
          editable: true
          jsonData:
            manageAlerts: true
            alertmanagerUid: alertmanager
            derivedFields:
              - datasourceUid: tempo
                matcherRegex: ((\w+){16})
                name: trace_id
                url: '$${__value.raw}'
        - name: tempo
          type: tempo
          uid: tempo
          url: http://tempo.monitoring.svc:3100
          access: proxy
          basicAuth: false
          isDefault: false
          editable: true
          jsonData:
            httpMethod: GET
            tracesToLogs:
              datasourceUid: "loki"
              filterByTraceID: true
              filterBySpanID: true
            lokiSearch:
              datasourceUid: 'loki'
EOF
  ]
}

promtail/main.tf

resource "helm_release" "promtail" {
  name             = "promtail"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "promtail"
  version          = "6.2.2"
  create_namespace = true
  namespace        = "monitoring"

  values = [
    <<EOF
config:
  clients:
    - url: http://loki:3100/loki/api/v1/push
  snippets:
    pipelineStages:
      - cri: {}
      # - labeldrop:
      #     - filename
      - match:
          selector: '{namespace="default"}'
          stages:
            - json:
                expressions:
                  TimeStamp: TimeStamp
                  level: Level
                  MessageTemplate: MessageTemplate
            - labels:
                level:
                MessageTemplate:
            - timestamp:
                source: TimeStamp
                format: RFC3339Nano
      - match:
          selector: '{namespace="monitoring"}'
          stages:
            - regex:
                expression: 'level=(?P<level>\w+)'
            - labels:
                level:
EOF
  ]
}

pyroscope/main.tf

resource "helm_release" "pyroscope" {
  name       = "pyroscope-agent"
  namespace  = "monitoring"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "pyroscope"
  version    = "1.4.0"

  values = [
    <<EOF
pyroscope:
  nodeSelector:
    pool: infrapool
  persistence:
    enabled: true
agent:
  enabled: false
EOF
  ]
}

tempo/main.tf

resource "helm_release" "tempo" {
  name             = "tempo"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "tempo"
  version          = "1.6.1"
  create_namespace = true
  namespace        = "monitoring"

  values = [
    <<EOF
nodeSelector:
  pool: infrapool
tempo:
  memBallastSizeMbs: 1024
  multitenancyEnabled: false
  reportingEnabled: true
  metricsGenerator:
    enabled: true
    remoteWriteUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090/api/v1/write"
  ingester: {}
  querier: {}
  queryFrontend: {}
  retention: 24h
  global_overrides:
    max_bytes_per_trace: ${var.max_bytes_per_trace}
    per_tenant_override_config: /conf/overrides.yaml
  overrides: {}
  server:
    http_listen_port: 3100
  storage:
    trace:
      blocklist_poll_tenant_index_builders: 1
      blocklist_poll_jitter_ms: 500
      backend: azure
      azure:
        container_name: tempo
        storage_account_name: ${var.remote_state.tempo_storage_account_name}
        storage_account_key: ${var.remote_state.tempo_storage_account_key}
      wal:
        path: /var/tempo/wal
  receivers:
    jaeger:
      protocols:
        grpc:
          endpoint: 0.0.0.0:14250
        thrift_binary:
          endpoint: 0.0.0.0:6832
        thrift_compact:
          endpoint: 0.0.0.0:6831
        thrift_http:
          endpoint: 0.0.0.0:14268
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"
        http:
          endpoint: "0.0.0.0:4318"
  extraArgs: {}
  extraEnv: []
  extraEnvFrom: []
  extraVolumeMounts: []
config: |
  multitenancy_enabled: {{ .Values.tempo.multitenancyEnabled }}
  usage_report:
    reporting_enabled: {{ .Values.tempo.reportingEnabled }}
  compactor:
    compaction:
      block_retention: {{ .Values.tempo.retention }}
  distributor:
    receivers:
      {{- toYaml .Values.tempo.receivers | nindent 8 }}
  ingester:
    {{- toYaml .Values.tempo.ingester | nindent 6 }}
  server:
    {{- toYaml .Values.tempo.server | nindent 6 }}
  storage:
    {{- toYaml .Values.tempo.storage | nindent 6 }}
  querier:
    {{- toYaml .Values.tempo.querier | nindent 6 }}
  query_frontend:
    {{- toYaml .Values.tempo.queryFrontend | nindent 6 }}
  overrides:
    {{- toYaml .Values.tempo.global_overrides | nindent 6 }}
  {{- if .Values.tempo.metricsGenerator.enabled }}
    metrics_generator_processors:
      - 'service-graphs'
      - 'span-metrics'
  metrics_generator:
    storage:
      path: "/tmp/tempo"
    remote_write:
      - url: {{ .Values.tempo.metricsGenerator.remoteWriteUrl }}
  {{- end }}
tempoQuery:
  enabled: true
  extraArgs: {}
  extraEnv: []
  extraVolumeMounts: []
# persistence:
#   enabled: false
#   # storageClassName: local-path
#   accessModes:
#     - ReadWriteOnce
#   size: 10Gi
EOF
  ]
}

Let's talk about Grafana

If you want to dive into Grafana, you will face lots of topics to learn, including

  • Dashboard
  • Alert
  • Authentication
  • Integrating with tons of data sources

Note

That flexibility makes Grafana a good option when you want to run the whole MnO system for free. You can also use Grafana Cloud for enterprise, or a managed offering from Azure/AWS, but that's up to you.


If you want to set up a full set of dashboards for AKS, it is worth looking at these dashboards and implementing them inside your cluster:

  1. Ingress Dashboard (if AKS uses Ingress): used to inspect request quality, latency, concurrency, …
  2. PostgreSQL Dashboard: Grafana provides a PostgreSQL data source that is permitted to query performance metrics from PostgreSQL, so you can leverage it to create a couple of dashboards yourself. Explore at Integration Performance Query for MySQL or PostgreSQL
  3. MongoDB Dashboard: Prometheus supports exporters, so you can install one to expose metrics from the MongoDB cluster; Prometheus scrapes them and you visualize them in Grafana. Explore at mongodb_exporter
  4. RabbitMQ Dashboard: same as MongoDB, you can install an exporter for RabbitMQ. Explore at Monitoring with Prometheus and Grafana
  5. Elasticsearch Dashboard: same as MongoDB and RabbitMQ, you can install an exporter for Elasticsearch. Explore at elasticsearch_exporter
  6. Redis Dashboard: you can also install an exporter for Redis. Explore at redis_exporter
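For the PostgreSQL case, the data source can be provisioned next to the Prometheus/Loki/Tempo entries in the Grafana chart values. A hedged sketch, where the host, database, user, and the Terraform variable name are all placeholders:

```yaml
# Hypothetical PostgreSQL data source entry; adjust host, database and credentials.
- name: postgres
  type: postgres
  url: postgres.infrastructure.svc:5432
  user: grafana_reader
  jsonData:
    database: app
    sslmode: 'disable'
  secureJsonData:
    password: ${var.postgres_reader_password}
```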

If you want to dive into alerting with Grafana, don't forget to check out my blog Deploy your alert with Grafana by Terraform and some common error with K8s.
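For reference, the Loki ruler configured above already carries one sample alert through the chart values; as a standalone rule file under /tmp/rules, the same rule would look like:

```yaml
groups:
  - name: example
    rules:
      - alert: HighThroughputLogStreams
        expr: sum by(container) (rate({job=~"loki-dev/.*"}[1m])) > 1000
        for: 2m
```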

Troubleshoot MnO System

Cannot read data

Error: occurs when the storage backing any component has problems.

Note

All components in the MnO system store their data on an azure-disk whose name contains the service name.

Troubleshoot: check the component on the AKS portal dashboard. Is the component attached to its azure-disk? Does the azure-disk for the service exist? Can you attach it to the service via the Helm chart values?
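Most of these charts expose the disk through persistence-style values; a generic sketch (the exact keys vary per chart, so treat this as an assumption to check against each chart's values.yaml):

```yaml
# Generic persistence block; key names differ between charts.
persistence:
  enabled: true
  # On AKS, the default storage class provisions an Azure Disk
  storageClassName: default
  size: 10Gi
```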

Disconnection from Prometheus


Error: Prometheus is restarting or not currently running.

Troubleshoot: wait ~30s for Prometheus to restart and query the dashboard again. If that doesn't help, use kubectl to check the status of Prometheus.

Queries are not valid or the data source has a problem


Error: occurs when using a wrong query or the data source does not respond.

Troubleshoot: check your queries again; if they are fine, use kubectl to look up what happened to the data source you are querying (e.g. Loki, Tempo, …).

Loki times out on long queries

Error: happens when the query time range is set too large, or the logs for that time range fall outside the retention window.

Troubleshoot: reduce the time range and narrow the query to exactly the logs you want to check.
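If a larger window is genuinely needed, the query timeouts can also be raised through the Loki chart values from loki/main.tf:

```yaml
# The querier/engine timeouts used in loki/main.tf; raise them if large
# time ranges are a regular need.
config:
  querier:
    query_timeout: 5m
    engine:
      timeout: 8m
```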

Cannot query metrics from an exporter

Error: occurs when the exporter is restarting or failing, or DNS resolution for scraping is not working.

Troubleshoot:

  • Check the status of the exporter: if it is restarting, wait a few seconds; if it is in a failed state, delete the pod to restart it
  • Double-check the scrape configuration in the Prometheus Helm chart; the DNS name of the exporter must match the pattern <name-of-service>.<namespace>.svc:<port-service>
# Prometheus
prometheus:
  prometheusSpec:
    nodeSelector:
      pool: infrapool
    storageSpec:
      volumeClaimTemplate:
        spec:
          # Use Azure Disk or Azure files (NFS) for Prometheus
          storageClassName: default
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: ${var.prometheus_storage_size}
    additionalScrapeConfigs:
    - job_name: 'mongodb-exporter'
      static_configs:
        - targets: ['mongo-agent-prometheus-mongodb-exporter.monitoring.svc:9216']
    - job_name: 'nginx-ingress-metrics'
      static_configs:
        - targets: ['ingress-nginx-controller-metrics.nginx-ingress.svc:10254']
    - job_name: 'rabbitmq-exporter'
      static_configs:
        - targets: ['rabbitmq.infrastructure.svc:15692']
    - job_name: 'es-exporter'
      static_configs:
        - targets: ['es-agent-prometheus-elasticsearch-exporter.monitoring.svc:9108']
    - job_name: 'redis-exporter'
      static_configs:
        - targets: ['redis-agent-prometheus-redis-exporter.monitoring.svc:9121']
    enableFeatures:
      - remote-write-receiver

Alerts announce NULL information

Error: occurs when an alert spams because its Loki query runs over the length limit.

Troubleshoot :

  • Turn on silence mode for the spamming alert
  • Restart or remove the alert when it fails with a confusing exception, then create it again
  • If the exception is about the query length limit, silence the alert and ignore it

Conclusion

Note

In this topic, you learned about

  • Monitoring and Observability for a system
  • Installing and setting up a Monitoring and Observability system
  • Grafana dashboards
  • Alerting with Grafana
  • Troubleshooting the Monitoring and Observability system

Reference