# DevOps Training Session 16: Set Up Observability for Kubernetes
###### tags: `devops` `research` `tutorials`
![image](https://hackmd.io/_uploads/Bk6id4Nrye.png)
## Monitoring and Observability for the System
>[!Note]
>For monitoring and observability of a Kubernetes cluster, you have many options nowadays, such as:
>
>- Grafana
>- Prometheus
>- Loki
>- Tempo
>- Pyroscope
>- Promtail

Each tool has a different use:
- Collect and search logs and errors coming from the system and applications with **Loki**
- Ship logs from every node to Loki with **Promtail**
- Debug memory leaks and performance issues with **Pyroscope**
- Trace application requests with **Tempo**
- Collect metrics about the *system, workloads and CPU/memory* of the cluster with **Prometheus**
- Visualize all of the above information in dashboards and panels with **Grafana**, and create alerts based on your rules that will announce problems in the system
## Install and Set Up the MnO System
![image](https://hackmd.io/_uploads/ByrutE4Hyg.png)
Most components of the MnO (Monitoring and Observability) system are open-source projects, and they provide a `helm-chart` that helps you set them up easily in your cluster. You can check them out at:
- Grafana + Prometheus: [Chart](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack)
- Loki: [Chart](https://artifacthub.io/packages/helm/grafana/loki)
- Promtail: [Chart](https://artifacthub.io/packages/helm/grafana/promtail)
- Tempo: [Chart](https://artifacthub.io/packages/helm/grafana/tempo)
- Pyroscope: [Chart](https://artifacthub.io/packages/helm/grafana/pyroscope)
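Before overriding anything, it is worth pulling each chart's default values with the `helm` CLI. A quick sketch (the repository URLs are the public Grafana and prometheus-community chart repos, and the versions match the ones pinned in the Terraform below):
```bash
# Add the chart repositories used throughout this post
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Inspect the default values of a chart before writing your own overrides
helm show values grafana/loki --version 2.13.3
helm show values prometheus-community/kube-prometheus-stack --version 41.1.0
```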
>[!Tip]
>You have multiple ways to apply these charts to your cluster; a good option is to use `helm` together with `terraform` through the `helm` provider's `helm_release` resource, which is genuinely convenient.

For example, you can configure `loki` like this:
`loki/main.tf`
```terraform!=
resource "kubernetes_secret" "loki_storage_account" {
metadata {
name = "storage-account-key"
namespace = "monitoring"
}
data = {
STORAGE_ACCOUNT_KEY = var.remote_state.loki_storage_account_key
STORAGE_ACCOUNT_NAME = var.remote_state.loki_storage_account_name
}
}
resource "helm_release" "loki" {
name = "loki"
repository = "https://grafana.github.io/helm-charts"
chart = "loki"
version = "2.13.3"
create_namespace = true
namespace = "monitoring"
values = [
<<EOF
extraArgs:
config.expand-env: true
extraEnvFrom:
- secretRef:
name: storage-account-key
rbac:
pspEnabled: false
nodeSelector:
pool: infrapool
config:
server:
grpc_server_max_recv_msg_size: ${var.max_recv_sent_msg_size_server}
grpc_server_max_send_msg_size: ${var.max_recv_sent_msg_size_server}
schema_config:
configs:
- from: 2022-07-18
store: boltdb-shipper
object_store: azure
schema: v11
index:
prefix: loki_index_
period: 24h
querier:
query_timeout: 5m
engine:
timeout: 8m
storage_config:
boltdb_shipper:
shared_store: azure
active_index_directory: /data/loki/boltdb-shipper-active
cache_location: /data/loki/boltdb-shipper-cache
cache_ttl: 24h
filesystem:
directory: /data/loki/chunks
# To configure the long-term storage to be Azure Blob Storage
azure:
# Name of container under the storage account
container_name: loki
# Name of storage account
account_name: $${STORAGE_ACCOUNT_NAME}
# Storage account key as an env from secret
account_key: $${STORAGE_ACCOUNT_KEY}
request_timeout: 0
# Needed for Alerting: https://grafana.com/docs/loki/latest/rules/
# This is just a simple example, for more details: https://grafana.com/docs/loki/latest/configuration/#ruler_config
ruler:
storage:
type: local
local:
directory: /tmp/rules
rule_path: /tmp/scratch
alertmanager_url: http://kube-prometheus-stack-alertmanager.monitoring.svc:9093
enable_alertmanager_v2: true
ring:
kvstore:
store: inmemory
enable_api: true
alerting_groups:
- name: example
rules:
- alert: HighThroughputLogStreams
expr: sum by(container) (rate({job=~"loki-dev/.*"}[1m])) > 1000
for: 2m
EOF
]
depends_on = [
kubernetes_secret.loki_storage_account
]
}
```
After applying Terraform, your MnO components will be released into the `monitoring` namespace.
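A quick way to verify the release, assuming your kubeconfig already points at the cluster, is a sketch like:
```bash
# Apply the Loki module, then confirm the release landed in the monitoring namespace
terraform init && terraform apply

helm list -n monitoring
kubectl get pods -n monitoring | grep loki
kubectl get pvc -n monitoring
```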
You can handle `Grafana`, `Prometheus`, `Promtail`, `Tempo` and `Pyroscope` with the same idea:
`kube_stack_prometheus/main.tf`
```terraform!=
resource "helm_release" "kube-prometheus-stack" {
name = "kube-prometheus-stack"
repository = "https://prometheus-community.github.io/helm-charts"
chart = "kube-prometheus-stack"
version = "41.1.0"
create_namespace = true
namespace = "monitoring"
values = [
<<EOF
# Since AKS is a managed Kubernetes service, it doesn’t allow you to see internal components such as the etcd store, the controller manager, the scheduler, etc.
# See https://techcommunity.microsoft.com/t5/apps-on-azure-blog/using-azure-kubernetes-service-with-grafana-and-prometheus/ba-p/3020459
kubeEtcd:
enabled: false
kubeControllerManager:
enabled: false
kubeScheduler:
enabled: false
kubeProxy:
enabled: false
# State metrics
kube-state-metrics:
image:
repository: registry.k8s.io/kube-state-metrics/kube-state-metrics
tag: v2.10.0
nodeSelector:
pool: infrapool
# Prometheus operator
prometheusOperator:
nodeSelector:
pool: infrapool
# Alertmanager
alertmanager:
enabled: false
# alertmanagerSpec:
# nodeSelector:
# pool: infrapool
# storage:
# volumeClaimTemplate:
# spec:
# accessModes: ["ReadWriteOnce"]
# resources:
# requests:
# storage: ${var.alertmanager_storage_size}
# Prometheus
prometheus:
prometheusSpec:
nodeSelector:
pool: infrapool
storageSpec:
volumeClaimTemplate:
spec:
# Use Azure Disk or Azure files (NFS) for Prometheus
storageClassName: default
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: ${var.prometheus_storage_size}
additionalScrapeConfigs:
- job_name: 'mongodb-exporter'
static_configs:
- targets: ['mongo-agent-prometheus-mongodb-exporter.monitoring.svc:9216']
- job_name: 'nginx-ingress-metrics'
static_configs:
- targets: ['ingress-nginx-controller-metrics.nginx-ingress.svc:10254']
- job_name: 'rabbitmq-exporter'
static_configs:
- targets: ['rabbitmq.infrastructure.svc:15692']
- job_name: 'es-exporter'
static_configs:
- targets: ['es-agent-prometheus-elasticsearch-exporter.monitoring.svc:9108']
- job_name: 'redis-exporter'
static_configs:
- targets: ['redis-agent-prometheus-redis-exporter.monitoring.svc:9121']
enableFeatures:
- remote-write-receiver
# Grafana
grafana:
image:
tag: "10.0.0"
adminPassword: ${random_string.grafana_admin_password.result}
rbac:
pspEnabled: false
persistence:
enabled: true
size: ${var.grafana_storage_size}
nodeSelector:
pool: infrapool
ingress:
enabled: true
ingressClassName: nginx
annotations: {}
labels: {}
path: /
pathType: Prefix
hosts:
- ${var.monitoring_domain_config}
extraPaths: []
tls:
- secretName: https-certificate-monitoring
hosts:
- ${var.monitoring_domain_config}
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
url: http://kube-prometheus-stack-prometheus:9090/
access: proxy
isDefault: false
editable: true
# - name: AlertManager
# type: alertmanager
# uid: alertmanager
# url: http://kube-stack-prometheus-kube-alertmanager.monitoring:9093/
# access: proxy
# isDefault: false
# editable: true
# jsonData:
# implementation: prometheus
# exemplarTraceIdDestinations:
# - name: trace_id
# datasourceUid: tempo
# urlDisplayLabel: View in tempo
- name: loki
type: loki
uid: loki_svc
url: http://loki.monitoring.svc:3100
access: proxy
isDefault: false
editable: true
jsonData:
manageAlerts: true
alertmanagerUid: alertmanager
derivedFields:
- datasourceUid: tempo
matcherRegex: ((\w+){16})
name: trace_id
url: '$${__value.raw}'
- name: tempo
type: tempo
uid: tempo
url: http://tempo.monitoring.svc:3100
access: proxy
basicAuth: false
isDefault: false
editable: true
jsonData:
httpMethod: GET
tracesToLogs:
datasourceUid: "loki"
filterByTraceID: true
filterBySpanID: true
lokiSearch:
datasourceUid: 'loki'
EOF
]
}
```
`promtail/main.tf`
```terraform!=
resource "helm_release" "promtail" {
name = "promtail"
repository = "https://grafana.github.io/helm-charts"
chart = "promtail"
version = "6.2.2"
create_namespace = true
namespace = "monitoring"
values = [
<<EOF
config:
clients:
- url: http://loki:3100/loki/api/v1/push
snippets:
pipelineStages:
- cri: {}
# - labeldrop:
# - filename
- match:
selector: '{namespace="default"}'
stages:
- json:
expressions:
TimeStamp: TimeStamp
level: Level
MessageTemplate: MessageTemplate
- labels:
level:
MessageTemplate:
- timestamp:
source: TimeStamp
format: RFC3339Nano
- match:
selector: '{namespace="monitoring"}'
stages:
- regex:
expression: 'level=(?P<level>\w+)'
- labels:
level:
EOF
]
}
```
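Promtail runs as a DaemonSet on every node, so a quick sanity check after the release looks like the sketch below (resource names assume the default `promtail` release name used above):
```bash
# One promtail pod should be scheduled per node
kubectl get daemonset promtail -n monitoring
kubectl get pods -n monitoring -o wide | grep promtail

# Tail the logs to confirm promtail is pushing to Loki without errors
kubectl logs daemonset/promtail -n monitoring --tail=20
```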
`pyroscope/main.tf`
```terraform!=
resource "helm_release" "pyroscope" {
name = "pyroscope-agent"
namespace = "monitoring"
repository = "https://grafana.github.io/helm-charts"
chart = "pyroscope"
version = "1.4.0"
values = [
<<EOF
pyroscope:
nodeSelector:
pool: infrapool
persistence:
enabled: true
agent:
enabled: false
EOF
]
}
```
`tempo/main.tf`
```terraform!=
resource "helm_release" "tempo" {
name = "tempo"
repository = "https://grafana.github.io/helm-charts"
chart = "tempo"
version = "1.6.1"
create_namespace = true
namespace = "monitoring"
values = [
<<EOF
nodeSelector:
pool: infrapool
tempo:
memBallastSizeMbs: 1024
multitenancyEnabled: false
reportingEnabled: true
metricsGenerator:
enabled: true
remoteWriteUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090/api/v1/write"
ingester: {}
querier: {}
queryFrontend: {}
retention: 24h
global_overrides:
max_bytes_per_trace: ${var.max_bytes_per_trace}
per_tenant_override_config: /conf/overrides.yaml
overrides: {}
server:
http_listen_port: 3100
storage:
trace:
blocklist_poll_tenant_index_builders: 1
blocklist_poll_jitter_ms: 500
backend: azure
azure:
container_name: tempo
storage_account_name: ${var.remote_state.tempo_storage_account_name}
storage_account_key: ${var.remote_state.tempo_storage_account_key}
wal:
path: /var/tempo/wal
receivers:
jaeger:
protocols:
grpc:
endpoint: 0.0.0.0:14250
thrift_binary:
endpoint: 0.0.0.0:6832
thrift_compact:
endpoint: 0.0.0.0:6831
thrift_http:
endpoint: 0.0.0.0:14268
otlp:
protocols:
grpc:
endpoint: "0.0.0.0:4317"
http:
endpoint: "0.0.0.0:4318"
extraArgs: {}
extraEnv: []
extraEnvFrom: []
extraVolumeMounts: []
config: |
multitenancy_enabled: {{ .Values.tempo.multitenancyEnabled }}
usage_report:
reporting_enabled: {{ .Values.tempo.reportingEnabled }}
compactor:
compaction:
block_retention: {{ .Values.tempo.retention }}
distributor:
receivers:
{{- toYaml .Values.tempo.receivers | nindent 8 }}
ingester:
{{- toYaml .Values.tempo.ingester | nindent 6 }}
server:
{{- toYaml .Values.tempo.server | nindent 6 }}
storage:
{{- toYaml .Values.tempo.storage | nindent 6 }}
querier:
{{- toYaml .Values.tempo.querier | nindent 6 }}
query_frontend:
{{- toYaml .Values.tempo.queryFrontend | nindent 6 }}
overrides:
{{- toYaml .Values.tempo.global_overrides | nindent 6 }}
{{- if .Values.tempo.metricsGenerator.enabled }}
metrics_generator_processors:
- 'service-graphs'
- 'span-metrics'
metrics_generator:
storage:
path: "/tmp/tempo"
remote_write:
- url: {{ .Values.tempo.metricsGenerator.remoteWriteUrl }}
{{- end }}
tempoQuery:
enabled: true
extraArgs: {}
extraEnv: []
extraVolumeMounts: []
# persistence:
# enabled: false
# # storageClassName: local-path
# accessModes:
# - ReadWriteOnce
# size: 10Gi
EOF
]
}
```
## Let's talk about Grafana
If you want to dive into `Grafana`, you will face a lot of topics to learn, including:
- Dashboard
- Alert
- Authentication
- Integrating with tons of data sources
- ...
>[!Note]
>Therefore, Grafana is a good option when you choose to run the whole MnO system for free. You can also use `Grafana Cloud` for enterprise, or a managed offering from Azure/AWS Cloud, but that is up to you :+1:

If you want to set up full dashboards for AKS, consider implementing these dashboards inside your cluster:
1. Ingress Dashboard (AKS uses Ingress): *used to look at request quality and latency, concurrency, …*
2. PostgreSQL Dashboard: *`Grafana` provides a PostgreSQL datasource that is allowed to query performance metrics from `PostgreSQL`, so you can leverage it and build a couple of dashboards yourself. Explore at [Integration Performance Query for MySQL or PostgreSQL](https://hackmd.io/@XeusNguyen/ry_nbYBxh)*
3. MongoDB Dashboard: `Prometheus` supports **exporters**, so you can install an **exporter** to expose metrics from the `MongoDB` cluster, let `Prometheus` scrape them, and visualise them in `Grafana` (see the sketch after this list). Explore at [mongodb_exporter](https://github.com/percona/mongodb_exporter)
4. RabbitMQ Dashboard: Same as `MongoDB`, you can install an exporter for `RabbitMQ`. Explore at [Monitoring with Prometheus and Grafana](https://www.rabbitmq.com/docs/prometheus)
5. Elasticsearch Dashboard: Same as `MongoDB` and `RabbitMQ`, you can install an exporter for `Elasticsearch`. Explore at [elasticsearch_exporter](https://github.com/prometheus-community/elasticsearch_exporter)
6. Redis Dashboard: You can also install an exporter for `Redis`. Explore at [redis_exporter](https://github.com/oliver006/redis_exporter)
7. ...
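As an illustration of the exporter pattern, the scrape target used earlier (`mongo-agent-prometheus-mongodb-exporter.monitoring.svc:9216`) corresponds to the prometheus-community MongoDB exporter chart installed with the release name `mongo-agent`. A hedged sketch; the `mongodb.uri` value is a placeholder you must point at your own cluster:
```bash
# Install the MongoDB exporter so its Service name matches the Prometheus scrape target
helm install mongo-agent prometheus-community/prometheus-mongodb-exporter \
  --namespace monitoring \
  --set mongodb.uri="mongodb://<user>:<password>@<mongodb-host>:27017"  # placeholder connection string
```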
If you want to dive into the `Alert` system with Grafana, don't forget to check out my blog [Deploy your alert with Grafana by Terraform and some common error with K8s](https://hackmd.io/@XeusNguyen/BJfNdXFWp)
## Troubleshooting the MnO System
### Cannot read data
**Error**: Occurs when the storage backing any component has problems.
>[!Note]
>All components in the MnO system store their data on an azure-disk whose name contains the service name.

**Troubleshoot**: Check the component on the AKS portal dashboard. Is the component attached to its azure-disk? Does the azure-disk for the service exist? Can it be attached to your service via the `helm-chart` values?
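The same checks can be done from the CLI; a minimal sketch, assuming the components live in the `monitoring` namespace:
```bash
# Are the PersistentVolumeClaims bound to an azure-disk backed PersistentVolume?
kubectl get pvc -n monitoring
kubectl get pv | grep monitoring

# Look at the Events section of the affected pod for attach/mount errors
kubectl describe pod <component-pod-name> -n monitoring
```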
### Disconnection from Prometheus
![image](https://hackmd.io/_uploads/S1mdbr4r1l.png)
**Error**: Prometheus is restarting or is not currently running.
**Troubleshoot**: Wait ~30s for Prometheus to restart and run the dashboard query again. If that does not help, use `kubectl` to **check the status of Prometheus**.
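For example (the pod and container names below are the typical ones created by `kube-prometheus-stack`; adjust to your release):
```bash
# Is the Prometheus pod running, and has it restarted recently?
kubectl get pods -n monitoring | grep prometheus

# Inspect events and recent logs if it keeps restarting
kubectl describe pod <prometheus-pod-name> -n monitoring
kubectl logs <prometheus-pod-name> -n monitoring -c prometheus --tail=50
```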
### Queries are not valid or the datasource has a problem
![image](https://hackmd.io/_uploads/SJAOZH4Bkx.png)
**Error**: Occurs when using wrong queries or when the **datasource** does not respond.
**Troubleshoot**: Check the queries again; if they look correct, use `kubectl` to look up what is happening with the **datasource** you **want to query** *(e.g. Loki, Tempo, ...)*
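A minimal sketch for checking a datasource backend directly (service names match the Terraform releases above):
```bash
# Is the datasource backend (e.g. Loki or Tempo) running?
kubectl get pods -n monitoring | grep -E 'loki|tempo'

# Port-forward the service and hit its readiness endpoint
kubectl port-forward -n monitoring svc/loki 3100:3100 &
curl http://localhost:3100/ready
```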
### Loki times out on long queries
**Error**: Occurs when the query time range is set too large, or the amount of log data stored in that time range is too big.
**Troubleshoot**: Reduce the size of the time range and make the query more specific so it returns exactly the logs you want to check.
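If you want to test a narrower query outside Grafana, `logcli` (Loki's command-line client) can run it directly against the port-forwarded service; a sketch, where the selector and filter are only examples:
```bash
kubectl port-forward -n monitoring svc/loki 3100:3100 &

# Query only the last hour and cap the number of returned lines
logcli query '{namespace="default"} |= "error"' \
  --addr=http://localhost:3100 --since=1h --limit=200
```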
### Cannot query metrics from an exporter
**Error**: Occurs when an exporter is restarting or failing, or when the DNS name used for scraping does not resolve.
**Troubleshoot**:
- Check the status of the exporter: if it is restarting, wait a few seconds; if it is in a failed state, delete the `pod` to restart it
- Double-check the scrape configuration in the Prometheus `helm-chart`: the DNS name of the exporter must match the pattern `<name-of-service>.<namespace>.svc:<port-service>`
```bash {15-30} title="kube-stack-prometheus/main.tf"
# Prometheus
prometheus:
prometheusSpec:
nodeSelector:
pool: infrapool
storageSpec:
volumeClaimTemplate:
spec:
# Use Azure Disk or Azure files (NFS) for Prometheus
storageClassName: default
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: ${var.prometheus_storage_size}
additionalScrapeConfigs:
- job_name: 'mongodb-exporter'
static_configs:
- targets: ['mongo-agent-prometheus-mongodb-exporter.monitoring.svc:9216']
- job_name: 'nginx-ingress-metrics'
static_configs:
- targets: ['ingress-nginx-controller-metrics.nginx-ingress.svc:10254']
- job_name: 'rabbitmq-exporter'
static_configs:
- targets: ['rabbitmq.infrastructure.svc:15692']
- job_name: 'es-exporter'
static_configs:
- targets: ['es-agent-prometheus-elasticsearch-exporter.monitoring.svc:9108']
- job_name: 'redis-exporter'
static_configs:
- targets: ['redis-agent-prometheus-redis-exporter.monitoring.svc:9121']
enableFeatures:
- remote-write-receiver
```
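You can also verify the service name and metrics endpoint manually; a sketch using a throwaway curl pod (the image and the mongodb-exporter target are just examples):
```bash
# Does the exporter Service exist with the expected name and port?
kubectl get svc -n monitoring | grep exporter

# Can the metrics endpoint be reached from inside the cluster?
kubectl run curl-test --rm -i --restart=Never --image=curlimages/curl -- \
  curl -s --max-time 5 http://mongo-agent-prometheus-mongodb-exporter.monitoring.svc:9216/metrics
```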
### Alerts announce `NULL` information
**Error**: Occurs when your alert spams because the Loki query behind it runs over its length limits.
**Troubleshoot**:
- Turn on `silence` for the spamming alert
- If the alert keeps failing with a confusing exception, remove it and create it again
- If the exception is about exceeding the query length, `silence` the alert and ignore it
## Conclusion
>[!Note]
>In this topic, you have learned about:
>- Monitoring and Observability of System
>- Install and setup Monitoring and Observability System
>- Dashboard of Grafana
>- Alert System by Grafana
>- Troubleshooting the Monitoring and Observability System
## Reference
- [helm-chart-grafana-values](https://github.com/grafana/helm-charts/blob/main/charts/grafana/values.yaml)
- [helm-chart-prometheus-values](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml)
- [latest-doc-grafana](https://grafana.com/docs/grafana/latest/)
- [Grafana with error when using PersistentStorage](https://github.com/helm/charts/issues/1071)
- [Feature Request: Support Optional Dashboard Reload](https://github.com/grafana/grafana/issues/1169)
- [[stable/grafana] grafana-sc-datasources container in grafana pod restarts due to errors](https://github.com/helm/charts/issues/9136)
- [Don't crash on exceptions](https://github.com/kiwigrid/k8s-sidecar/pull/6)
- [k8s-sidecar](https://github.com/kiwigrid/k8s-sidecar)