Note
For monitoring and observability (MnO) of a Kubernetes cluster, you have many options nowadays, such as the Prometheus/Grafana stack used in this post: Prometheus, Grafana, Loki, Promtail, Tempo and Pyroscope. Each tool has a different use, covering metrics, dashboards, logs, log shipping, traces and continuous profiling respectively.
Most components of the MnO system are open-source projects that provide Helm charts, which help you set them up on your cluster easily. You can find the charts in each project's Helm repository (the Grafana and Prometheus community repositories are used below).
Tip
You have multiple ways to apply these charts to your cluster. In the best case, you can use Helm together with Terraform through the helm_release resource of the Helm provider, which is truly convenient.
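If you would rather not use Terraform, the same chart can also be installed with the plain Helm CLI. This is only a minimal sketch: the repository URL, chart name, version and namespace are taken from the Terraform example below, and values.yaml is a hypothetical file standing in for the inline values.
# Add the Grafana chart repository (the same repository the helm_release below uses)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install or upgrade the Loki chart into the monitoring namespace with your values file
helm upgrade --install loki grafana/loki \
  --version 2.13.3 \
  --namespace monitoring --create-namespace \
  -f values.yaml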
For example, you can configure Loki like this:
loki/main.tf
resource "kubernetes_secret" "loki_storage_account" {
metadata {
name = "storage-account-key"
namespace = "monitoring"
}
data = {
STORAGE_ACCOUNT_KEY = var.remote_state.loki_storage_account_key
STORAGE_ACCOUNT_NAME = var.remote_state.loki_storage_account_name
}
}
resource "helm_release" "loki" {
name = "loki"
repository = "https://grafana.github.io/helm-charts"
chart = "loki"
version = "2.13.3"
create_namespace = true
namespace = "monitoring"
values = [
<<EOF
extraArgs:
config.expand-env: true
extraEnvFrom:
- secretRef:
name: storage-account-key
rbac:
pspEnabled: false
nodeSelector:
pool: infrapool
config:
server:
grpc_server_max_recv_msg_size: ${var.max_recv_sent_msg_size_server}
grpc_server_max_send_msg_size: ${var.max_recv_sent_msg_size_server}
schema_config:
configs:
- from: 2022-07-18
store: boltdb-shipper
object_store: azure
schema: v11
index:
prefix: loki_index_
period: 24h
querier:
query_timeout: 5m
engine:
timeout: 8m
storage_config:
boltdb_shipper:
shared_store: azure
active_index_directory: /data/loki/boltdb-shipper-active
cache_location: /data/loki/boltdb-shipper-cache
cache_ttl: 24h
filesystem:
directory: /data/loki/chunks
# To configure the long-term storage to be Azure Blob Storage
azure:
# Name of container under the storage account
container_name: loki
# Name of storage account
account_name: $${STORAGE_ACCOUNT_NAME}
# Storage account key as an env from secret
account_key: $${STORAGE_ACCOUNT_KEY}
request_timeout: 0
# Needed for Alerting: https://grafana.com/docs/loki/latest/rules/
# This is just a simple example, for more details: https://grafana.com/docs/loki/latest/configuration/#ruler_config
ruler:
storage:
type: local
local:
directory: /tmp/rules
rule_path: /tmp/scratch
alertmanager_url: http://kube-prometheus-stack-alertmanager.monitoring.svc:9093
enable_alertmanager_v2: true
ring:
kvstore:
store: inmemory
enable_api: true
alerting_groups:
- name: example
rules:
- alert: HighThroughputLogStreams
expr: sum by(container) (rate({job=~"loki-dev/.*"}[1m])) > 1000
for: 2m
EOF
]
depends_on = [
kubernetes_secret.loki_storage_account
]
}
After applying the Terraform, this MnO component is released to the monitoring namespace; you can verify the rollout as shown below.
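A minimal sketch of the verification, assuming kubectl points at the same cluster (pod and service names depend on the chart defaults):
# Release the chart through Terraform
terraform apply
# Confirm that the Loki pods and service came up in the monitoring namespace
kubectl get pods -n monitoring
kubectl get svc -n monitoring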
You can handle Grafana, Prometheus, Promtail, Tempo and Pyroscope with the same idea:
kube_stack_prometheus/main.tf
resource "helm_release" "kube-prometheus-stack" {
name = "kube-prometheus-stack"
repository = "https://prometheus-community.github.io/helm-charts"
chart = "kube-prometheus-stack"
version = "41.1.0"
create_namespace = true
namespace = "monitoring"
values = [
<<EOF
# Since AKS is a managed Kubernetes service, it doesn’t allow you to see internal components such as the etcd store, the controller manager, the scheduler, etc.
# See https://techcommunity.microsoft.com/t5/apps-on-azure-blog/using-azure-kubernetes-service-with-grafana-and-prometheus/ba-p/3020459
kubeEtcd:
enabled: false
kubeControllerManager:
enabled: false
kubeScheduler:
enabled: false
kubeProxy:
enabled: false
# State metrics
kube-state-metrics:
image:
repository: registry.k8s.io/kube-state-metrics/kube-state-metrics
tag: v2.10.0
nodeSelector:
pool: infrapool
# Prometheus operator
prometheusOperator:
nodeSelector:
pool: infrapool
# Alertmanager
alertmanager:
enabled: false
# alertmanagerSpec:
# nodeSelector:
# pool: infrapool
# storage:
# volumeClaimTemplate:
# spec:
# accessModes: ["ReadWriteOnce"]
# resources:
# requests:
# storage: ${var.alertmanager_storage_size}
# Prometheus
prometheus:
prometheusSpec:
nodeSelector:
pool: infrapool
storageSpec:
volumeClaimTemplate:
spec:
# Use Azure Disk or Azure files (NFS) for Prometheus
storageClassName: default
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: ${var.prometheus_storage_size}
additionalScrapeConfigs:
- job_name: 'mongodb-exporter'
static_configs:
- targets: ['mongo-agent-prometheus-mongodb-exporter.monitoring.svc:9216']
- job_name: 'nginx-ingress-metrics'
static_configs:
- targets: ['ingress-nginx-controller-metrics.nginx-ingress.svc:10254']
- job_name: 'rabbitmq-exporter'
static_configs:
- targets: ['rabbitmq.infrastructure.svc:15692']
- job_name: 'es-exporter'
static_configs:
- targets: ['es-agent-prometheus-elasticsearch-exporter.monitoring.svc:9108']
- job_name: 'redis-exporter'
static_configs:
- targets: ['redis-agent-prometheus-redis-exporter.monitoring.svc:9121']
enableFeatures:
- remote-write-receiver
# Grafana
grafana:
image:
tag: "10.0.0"
adminPassword: ${random_string.grafana_admin_password.result}
rbac:
pspEnabled: false
persistence:
enabled: true
size: ${var.grafana_storage_size}
nodeSelector:
pool: infrapool
ingress:
enabled: true
ingressClassName: nginx
annotations: {}
labels: {}
path: /
pathType: Prefix
hosts:
- ${var.monitoring_domain_config}
extraPaths: []
tls:
- secretName: https-certificate-monitoring
hosts:
- ${var.monitoring_domain_config}
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
url: http://kube-prometheus-stack-prometheus:9090/
access: proxy
isDefault: false
editable: true
# - name: AlertManager
# type: alertmanager
# uid: alertmanager
# url: http://kube-stack-prometheus-kube-alertmanager.monitoring:9093/
# access: proxy
# isDefault: false
# editable: true
# jsonData:
# implementation: prometheus
# exemplarTraceIdDestinations:
# - name: trace_id
# datasourceUid: tempo
# urlDisplayLabel: View in tempo
- name: loki
type: loki
uid: loki_svc
url: http://loki.monitoring.svc:3100
access: proxy
isDefault: false
editable: true
jsonData:
manageAlerts: true
alertmanagerUid: alertmanager
derivedFields:
- datasourceUid: tempo
matcherRegex: ((\w+){16})
name: trace_id
url: '$${__value.raw}'
- name: tempo
type: tempo
uid: tempo
url: http://tempo.monitoring.svc:3100
access: proxy
basicAuth: false
isDefault: false
editable: true
jsonData:
httpMethod: GET
tracesToLogs:
datasourceUid: "loki"
filterByTraceID: true
filterBySpanID: true
lokiSearch:
datasourceUid: 'loki'
EOF
]
}
promtail/main.tf
resource "helm_release" "promtail" {
name = "promtail"
repository = "https://grafana.github.io/helm-charts"
chart = "promtail"
version = "6.2.2"
create_namespace = true
namespace = "monitoring"
values = [
<<EOF
config:
clients:
- url: http://loki:3100/loki/api/v1/push
snippets:
pipelineStages:
- cri: {}
# - labeldrop:
# - filename
- match:
selector: '{namespace="default"}'
stages:
- json:
expressions:
TimeStamp: TimeStamp
level: Level
MessageTemplate: MessageTemplate
- labels:
level:
MessageTemplate:
- timestamp:
source: TimeStamp
format: RFC3339Nano
- match:
selector: '{namespace="monitoring"}'
stages:
- regex:
expression: 'level=(?P<level>\w+)'
- labels:
level:
EOF
]
}
pyroscope/main.tf
resource "helm_release" "pyroscope" {
name = "pyroscope-agent"
namespace = "monitoring"
repository = "https://grafana.github.io/helm-charts"
chart = "pyroscope"
version = "1.4.0"
values = [
<<EOF
pyroscope:
nodeSelector:
pool: infrapool
persistence:
enabled: true
agent:
enabled: false
EOF
]
}
tempo/main.tf
resource "helm_release" "tempo" {
name = "tempo"
repository = "https://grafana.github.io/helm-charts"
chart = "tempo"
version = "1.6.1"
create_namespace = true
namespace = "monitoring"
values = [
<<EOF
nodeSelector:
pool: infrapool
tempo:
memBallastSizeMbs: 1024
multitenancyEnabled: false
reportingEnabled: true
metricsGenerator:
enabled: true
remoteWriteUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090/api/v1/write"
ingester: {}
querier: {}
queryFrontend: {}
retention: 24h
global_overrides:
max_bytes_per_trace: ${var.max_bytes_per_trace}
per_tenant_override_config: /conf/overrides.yaml
overrides: {}
server:
http_listen_port: 3100
storage:
trace:
blocklist_poll_tenant_index_builders: 1
blocklist_poll_jitter_ms: 500
backend: azure
azure:
container_name: tempo
storage_account_name: ${var.remote_state.tempo_storage_account_name}
storage_account_key: ${var.remote_state.tempo_storage_account_key}
wal:
path: /var/tempo/wal
receivers:
jaeger:
protocols:
grpc:
endpoint: 0.0.0.0:14250
thrift_binary:
endpoint: 0.0.0.0:6832
thrift_compact:
endpoint: 0.0.0.0:6831
thrift_http:
endpoint: 0.0.0.0:14268
otlp:
protocols:
grpc:
endpoint: "0.0.0.0:4317"
http:
endpoint: "0.0.0.0:4318"
extraArgs: {}
extraEnv: []
extraEnvFrom: []
extraVolumeMounts: []
config: |
multitenancy_enabled: {{ .Values.tempo.multitenancyEnabled }}
usage_report:
reporting_enabled: {{ .Values.tempo.reportingEnabled }}
compactor:
compaction:
block_retention: {{ .Values.tempo.retention }}
distributor:
receivers:
{{- toYaml .Values.tempo.receivers | nindent 8 }}
ingester:
{{- toYaml .Values.tempo.ingester | nindent 6 }}
server:
{{- toYaml .Values.tempo.server | nindent 6 }}
storage:
{{- toYaml .Values.tempo.storage | nindent 6 }}
querier:
{{- toYaml .Values.tempo.querier | nindent 6 }}
query_frontend:
{{- toYaml .Values.tempo.queryFrontend | nindent 6 }}
overrides:
{{- toYaml .Values.tempo.global_overrides | nindent 6 }}
{{- if .Values.tempo.metricsGenerator.enabled }}
metrics_generator_processors:
- 'service-graphs'
- 'span-metrics'
metrics_generator:
storage:
path: "/tmp/tempo"
remote_write:
- url: {{ .Values.tempo.metricsGenerator.remoteWriteUrl }}
{{- end }}
tempoQuery:
enabled: true
extraArgs: {}
extraEnv: []
extraVolumeMounts: []
# persistence:
# enabled: false
# # storageClassName: local-path
# accessModes:
# - ReadWriteOnce
# size: 10Gi
EOF
]
}
If you want to dive into Grafana, you will face a lot of topics to learn, including datasources, dashboards and alerting.
Note
That is what makes Grafana a good option when you want the whole MnO system for free. You can also use Grafana Cloud for enterprise, or a managed offering from Azure/AWS, but that is up to you.
If you want to set up full dashboards for AKS, you can take a look at the following ideas and implement them inside your cluster:
Grafana provides a PostgreSQL datasource; given permission to query performance metrics from PostgreSQL, you can leverage it to build a couple of dashboards yourself. Explore it at Integration Performance Query for MySQL or PostgreSQL.
Prometheus lets you use exporters, so you can install an exporter to expose metrics from your MongoDB cluster; Prometheus scrapes them and you visualise them in Grafana. Explore it at mongodb_exporter.
Like MongoDB, you can install an exporter for RabbitMQ. Explore it at Monitoring with Prometheus and Grafana.
Like MongoDB and RabbitMQ, you can install an exporter for Elasticsearch. Explore it at elasticsearch_exporter.
The same goes for Redis. Explore it at redis_exporter. A sketch of installing one of these exporters follows below.
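A minimal sketch of installing the MongoDB exporter with Helm, assuming the prometheus-community chart and a release name that yields the service name scraped in the kube-prometheus-stack values above (the mongodb.uri value is a hypothetical connection string you must replace):
# Add the Prometheus community chart repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install the exporter; release name "mongo-agent" produces the service
# mongo-agent-prometheus-mongodb-exporter.monitoring.svc:9216 used as a scrape target
helm upgrade --install mongo-agent prometheus-community/prometheus-mongodb-exporter \
  --namespace monitoring \
  --set mongodb.uri="mongodb://user:password@mongodb.infrastructure.svc:27017"
The RabbitMQ, Elasticsearch and Redis exporters can be installed the same way from their respective charts.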
If you want to dive into the Alert system with Grafana, don't forget to check out my blog Deploy your alert with Grafana by Terraform. And here are some common errors with K8s:
Error: Occurs when your storage has problems with any component.
Note
All components in the MnO system already store their data on an Azure Disk whose name contains the service name.
Troubleshoot: Check the component on the AKS portal dashboard. Is the component attached to its Azure Disk? Does the Azure Disk for the service exist? Can you attach it to your service via the Helm chart values? A kubectl sketch follows below.
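A minimal sketch of checking the storage from the cluster side, assuming the components run in the monitoring namespace (claim and pod names depend on the charts):
# List the persistent volume claims created by the charts
kubectl get pvc -n monitoring
# Inspect a claim to see whether its Azure Disk was provisioned and bound
kubectl describe pvc <pvc-name> -n monitoring
# Pod events show attach/mount failures for the disk
kubectl describe pod <pod-name> -n monitoring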
Error: Prometheus is restarting or is currently not running.
Troubleshoot: Wait 30s for Prometheus to restart and query the dashboard again. If that does not help, use kubectl to check the status of Prometheus.
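A minimal sketch, assuming the kube-prometheus-stack release above (labels, container and StatefulSet names depend on the release name and chart version):
# Check the Prometheus pods created by the operator
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
# Look at restart reasons and recent logs
kubectl describe pod -n monitoring -l app.kubernetes.io/name=prometheus
kubectl logs -n monitoring statefulset/prometheus-kube-prometheus-stack-prometheus -c prometheus --tail=100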
Error: Occurs when you use wrong queries or a datasource does not respond.
Troubleshoot: Check your queries again. If they are correct, use kubectl to look up what is happening with the datasource you are querying (e.g. Loki, Tempo, …).
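A minimal sketch for checking the Loki datasource from the cluster, assuming the service and release names from the charts above (Tempo can be checked the same way against its own service):
# Check the Loki pods (the label may differ between chart versions)
kubectl get pods -n monitoring -l app=loki
# Port-forward the Loki service and hit its readiness endpoint
kubectl port-forward -n monitoring svc/loki 3100:3100 &
curl http://localhost:3100/ready
# Tail the logs for errors
kubectl logs -n monitoring statefulset/loki --tail=100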
Error: Occurs when the query time range is set too large, or the logs for that time range are no longer stored.
Troubleshoot: Reduce the time range and choose it more specifically so that you get exactly the logs you want to check.
Error: Occurs when an exporter is restarting or failing, or the DNS name being scraped is not working.
Troubleshoot:
Check the pod and restart it if its state is a failure.
Check in the Helm chart values that the DNS name of the exporter matches the pattern <name-of-service>.<namespace>.svc:<port-service>, as in the scrape config below; a quick DNS check sketch follows after it.
# Prometheus
prometheus:
prometheusSpec:
nodeSelector:
pool: infrapool
storageSpec:
volumeClaimTemplate:
spec:
# Use Azure Disk or Azure files (NFS) for Prometheus
storageClassName: default
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: ${var.prometheus_storage_size}
additionalScrapeConfigs:
- job_name: 'mongodb-exporter'
static_configs:
- targets: ['mongo-agent-prometheus-mongodb-exporter.monitoring.svc:9216']
- job_name: 'nginx-ingress-metrics'
static_configs:
- targets: ['ingress-nginx-controller-metrics.nginx-ingress.svc:10254']
- job_name: 'rabbitmq-exporter'
static_configs:
- targets: ['rabbitmq.infrastructure.svc:15692']
- job_name: 'es-exporter'
static_configs:
- targets: ['es-agent-prometheus-elasticsearch-exporter.monitoring.svc:9108']
- job_name: 'redis-exporter'
static_configs:
- targets: ['redis-agent-prometheus-redis-exporter.monitoring.svc:9121']
enableFeatures:
- remote-write-receiver
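A minimal sketch for verifying that a scrape target's DNS name resolves and serves metrics, using the MongoDB exporter target above as an example (run from throwaway pods inside the cluster):
# Resolve the exporter's in-cluster DNS name (pattern: <name-of-service>.<namespace>.svc)
kubectl run dns-test --rm -it --restart=Never --image=busybox -- \
  nslookup mongo-agent-prometheus-mongodb-exporter.monitoring.svc
# Fetch the metrics endpoint that Prometheus scrapes
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s http://mongo-agent-prometheus-mongodb-exporter.monitoring.svc:9216/metrics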
Error: Occurs when your alert is spamming because the Loki query it is based on is too long or too broad.
Troubleshoot:
Turn on silent mode for the spamming alert.
Silence the alert and ignore it.
Note
Through this topic, you should now understand how to deploy a full monitoring and observability stack (Prometheus, Grafana, Loki, Promtail, Tempo and Pyroscope) on Kubernetes with Terraform and Helm, and how to troubleshoot its common errors.