This post shows how to implement log-based alerts with OpenShift Logging 6.x and Loki.

- OpenShift 4.16+
- Logging Operator 6+
- Loki Operator 6+
- Cluster Observability Operator

![image](https://hackmd.io/_uploads/B1eI9tosxg.png)

The logical architecture is roughly as follows:

![image](https://hackmd.io/_uploads/HkLbntjoll.png)

1. Follow https://docs.redhat.com/en/documentation/red_hat_openshift_logging/6.3/html/installing_logging/installing-logging#installing-loki-and-logging-gui_installing-logging to install:
    - Logging Operator 6+
    - Loki Operator 6+
    - Cluster Observability Operator

Then create a LokiStack. Note that the operators must be installed in the namespaces specified in the documentation, and the rules feature must be enabled under `spec.rules`. The `selector` inside `rules` determines which AlertingRule objects the Loki ruler picks up.

LokiStack

```yaml=
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: lokistack-sample
  namespace: openshift-logging
spec:
  hashRing:
    type: memberlist
  limits:
    global:
      queries:
        queryTimeout: 3m
  managementState: Managed
  rules:
    enabled: true
    selector:
      matchLabels:
        openshift.io/cluster-monitoring: 'true'
  size: 1x.extra-small
  storage:
    schemas:
      - effectiveDate: '2020-10-11'
        version: v11
    secret:
      credentialMode: static
      name: logging-loki-s3
      type: s3
  storageClassName: gp2-csi
  tenants:
    mode: openshift-logging
```

After the Cluster Logging Operator is installed, create a ClusterLogForwarder.
**Make sure to list the sources explicitly under `spec.inputs.infrastructure` and include `node`; otherwise no journal logs are collected.**

Cluster Log forwarder

```yaml=
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  inputs:
    - infrastructure:
        sources:
          - node # <------- this entry is required
          - container
      name: infra-logs
      type: infrastructure
  managementState: Managed
  outputs:
    - lokiStack:
        authentication:
          token:
            from: serviceAccount
        target:
          name: lokistack-sample
          namespace: openshift-logging
      name: lokistack-out
      tls:
        ca:
          configMapName: openshift-service-ca.crt
          key: service-ca.crt
      type: lokiStack
  pipelines:
    - inputRefs:
        - application
        - infra-logs
      name: infra-app-logs
      outputRefs:
        - lokistack-out
  serviceAccount:
    name: logging-collector
```

Create the Loki AlertingRule.
**It must carry a label, and that label must match the selector under `rules` in the LokiStack.**

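To make the label requirement concrete, here is a minimal Python sketch of how a `matchLabels` selector is evaluated: every key/value pair in the selector has to be present on the object. The function name `matches_selector` and the extra `team` label are hypothetical and only used for illustration; this is not the Loki Operator's actual code.

```python
def matches_selector(match_labels: dict[str, str], labels: dict[str, str]) -> bool:
    """Return True when every key/value pair in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


# Value copied from spec.rules.selector.matchLabels of the LokiStack in this post.
lokistack_selector = {"openshift.io/cluster-monitoring": "true"}

# Labels on a hypothetical AlertingRule; extra labels do not prevent a match.
rule_labels = {"openshift.io/cluster-monitoring": "true", "team": "platform"}

print(matches_selector(lokistack_selector, rule_labels))  # True
print(matches_selector(lokistack_selector, {}))  # False: an unlabeled rule is ignored
```

Label values are strings, which is why the manifests quote `'true'`; an unquoted `true` would be parsed by YAML as a boolean.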
AlertingRule

This rule selects all logs in the LokiStack with `log_type=infrastructure`, keeps the lines that contain `soft lockup`, parses them as JSON, and finally keeps only the entries with `log_source=node`.
The alert fires as soon as a matching log appears at least once within the past day.

```yaml=
apiVersion: loki.grafana.com/v1
kind: AlertingRule
metadata:
  labels:
    openshift.io/cluster-monitoring: 'true'
  name: loki-operator-alerts-01
  namespace: openshift-logging
spec:
  groups:
    - interval: 1m
      name: soft-lockup-alert
      rules:
        - alert: SoftLockupDetected
          annotations:
            description: |
              Watchdog BUG: soft lockup found in logs on node - {{ $labels.k8s_node_name }} . Full message - {{ $labels.message }}
            summary: Soft lockup detected in kernel logs
          expr: |
            count_over_time(
              { log_type=~"infrastructure" } |~ "soft lockup" | json | log_source="node" [1d]
            ) > 0
          for: 0s
          labels:
            severity: critical
  tenantID: infrastructure
```

After applying it, you can see it on the Alert page in OpenShift:

![image](https://hackmd.io/_uploads/HJoOAtijeg.png)

We can manually generate matching log entries on a node:

```shell=
[lab-user@bastion ~]$ oc debug node/ip-10-0-53-13.ap-southeast-1.compute.internal
Temporary namespace openshift-debug-kq82v is created for debugging node...
Starting pod/ip-10-0-53-13ap-southeast-1computeinternal-debug-xkj22 ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.53.13
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
sh-5.1# logger -t kernel "watchdog: BUG: soft lockup - CPU#$((RANDOM % 32)) stuck for $((RANDOM % 500 + 100))s! [kube-rbac-proxy:$((RANDOM % 9000 + 1000))]"
```

The alert fires successfully:

![image](https://hackmd.io/_uploads/rkN8ddjsgx.png)

## References

- Custom logging alerts: https://docs.redhat.com/en/documentation/red_hat_openshift_logging/6.3/html/logging_alerts/custom-logging-alerts-1#configuring-logging-loki-ruler_custom-logging-
- Cluster Log forwarder example: https://github.com/openshift/cluster-logging-operator/blob/0bbb53dc1ebbfa9838339ea5667d3982fd3f2095/docs/reference/samples/observability.inputs-app-audit-infra.yaml#L4