# SUSE Observability Cluster Deployment

* SUSE Observability (formerly StackState) is used to observe Kubernetes clusters and their workloads.
* SUSE Observability consists of two main parts, Server and Agent. The Server stores and displays the data; the Agent collects data and sends it to the Server.

The Server components are:

1. Topology (StackGraph)
1. Metrics (VictoriaMetrics)
1. Traces (ClickHouse)
1. Logs (ElasticSearch)

![image](https://hackmd.io/_uploads/rkq-ntBo1l.png)

## 1. Key installation notes

1. A CSI is required: install local-path-storage. If the cluster has no default storage class, set this one as the default.
2. Because the stack runs a Hadoop cluster, the nodes need more RAM.
3. Trial mode has no HA mechanism.
4. Use fixed IP addresses across the whole environment.
5. The traces feature requires the application to use the OpenTelemetry packages.
6. If you created OBS before adding local-path-storage, delete the failed PVCs before reinstalling.

## 2. IP & resource records

obsm1: 192.168.11.75
obsw1: 192.168.11.76
obsw2: 192.168.11.77

Resources per node: 4 cores, 12G memory

* Check name resolution

```
$ host obs1.example.com
obs1.example.com has address 192.168.11.76

$ host otlp-stackstate.example.com
otlp-stackstate.example.com has address 192.168.11.76

$ host otlp-http-stackstate.example.com
otlp-http-stackstate.example.com has address 192.168.11.76
```

## 3. Install local-path-storage as the default storage backend

* Install local-path

```
$ wget -O - https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.22/deploy/local-path-storage.yaml | kubectl apply -f -
```

* Set it as the default storage class. Some PVCs need a default backend when they are created.

```
$ kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

$ kubectl get sc
NAME                   PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  349d
```

## 4. Install the OBS application with Helm

:::info
The related Helm charts and values have already been downloaded and placed in the fully offline package.
If you have not downloaded them, refer to the air-gap installation mode on the official site.
:::

### 4.1 Prerequisite: generate the values parameter files

:::warning
Get your own registration code from SCC.
:::

:::warning
Pay attention to baseUrl. This is the URL used to access the OBS service, and it requires DNS configuration.
Pointing it at any one worker of the OBS Cluster is enough.
For production, deploy with DNS + LB.
:::

* Install Helm. The installation is done through Helm, and version 3.13.1 or later is required.

```
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh

$ helm version
version.BuildInfo{Version:"v3.17.1", GitCommit:"980d8ac1939e39138101364400756af2bdee1da5", GitTreeState:"clean", GoVersion:"go1.23.5"}
```

* Download the Helm charts

```
$ helm repo add suse-observability https://charts.rancher.com/server-charts/prime/suse-observability
$ helm repo update
$ helm fetch suse-observability/suse-observability
$ helm fetch suse-observability/suse-observability-values

$ ls -l | grep suse
-rw-r--r-- 1 root root 561319 Mar  4 17:56 suse-observability-2.3.0.tgz
-rw-r--r-- 1 root root   8420 Mar  4 17:57 suse-observability-values-1.0.7.tgz
```

* The following command generates `$VALUES_DIR/suse-observability-values/templates/baseConfig_values.yaml` and `$VALUES_DIR/suse-observability-values/templates/sizing_values.yaml`, which contain the settings required to install the SUSE Observability Helm chart.

```
$ export VALUES_DIR=.

# Replace the license with your own license key
$ helm template \
    --set license='xxxx-xxxx-xxxx' \
    --set baseUrl='https://obs1.example.com' \
    --set sizing.profile='trial' \
    suse-observability-values suse-observability-values-1.0.7.tgz \
    --output-dir $VALUES_DIR

$ ls -l suse-observability-values/templates/
total 8
-rw-r--r-- 1 root root  511 Mar  4 17:58 baseConfig_values.yaml
-rw-r--r-- 1 root root 3961 Mar  4 17:58 sizing_values.yaml
```

### 4.2 Check the ingress configuration (ingress-values.yaml)

:::warning
Pay attention to host. This is the URL used to access the OBS service, and it requires DNS configuration.
Pointing it at any one worker of the OBS Cluster is enough.
For production, deploy with DNS + LB.
:::

```
$ vim ingress-values.yaml
ingress:
  enabled: true
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
  hosts:
    - host: obs1.example.com
```

### 4.3 Check the traces configuration (ingress_otel_values.yaml)

:::warning
Pay attention to host. This is the URL used to access the OBS trace service, and it requires DNS configuration.
Pointing it at any one worker of the OBS Cluster is enough.
For production, deploy with DNS + LB.
:::

```
$ vim ingress_otel_values.yaml
opentelemetry-collector:
  ingress:
    enabled: true
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "50m"
      nginx.ingress.kubernetes.io/backend-protocol: GRPC
    hosts:
      - host: otlp-stackstate.example.com
        paths:
          - path: /
            pathType: Prefix
            port: 4317
    additionalIngresses:
      - name: otlp-http
        annotations:
          nginx.ingress.kubernetes.io/proxy-body-size: "50m"
        hosts:
          - host: otlp-http-stackstate.example.com
            paths:
              - path: /
                pathType: Prefix
                port: 4318
```

### 4.4 Email settings

```
$ vim mail.yaml
stackstate:
  email:
    enabled: true
    sender: "obs@lab.com"
    server:
      host: "smtp.google.com"
      port: 25
      protocol: smtp
      auth:
        username: "null"
        password: "null"
```

### 4.5 Install the OBS Cluster

```
$ helm upgrade --install \
    --namespace suse-observability \
    --create-namespace \
    --values $VALUES_DIR/suse-observability-values/templates/baseConfig_values.yaml \
    --values $VALUES_DIR/suse-observability-values/templates/sizing_values.yaml \
    --values ingress-values.yaml \
    --values ingress_otel_values.yaml \
    --values mail.yaml \
    suse-observability \
    suse-observability-2.3.0.tgz
```

* Confirm that all pods are running normally.
```
$ kubectl -n suse-observability get pod
NAME                                                              READY   STATUS      RESTARTS        AGE
suse-observability-backup-conf-05t132644-s6c2p                    0/1     Completed   0               6m48s
suse-observability-clickhouse-shard0-0                            2/2     Running     0               6m48s
suse-observability-correlate-6b6bfdf686-k7k2n                     1/1     Running     3 (2m41s ago)   6m48s
suse-observability-e2es-9586bcddb-nj5vt                           1/1     Running     2 (3m4s ago)    6m48s
suse-observability-elasticsearch-master-0                         1/1     Running     0               6m48s
suse-observability-hbase-stackgraph-0                             1/1     Running     1 (5m30s ago)   6m48s
suse-observability-hbase-tephra-0                                 1/1     Running     0               6m48s
suse-observability-kafka-0                                        2/2     Running     2 (5m21s ago)   6m37s
suse-observability-kafkaup-operator-kafkaup-84658fb49d-c72dw      1/1     Running     0               6m48s
suse-observability-otel-collector-0                               1/1     Running     0               6m48s
suse-observability-prometheus-elasticsearch-exporter-6fb6bpxtlk   1/1     Running     1 (5m49s ago)   6m48s
suse-observability-receiver-58b7bc9fbc-cqd25                      1/1     Running     3 (2m16s ago)   6m48s
suse-observability-router-7f964cfbc4-mnjjv                        1/1     Running     1 (5m47s ago)   6m48s
suse-observability-server-5578945476-w9qwg                        1/1     Running     3 (35s ago)     6m48s
suse-observability-topic-create-05t132644-p9mtp                   0/1     Completed   0               6m48s
suse-observability-ui-59c88887db-xr8pm                            2/2     Running     0               6m48s
suse-observability-victoria-metrics-0-0                           1/1     Running     0               6m48s
suse-observability-vmagent-0                                      1/1     Running     0               6m48s
suse-observability-zookeeper-0                                    1/1     Running     0               6m48s

$ kubectl -n suse-observability get ing
NAME                                          CLASS    HOSTS                              ADDRESS                                     PORTS   AGE
suse-observability                            <none>   obs1.example.com                   192.168.11.75,192.168.11.76,192.168.11.77   80      27h
suse-observability-otel-collector             <none>   otlp-stackstate.example.com        192.168.11.75,192.168.11.76,192.168.11.77   80      27h
suse-observability-otel-collector-otlp-http   <none>   otlp-http-stackstate.example.com   192.168.11.75,192.168.11.76,192.168.11.77   80      3m1s
```

* Get the API_KEY

```
$ kubectl -n suse-observability get secret suse-observability-api-key -o jsonpath='{.data.API_KEY}' | base64 -d
6MkFG3Ve942N9tdRChlvKzKRBwrlj2Ci
```

## Log in via ingress
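Before opening the UI in a browser, it can help to confirm that the ingress answers on the hostname set as `baseUrl`. The following is a minimal sketch under a few assumptions: `OBS_URL` and the `is_reachable` helper are illustrative names that do not come from any SUSE tool, treating any 2xx or 3xx response as "reachable" is a simplification, and `-k` is used on the assumption that the lab ingress certificate is not trusted by the client.

```shell
#!/usr/bin/env bash
# Sketch only: OBS_URL defaults to the baseUrl used in this guide; override it as needed.
OBS_URL="${OBS_URL:-https://obs1.example.com}"

# Hypothetical helper: succeed when the HTTP status code is 2xx or 3xx.
is_reachable() {
  case "$1" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}

# -k skips certificate verification (assumption: untrusted lab certificate).
# --max-time keeps the check from hanging when DNS or the ingress is down.
# On a connection failure curl prints 000 as the status code.
status=$(curl -k -s -o /dev/null --max-time 5 -w '%{http_code}' "$OBS_URL" || true)

if is_reachable "$status"; then
  echo "OBS UI reachable at $OBS_URL (HTTP $status)"
else
  echo "OBS UI not reachable at $OBS_URL (HTTP ${status:-none})"
fi
```

If the check reports "not reachable", the DNS records from section 2 and the `suse-observability` ingress listed above are the first things to re-check.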
![image](https://hackmd.io/_uploads/BJKFObdskx.png)

* The default account is admin. The password is stored in the `baseConfig_values.yaml` file.

```
$ cat suse-observability-values/templates/baseConfig_values.yaml | grep password
# Your SUSE Observability admin password is: x83LJNMco5Yzvo44
```

![image](https://hackmd.io/_uploads/ByWttNSokx.png)

## 5. Add a cluster to monitor

:::info
Agent cluster registration command: this step installs the agent on the downstream cluster to be monitored so that OBS can monitor it.
:::

* If the OBS UI does not show the screen pictured below, the cause may be a license problem. Check the `suse-observability-server` pod for related errors.
* Click kubernetes in the upper-left corner

![image](https://hackmd.io/_uploads/Hyh32FSjyx.png)

* In the OBS UI, enter the name of the cluster to add

![image](https://hackmd.io/_uploads/rkFTGvHikx.png)

* The agent cluster registration command is provided further down the page

![image](https://hackmd.io/_uploads/HytQQvBi1l.png)

* Get the Helm chart

```
$ helm repo add suse-observability https://charts.rancher.com/server-charts/prime/suse-observability
$ helm repo update
```

1. Modify the command so that certificate validation is skipped.
2. Run the following install command on the downstream cluster to be monitored.

```
$ helm upgrade --install \
    --namespace suse-observability \
    --create-namespace \
    --set-string 'stackstate.apiKey'='6MkFG3Ve942N9tdRChlvKzKRBwrlj2Ci' \
    --set-string 'stackstate.cluster.name'='rke2' \
    --set-string 'stackstate.url'='https://obs1.example.com/receiver/stsAgent' \
    --set-string 'global.skipSslValidation'='true' \
    --set-string 'nodeAgent.skipSslValidation'='true' \
    --set-string 'clusterAgent.skipSslValidation'='true' \
    --set-string 'logsAgent.skipSslValidation'='true' \
    --set-string 'checksAgent.skipSslValidation'='true' \
    suse-observability-agent suse-observability/suse-observability-agent
```

```
$ kubectl -n suse-observability get pod
NAME                                                      READY   STATUS    RESTARTS      AGE
suse-observability-agent-checks-agent-56b6c94bdd-tscg8    1/1     Running   1 (58s ago)   2m6s
suse-observability-agent-cluster-agent-7d87c4dbc7-mbpvl   1/1     Running   1 (92s ago)   2m6s
suse-observability-agent-logs-agent-294gt                 1/1     Running   0             2m6s
suse-observability-agent-logs-agent-qmttk                 1/1     Running   0             2m6s
suse-observability-agent-logs-agent-rf9rp                 1/1     Running   0             2m6s
suse-observability-agent-node-agent-g88bd                 2/2     Running   1 (57s ago)   2m6s
suse-observability-agent-node-agent-qxn7t                 2/2     Running   0             2m6s
suse-observability-agent-node-agent-vlwtx                 2/2     Running   0             2m6s
```

* After deployment, return to the OBS UI and confirm that the cluster has been added

![image](https://hackmd.io/_uploads/SJKb8vSs1e.png)

* You can now observe the data of the agent cluster

![image](https://hackmd.io/_uploads/SyJquDriJx.png)

## 6. Install the OpenTelemetry Collector

* To collect trace data from applications, you need to install the OpenTelemetry Collector.
* Everything below is run on the agent cluster.

```
$ vim otel-collector.yaml
extraEnvsFrom:
  - secretRef:
      name: open-telemetry-collector
mode: deployment
image:
  repository: "otel/opentelemetry-collector-k8s"
ports:
  metrics:
    enabled: true
presets:
  kubernetesAttributes:
    enabled: true
    extractAllPodLabels: true
config:
  extensions:
    bearertokenauth:
      scheme: SUSEObservability
      token: "6MkFG3Ve942N9tdRChlvKzKRBwrlj2Ci" # replace with your own api-key
  exporters:
    otlp/stackstate:
      auth:
        authenticator: bearertokenauth
      endpoint: otlp-stackstate.example.com:443
      tls:
        insecure: false
        insecure_skip_verify: true
    otlphttp/stackstate:
      auth:
        authenticator: bearertokenauth
      endpoint: otlp-http-stackstate.example.com:4318
      tls:
        insecure: false
        insecure_skip_verify: true
  processors:
    tail_sampling:
      decision_wait: 10s
      policies:
      - name: rate-limited-composite
        type: composite
        composite:
          max_total_spans_per_second: 500
          policy_order: [errors, slow-traces, rest]
          composite_sub_policy:
          - name: errors
            type: status_code
            status_code:
              status_codes: [ ERROR ]
          - name: slow-traces
            type: latency
            latency:
              threshold_ms: 1000
          - name: rest
            type: always_sample
          rate_allocation:
          - policy: errors
            percent: 33
          - policy: slow-traces
            percent: 33
          - policy: rest
            percent: 34
    resource:
      attributes:
      - key: k8s.cluster.name
        action: upsert
        value: rke2 # change to your cluster name
      - key: service.instance.id
        from_attribute: k8s.pod.uid
        action: insert
    filter/dropMissingK8sAttributes:
      error_mode: ignore
      traces:
        span:
          - resource.attributes["k8s.node.name"] == nil
          - resource.attributes["k8s.pod.uid"] == nil
          - resource.attributes["k8s.namespace.name"] == nil
          - resource.attributes["k8s.pod.name"] == nil
  connectors:
    spanmetrics:
      metrics_expiration: 5m
      namespace: otel_span
    routing/traces:
      error_mode: ignore
      table:
      - statement: route()
        pipelines: [traces/sampling, traces/spanmetrics]
  service:
    extensions:
      - health_check
      - bearertokenauth
    pipelines:
      traces:
        receivers: [otlp]
        processors: [filter/dropMissingK8sAttributes, memory_limiter, resource]
        exporters: [routing/traces]
      traces/spanmetrics:
        receivers: [routing/traces]
        processors: []
        exporters: [spanmetrics]
      traces/sampling:
        receivers: [routing/traces]
        processors: [tail_sampling, batch]
        exporters: [debug, otlp/stackstate]
      metrics:
        receivers: [otlp, spanmetrics, prometheus]
        processors: [memory_limiter, resource, batch]
        exporters: [debug, otlp/stackstate]
```

```
$ helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
```

* Fill in your own stackstate-api-key during installation

```
$ kubectl create ns open-telemetry

$ kubectl create secret generic open-telemetry-collector \
    --namespace open-telemetry \
    --from-literal=API_KEY='6MkFG3Ve942N9tdRChlvKzKRBwrlj2Ci'

$ helm upgrade \
    --install opentelemetry-collector open-telemetry/opentelemetry-collector \
    --values otel-collector.yaml \
    --namespace open-telemetry
```

```
$ kubectl -n open-telemetry get pod
NAME                                       READY   STATUS    RESTARTS   AGE
opentelemetry-collector-865cdcdf5b-v2pr5   1/1     Running   0          96s
```

## Verify the trace feature

1. Download the task tool

```
$ wget https://github.com/go-task/task/releases/download/v3.41.0/task_linux_amd64.tar.gz
$ tar -zxvf task_linux_amd64.tar.gz
$ cp task /usr/local/bin/
```

2. Download the sample used by StackState

```
$ git clone https://github.com/ravan/observability-hands-on; cd observability-hands-on
```

* First create a `.env` file and place it in the observability-hands-on folder.
* The two key parameters are `KUBECONFIG_FILE_PATH` and `KUBECONFIG_FILE_NAME`, which identify the kubeconfig directory and file. The conventional location is used here.
* You also need to replace `STS_API_KEY` with your own.

```
$ vim .env
STS_API_KEY=6MkFG3Ve942N9tdRChlvKzKRBwrlj2Ci
LOCAL_CLUSTER=false
CLUSTER_NAME=rke2
CLUSTER_OTLP_HTTP_ENDPOINT=otlp-stackstate.example.com:4318
KUBECONFIG_FILE_PATH=/root/.kube
KUBECONFIG_FILE_NAME=config
HELM_REPO=stackstate-addons
```

* Edit values.yaml

:::danger
Note!!! This endpoint does not point directly at the external ingress of the Observability cluster. It points at the collector service of the OpenTelemetry installation.
:::

```
$ cd charts/dino-kiosk/
$ vim values.yaml
blockQueue: 'no'
nameOverride: ''
fullnameOverride: ''
otelHttpEndpoint: opentelemetry-collector.open-telemetry.svc.cluster.local:4318 # change this line
```

3. Run the sample

```
rke2:~/observability-hands-on/charts/dino-kiosk # task labs:dino-kiosk:setup
```

* The following pods then start in the `museum-dino-kiosk` namespace.

```
$ kubectl -n museum-dino-kiosk get pod
NAME                               READY   STATUS    RESTARTS   AGE
ai-sim-service-85d4888d6-m24l4     1/1     Running   0          75s
build-a-dino-7c698b4df4-sgcvt      1/1     Running   0          75s
kiosk-visitors-56b4ff794f-2dtxc    1/1     Running   0          76s
kiosk-web-5d59848c57-km8q6         1/1     Running   0          76s
printer-3d-8499b76d95-llxcn        1/1     Running   0          76s
printing-queue-bd7c9b856-r9lr5     1/1     Running   0          75s
printing-service-7598cc487-d6clv   1/1     Running   0          76s
shipping-5b844565d9-wbwwh          1/1     Running   0          76s
```

* Simulate errors

```
rke2:~/observability-hands-on/charts/dino-kiosk # task labs:dino-kiosk:trigger
```

* In the OBS UI, select the corresponding cluster and namespace.

![image](https://hackmd.io/_uploads/HyQnqb_sJg.png)

* Open the target pod and click Traces to see the related information

![image](https://hackmd.io/_uploads/S1UCHf_sye.png)

## Rancher UI Extensions

To integrate SUSE Observability with Rancher, the URL must use a valid certificate (not a self-signed one).

* In the lower-left corner of the OBS UI, click CLI and copy the following sts install command

![image](https://hackmd.io/_uploads/HJN5O_Bsyg.png)

```
$ curl -o- https://dl.stackstate.com/stackstate-cli/install.sh | STS_URL="http://obs1.example.com" STS_API_TOKEN="HPoktLn_wz0J4Y7ffD0Eh7GN2dDOyc5V" bash

$ sts service-token --help
Manage service tokens.

Usage:
  sts service-token [command]

Available Commands:
  create      Create a service token
  delete      Delete a service token
  list        List service tokens

Use "sts service-token [command] --help" for more information about a command.
```

```
$ sts service-token create --name my-service-token --roles stackstate-k8s-troubleshooter
✅ Service token created: svctok-M101Ky4Ol6wffCm6hFp23npL9klLTsLH
```

* Install the OBS extension

![image](https://hackmd.io/_uploads/S1oQE4roJl.png)

* Click Install

![image](https://hackmd.io/_uploads/Hk7SE4BjJg.png)

![image](https://hackmd.io/_uploads/ryGcr4SoJe.png)

## Email notification setup

* Add a notification and select E-mail as the channel. After configuring it, click TEST to send a test email.

![image](https://hackmd.io/_uploads/ByCUhtInJx.png)

![image](https://hackmd.io/_uploads/rkm2hYUnke.png)

![image](https://hackmd.io/_uploads/BkRWGiUnyx.png)

## Uninstall

* OBS Cluster

```
$ helm -n suse-observability uninstall suse-observability
```

* OBS Agent

```
$ helm -n suse-observability uninstall suse-observability-agent
$ helm -n open-telemetry uninstall opentelemetry-collector
```

## References

https://docs.stackstate.com/get-started/k8s-suse-rancher-prime#license-key

https://play.stackstate.com/#/components/urn:kubernetes:%2Francher-rke2-cluster-0%2Fpod%2F2c829a95-cafc-413a-9476-a1b4b53b72e6/metrics?detachedFilters=cluster-name%3Arancher-rke2-cluster-0&timeRange=1741130921846_1741152521846