# ECK on AKS – Deployment & Troubleshooting Guide

> This README summarizes the **end-to-end setup, validation, and troubleshooting** for deploying **Elastic Cloud on Kubernetes (ECK)** with **Elasticsearch + Kibana** on **AKS**, based on the current implementation and the issues resolved so far.

---

## 1. Architecture Overview

### Components

| Component      | Namespace           | Deployment Method | Purpose                            |
| -------------- | ------------------- | ----------------- | ---------------------------------- |
| ECK Operator   | `elastic-system`    | Helm              | Manages lifecycle of Elastic Stack |
| Elasticsearch  | `elastic`           | ECK CRD (YAML)    | Search / storage engine            |
| Kibana         | `elastic`           | ECK CRD (YAML)    | UI / visualization                 |
| Ingress (PDNG) | `pdng` or `default` | YAML / Kustomize  | API routing                        |

### Key Design Decisions

* **Helm is used only for the ECK Operator**
* **Elasticsearch / Kibana are managed via CRDs (not Helm)**
* **Kustomize** is used to separate `base` and `env (dev/prod)`
* **Elasticsearch is NOT exposed publicly**
* **Kibana should use a dedicated host (recommended)**

---

## 2. Repository Structure

```
AKS/
├─ helm/
│  └─ eck-operator/
│     ├─ Chart.yaml
│     └─ values.yaml
│
└─ yaml/
   ├─ base/
   │  ├─ kustomization.yaml
   │  ├─ elastic/
   │  │  ├─ namespace.yaml
   │  │  ├─ elasticsearch.yaml
   │  │  └─ kibana.yaml
   │  ├─ pdng-ingress.yaml
   │  ├─ pdng-ingress-agw.yaml
   │  └─ other-secrets.yaml
   │
   ├─ dev/
   │  └─ kustomization.yaml
   └─ prod/
      └─ kustomization.yaml
```

---

## 3. Step-by-Step Deployment

### 3.1 Install ECK Operator (Helm)

```bash
helm repo add elastic https://helm.elastic.co
helm repo update

cd AKS/helm/eck-operator
helm dependency build

helm upgrade --install eck-operator . \
  -n elastic-system --create-namespace
```

### Verify Operator

```bash
kubectl -n elastic-system get pods
kubectl get crd | grep elastic
```

Expected CRDs:

* `elasticsearches.elasticsearch.k8s.elastic.co`
* `kibanas.kibana.k8s.elastic.co`

---

### 3.2 Deploy Elastic Stack (Base)

```bash
kubectl apply -k AKS/helm/elastic
```

This creates:

* Namespace `elastic`
* Elasticsearch `es`
* Kibana `kb`

---

## 4. Validation Checklist

### 4.1 Elasticsearch

```bash
kubectl -n elastic get elasticsearch
```

Expected:

```
NAME   HEALTH   NODES   VERSION   PHASE
es     green    1       8.13.x    Ready
```

### 4.2 Kibana

```bash
kubectl -n elastic get kibana
kubectl -n elastic get pods
```

Expected:

* Pod: `Running 1/1`
* Health: `green`

---

## 5. Kibana Readiness Issue (Root Cause & Fix)

### ❌ Observed Issue

* Kibana pod stuck at `Ready: False`
* `HEALTH = red`
* Readiness probe fails on:

```
https://<pod-ip>:5601/login → 404
```

### 🔍 Root Cause

Kibana was configured with:

```yaml
server.basePath: /kibana
server.rewriteBasePath: true
```

But the **ECK readiness probe is hardcoded to `/login`**, not `/kibana/login`.

Result:

* Kibana is actually running
* The readiness probe never succeeds
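To confirm this root cause before changing anything, you can inspect the probe that ECK rendered on the Kibana pod and compare the two login paths from inside it. This is a quick diagnostic sketch; `<kibana-pod>` is a placeholder, and the label selector follows the `kb` resource used throughout this guide.

```bash
# Show the readiness probe ECK rendered for the Kibana container
kubectl -n elastic get pod -l kibana.k8s.elastic.co/name=kb -o yaml | grep -A 6 readinessProbe

# Compare the two login paths from inside the pod
# (with basePath + rewriteBasePath set, /login returns 404 while /kibana/login answers)
kubectl -n elastic exec -it <kibana-pod> -- curl -ksI https://localhost:5601/login | head -n 1
kubectl -n elastic exec -it <kibana-pod> -- curl -ksI https://localhost:5601/kibana/login | head -n 1
```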
---

## 6. Recommended Fix (Production-Grade)

### ✅ Use a Dedicated Kibana Host (NO basePath)

#### Kibana CRD (Fixed)

```yaml
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kb
  namespace: elastic
spec:
  version: 8.13.4
  count: 1
  elasticsearchRef:
    name: es
```

#### Kibana Ingress (Dedicated Host)

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kibana-ingress
  namespace: elastic
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  ingressClassName: pdng-nginx-ingress-class-cus-dev
  rules:
    - host: kibana-dev.corp.hpicloud.net
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kb-kb-http
                port:
                  number: 5601
```

### Result

* Readiness probe `/login` → ✅ 200
* Kibana Pod → `Ready`
* Kibana Health → `green`

---

## 7. Access Credentials

### Elasticsearch / Kibana Default User

```bash
kubectl -n elastic get secret es-es-elastic-user -o jsonpath='{.data.elastic}' | base64 --decode
```

User:

```
elastic
```

---

## 8. Key Lessons Learned

* ❌ Do NOT use path-based ingress (`/kibana`) with ECK Kibana
* ✅ Use a **dedicated host** for Kibana
* ✅ Keep Operator (Helm) and Stack (CRD) responsibilities separated
* ✅ Always include Elastic resources in `base/kustomization.yaml`
* ⚠️ Avoid deploying duplicate ingress rules across namespaces

---

## 9. Current Status

| Component     | Status                    |
| ------------- | ------------------------- |
| ECK Operator  | ✅ Running                 |
| Elasticsearch | ✅ Green                   |
| Kibana        | ⚠️ Needs basePath removal |
| Ingress       | ⚠️ Needs consolidation    |

---

# Kibana (ECK) + NGINX Ingress (Helm) Deployment Notes (copy-paste ready)

> Time zone: Asia/Taipei
> Example environment: AKS + ECK (Elasticsearch/Kibana 8.13.4)
> Goal: expose Kibana in the `elastic` namespace through the **NGINX Ingress** at `https://pdng-dev.corp.hpicloud.net/kibana`
> Note: this document captures only the steps and commands we actually ran and verified.

---

## 0. Prerequisites and Key Concepts

### 0.1 An Ingress cannot point at a Service in another namespace

- The Kibana Service is `kb-kb-http`, inside the `elastic` namespace.
- The Kibana Ingress therefore **must** be created in the `elastic` namespace (otherwise you will always get 502 / backend not found).

### 0.2 The Kibana Service port 5601 is HTTPS (ECK self-signed TLS)

The relevant Service fields are:

- `ports[].name: https`
- `port: 5601`

So the NGINX Ingress must carry:

- `nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"`

### 0.3 The `/kibana` sub-path commonly causes 404s / broken asset paths

If the Ingress forwards correctly but you still see 404s or broken page assets, the usual cause is that **Kibana has no basePath configured**:

- `server.basePath: "/kibana"`
- `server.rewriteBasePath: true`
- `server.publicBaseUrl: "https://<host>/kibana"`

Verify that Kibana itself works via port-forward first, then deal with the basePath.

---

## 1. Quick Inventory of the Current State (confirm ingress classes / namespaces)

### 1.1 List existing Ingresses (including namespace)

```bash
kubectl get ingress --all-namespaces | grep pdng-ingress
```

Key observations (example):

- `pdng` namespace: `pdng-ingress` (nginx class) and `pdng-ingress-agw` (AGW)
- `elastic` namespace: Kibana / Elasticsearch (ECK CRDs)

---

## 2. Kibana / Elasticsearch (ECK) CRDs (current state)

### 2.1 Kibana (ECK)

```yaml
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kb
  namespace: elastic
spec:
  version: 8.13.4
  count: 1
  elasticsearchRef:
    name: es
  http:
    tls:
      selfSignedCertificate:
        disabled: false
```

### 2.2 Elasticsearch (ECK)

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es
  namespace: elastic
spec:
  version: 8.13.4
  http:
    tls:
      selfSignedCertificate:
        disabled: false
  nodeSets:
    - name: default
      count: 1
      config:
        node.store.allow_mmap: false
```

---

## 3. Confirm the Kibana Service (very important: service name + HTTPS)

```bash
kubectl -n elastic get svc kb-kb-http -o yaml | sed -n '1,120p'
```

Confirmed:

- Service name: `kb-kb-http`
- Port: `5601`
- `ports[].name: https` → the upstream is HTTPS
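If you only need the port name and number (to confirm the HTTPS upstream) rather than the full manifest, a jsonpath query does it in one line. A small convenience sketch:

```bash
# Prints one "name -> port" pair per Service port; expect "https -> 5601"
kubectl -n elastic get svc kb-kb-http \
  -o jsonpath='{range .spec.ports[*]}{.name}{" -> "}{.port}{"\n"}{end}'
```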
---

## 4. Generate the Kibana Ingress with Helm (core step)

> The repo currently uses a single chart (`Chart.yaml` lives in `INFRASTRUCTURE/AKS/helm/`),
> with `kibana-ingress.yaml` added under `templates/`.
> So the chart path is simply `.` (run Helm from the same directory as `Chart.yaml`).

### 4.1 Template: kibana-ingress.yaml (wrapped in an `enabled` flag)

Wrap the entire resource:

```yaml
{{- if .Values.kibanaIngress.enabled }}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ .Values.kibanaIngress.name | default "kibana-ingress" }}
  {{- if .Values.kibanaIngress.annotations }}
  annotations:
    {{- toYaml .Values.kibanaIngress.annotations | nindent 4 }}
  {{- end }}
spec:
  {{- if .Values.kibanaIngress.ingressClassName }}
  ingressClassName: {{ .Values.kibanaIngress.ingressClassName }}
  {{- end }}
  rules:
    - {{- if .Values.kibanaIngress.host }}
      host: {{ .Values.kibanaIngress.host }}
      {{- end }}
      http:
        paths:
          - path: {{ .Values.kibanaIngress.path | default "/kibana(/|$)(.*)" }}
            pathType: {{ .Values.kibanaIngress.pathType | default "ImplementationSpecific" }}
            backend:
              service:
                name: {{ .Values.kibanaIngress.backend.service.name | default "kb-kb-http" }}
                port:
                  number: {{ .Values.kibanaIngress.backend.service.port.number | default 5601 }}
  {{- if .Values.kibanaIngress.tlsEnabled }}
  tls:
    - hosts:
        - {{ .Values.kibanaIngress.host }}
      secretName: {{ .Values.kibanaIngress.tlsSecretName | default "kibana-tls" }}
  {{- end }}
{{- end }}
```

### 4.2 values/dev-values.yaml: add the `kibanaIngress` block

> Key point: the backend is `kb-kb-http` and the upstream is HTTPS, so the `backend-protocol: "HTTPS"` annotation is mandatory.

```yaml
kibanaIngress:
  enabled: true
  name: kibana-ingress
  host: pdng-dev.corp.hpicloud.net
  ingressClassName: pdng-nginx-ingress-class-cus-dev
  path: /kibana(/|$)(.*)
  pathType: ImplementationSpecific
  backend:
    service:
      name: kb-kb-http
      port:
        number: 5601
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
  # Prefer letting the existing entry point / platform handle TLS first;
  # set this to true only if this Ingress should declare its own tls secret.
  tlsEnabled: false
```

---

## 5. Helm Deployment Commands (one-liner version)

### 5.1 Change to the directory containing Chart.yaml

```bash
cd INFRASTRUCTURE/AKS/helm
```

### 5.2 Install into the elastic namespace with Helm (very important)

```bash
helm upgrade --install kibana-ingress . -n elastic -f values/dev-values.yaml
```

> Why `-n elastic` is required: the Ingress backend service `kb-kb-http` lives in `elastic`.

---

## 6. Verification (kubectl is for observation only; it does not affect Helm ownership)

### 6.1 Check that the Ingress was created

```bash
kubectl -n elastic get ingress
```

### 6.2 Check the Ingress rules / backends / events

```bash
kubectl -n elastic describe ingress kibana-ingress | sed -n '1,200p'
```

Expected:

- Path: `/kibana(/|$)(.*)`
- Backend: `kb-kb-http:5601`
- Events: only `Sync`, no errors

### 6.3 Confirm endpoints exist (no endpoints = guaranteed 502)

```bash
kubectl -n elastic get endpoints kb-kb-http
```

---

## 7. If the Ingress returns 404 on `/kibana`: verify Kibana itself via port-forward first

### 7.1 Port-forward the Service (recommended)

```bash
kubectl -n elastic port-forward svc/kb-kb-http 5601:5601
```

Open in a browser:

- `https://localhost:5601`

> Note: the self-signed TLS certificate will trigger a warning; just click "Advanced / Continue".

### 7.2 Verify with curl (fastest)

```bash
curl -kI https://localhost:5601 | head
```

Expected:

- `HTTP/1.1 200` or `302`

---

## 8. Default Kibana Login Credentials (ECK)

### 8.1 Default username

- Username: `elastic`

### 8.2 Where is the password? (auto-generated by ECK, stored in a Secret)

With the Elasticsearch resource named `es` in the `elastic` namespace, the secret is typically:

- `es-es-elastic-user`

Read the password:

```bash
kubectl -n elastic get secret es-es-elastic-user \
  -o jsonpath='{.data.elastic}' | base64 --decode
echo
```
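With the password in hand you can check authentication end to end over the port-forward without opening a browser. A hedged sketch using Kibana's `/api/status` endpoint; `PASS` holds the decoded password from the secret above:

```bash
PASS=$(kubectl -n elastic get secret es-es-elastic-user -o jsonpath='{.data.elastic}' | base64 --decode)

# Expect an HTTP 200 and a JSON status payload when the credentials are valid
curl -sk -u "elastic:${PASS}" https://localhost:5601/api/status | head -c 300; echo
```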
---

## 9. Fix the `/kibana` sub-path (basePath) settings (to avoid 404s / broken asset paths)

If you have confirmed that `https://localhost:5601` works but the Ingress path `/kibana` still returns 404 or the UI is broken, add the following to the Kibana CR:

```yaml
spec:
  config:
    server.basePath: "/kibana"
    server.rewriteBasePath: true
    server.publicBaseUrl: "https://pdng-dev.corp.hpicloud.net/kibana"
```

Apply it:

```bash
kubectl apply -f kibana.yaml
```

---

## 10. Common Troubleshooting (quick reference)

### 10.1 502 Bad Gateway

Most common causes:

- Missing `nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"`
- Empty endpoints
- Ingress not in the `elastic` namespace (cross-namespace backends fail)

Quick checks:

```bash
kubectl -n elastic get endpoints kb-kb-http
kubectl -n elastic describe ingress kibana-ingress | sed -n '1,220p'
```

### 10.2 Only port 80, no 443

- This ingress resource does not declare `tls:` (or `tlsEnabled=false`)
- The platform may still terminate TLS centrally (depends on your ingress controller / LB design)

### 10.3 `/kibana` returns 404 but port-forward works

- Most likely the Kibana basePath is not configured (see section 9).

---

## Appendix A: One-Liner Commands (frequently used)

```bash
helm upgrade --install kibana-ingress . -n elastic -f values/dev-values.yaml
kubectl -n elastic get ingress
kubectl -n elastic describe ingress kibana-ingress | sed -n '1,200p'
kubectl -n elastic get endpoints kb-kb-http
kubectl -n elastic port-forward svc/kb-kb-http 5601:5601
```

---

# PDNG ELK (ECK + Elastic Agent) Deployment & Operations Guide

> This document is a **complete deployment and troubleshooting runbook** distilled from the full troubleshooting session.
> It is written so the next engineer can **reproduce the setup from scratch** and **understand why each step exists**.

---

## Architecture Overview

- **AKS (Azure Kubernetes Service)**
- **ECK (Elastic Cloud on Kubernetes)**
  - Elasticsearch
  - Kibana
- **Elastic Agent (DaemonSet)**
  - Collects Kubernetes container logs from `/var/log/containers`
  - Collects Kubernetes & system metrics
- **Namespaces**
  - `elastic` – Elasticsearch & Kibana
  - `kube-system` – Elastic Agent (DaemonSet)
  - `pdng` – Application workloads

---

## Key Design Decisions (Important Context)

- Elastic Agent configuration **originates from Kibana (Fleet / Integration UI)**
- The generated agent policy YAML is:
  1. Copied from Kibana
  2. Adapted for **AKS DaemonSet usage**
  3. Stored in a **ConfigMap (`agent-node-datastreams`)**
- Certificates are generated by **ECK**, then **copied across namespaces**
- Logs are collected via the **filestream input**, not sidecars

---

## 0. Prerequisites

```bash
kubectl version
kubectl config current-context
kubectl auth can-i "*" "*" --all-namespaces
```

Namespaces used:

```bash
kubectl create ns elastic
kubectl create ns elastic-system
```

---

## 1. Install ECK Operator

```bash
kubectl apply -f https://download.elastic.co/downloads/eck/2.13.0/crds.yaml
kubectl apply -f https://download.elastic.co/downloads/eck/2.13.0/operator.yaml
```

Verify:

```bash
kubectl -n elastic-system get pods
```

---

## 2. Deploy Elasticsearch (ECK)

```bash
kubectl -n elastic apply -f es.yaml
kubectl -n elastic get elasticsearch
kubectl -n elastic get pods
```

### Get the elastic password

```bash
kubectl -n elastic get secret es-es-elastic-user -o jsonpath='{.data.elastic}' | base64 -d
```

---

## 3. Deploy Kibana

```bash
kubectl -n elastic apply -f kb.yaml
kubectl -n elastic get kibana
```

Port forward:

```bash
kubectl -n elastic port-forward svc/kb-kb-http 5601:5601
```

Login:

- user: `elastic`
- password: from the secret above

---

## 4. Generate Elastic Agent Policy (Kibana UI)

1. Kibana → Fleet / Integrations
2. Add the **Kubernetes integration**
3. Enable:
   - Container logs
   - Kubernetes metrics
4. Save the policy
5. Copy the generated **agent.yml**

> This YAML becomes the content of the `agent-node-datastreams` ConfigMap (a minimal wrapper sketch follows below).
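For reference, a minimal sketch of how the exported policy is typically wrapped into that ConfigMap. The `agent.yml` key name and the truncated policy body are assumptions; use the file name your DaemonSet manifest actually mounts:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: kube-system
data:
  # Paste the policy YAML exported from Kibana (Fleet / Integrations) here,
  # adapted for DaemonSet usage as described above.
  agent.yml: |
    outputs:
      default:
        type: elasticsearch
        # ...exported policy content continues here...
```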
---

## 5. Prepare Secrets

### 5.1 Copy the ES CA cert to kube-system

```bash
kubectl -n elastic get secret es-es-http-certs-public -o yaml \
  | sed 's/namespace: elastic/namespace: kube-system/' \
  | kubectl apply -f -
```

### 5.2 Create the ES credential secret

```bash
kubectl -n kube-system create secret generic es-cred \
  --from-literal=ES_USERNAME=elastic \
  --from-literal=ES_PASSWORD=<PASSWORD> \
  --dry-run=client -o yaml | kubectl apply -f -
```

---

## 6. Deploy Elastic Agent DaemonSet

```bash
kubectl -n kube-system apply -f agent-node-datastreams.yaml
kubectl -n kube-system apply -f elastic-agent-daemonset.yaml
```

Verify:

```bash
kubectl -n kube-system get ds elastic-agent
kubectl -n kube-system get pods -l app=elastic-agent -o wide
```

---

## 7. Verify Log Collection on the Node

```bash
kubectl -n kube-system exec -it <agent-pod> -c elastic-agent -- sh
```

```bash
ls /var/log/containers | grep pdng | head
```

Expected:

```
pdng-cosmosrestapi-xxxxx.log
pdngauthentication-xxxxx.log
...
```

---

## 8. Verify Logs in Elasticsearch

```
GET _cat/indices/logs-kubernetes*?v
```

Sample index:

```
.ds-logs-kubernetes.container_logs-pdng-YYYY.MM.DD-000001
```

Query example:

```json
GET .ds-logs-kubernetes.container_logs-pdng-*/_search
{
  "size": 10,
  "sort": [{ "@timestamp": "desc" }],
  "query": {
    "term": {
      "kubernetes.container.name": "pdng-cosmosrestapi"
    }
  }
}
```

---

## 9. Kibana Discover Usage

### Data View

- Use: `logs-*-*`

### KQL examples

```kql
data_stream.dataset: "kubernetes.container_logs"
```

```kql
kubernetes.namespace: "pdng"
```

```kql
kubernetes.container.name: "pdng-cosmosrestapi"
```

### Add fields in Discover

1. Left panel → Available fields
2. Search `message`, `kubernetes.pod.name`, `log.file.path`
3. Click **Add to table**

---

## 10. Common Issues & Fixes

### OOMKilled (CrashLoopBackOff)

Symptoms:

```text
Last State: Terminated
Reason:     OOMKilled
```

Fix:

- Increase memory limits
- Reduce log volume / harvesters

### Logs exist but Discover is empty

Cause:

- Missing columns
- Wrong Data View
- Wrong KQL

Always validate via the **Elasticsearch API first**.

---

## 11. Dev Environment Reset (Nuclear Option)

```bash
kubectl -n elastic delete kibana kb
kubectl -n elastic delete elasticsearch es
kubectl -n elastic delete pvc -l elasticsearch.k8s.elastic.co/cluster-name=es
```

---

## 12. Remove Logs (delete the data stream)

```bash
kubectl -n elastic exec -it es-es-default-0 -- bash -lc \
  'PASS="<ELASTIC_PASSWORD>"; \
   curl -sku "elastic:${PASS}" -X DELETE "https://localhost:9200/_data_stream/logs-kubernetes.container_logs-pdng?pretty"'
```

![image](https://hackmd.io/_uploads/BJ_Qg22rZl.png)
![image](https://hackmd.io/_uploads/HJeVx2hB-x.png)
![image](https://hackmd.io/_uploads/BkdEg33Hbx.png)

## 13. Fleet Server Token Rotation

```bash
# 1. Create the new token
bin/elasticsearch-service-tokens create elastic/fleet-server fleet-server-rotated

# 2. Update the K8s secret
kubectl -n elastic create secret generic fleet-server-service-token \
  --from-literal=token="NEW_TOKEN" \
  --dry-run=client -o yaml | kubectl apply -f -

# 3. Roll the Fleet Server
kubectl -n elastic rollout restart deploy/fleet-server
```
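After rotating, it is worth confirming that the rollout completed and that the secret actually carries the new token. A quick verification sketch, assuming the deployment and secret names used above:

```bash
# Wait for the Fleet Server rollout to finish
kubectl -n elastic rollout status deploy/fleet-server

# Confirm the secret now holds the rotated token (print only a prefix, not the full value)
kubectl -n elastic get secret fleet-server-service-token \
  -o jsonpath='{.data.token}' | base64 --decode | head -c 20; echo
```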
```mermaid
flowchart TB
  subgraph Users[Users / Ops]
    K[Kibana]
  end

  subgraph Ingress[Ingress / LB]
    IG[Ingress / LB]
  end

  subgraph ElasticStack[Elastic Stack]
    subgraph ControlPlane[Control Plane]
      FS[Fleet Server]
      KBN[Kibana]
    end
    subgraph HotTier[Hot Tier]
      EH[Elasticsearch Hot Nodes]
    end
    subgraph WarmTier[Warm Tier]
      EW[Elasticsearch Warm Nodes]
    end
    subgraph ColdTier[Cold Tier]
      EC[Elasticsearch Cold Nodes]
    end
  end

  subgraph Snapshot[Snapshot Storage]
    S3[(S3 / Azure Blob / GCS)]
  end

  subgraph K8s[Kubernetes Cluster]
    EA[Elastic Agent DaemonSet]
    KP[Kubelet / Metrics]
    LOGS[Container Logs]
  end

  %% User access
  Users --> IG --> KBN

  %% Fleet control
  KBN --> FS
  FS --> EA

  %% Data ingest
  EA --> EH
  KP --> EA
  LOGS --> EA

  %% ILM flow
  EH -->|ILM rollover| EW
  EW -->|ILM migrate| EC

  %% Searchable snapshots
  EC -->|snapshot| S3
  S3 -->|mount searchable snapshot| EC

  %% Queries
  KBN --> EH
  KBN --> EW
  KBN --> EC
```

**Fleet Server / Kibana**

- Kibana acts as the control-plane UI
- Fleet Server pushes policies down to the Elastic Agents (DaemonSet)

**Elastic Agent (K8s)** collects:

- Container logs
- Kubelet metrics

and ships them to the Hot tier.

**Hot → Warm → Cold**

- ILM controls rollover / migration
- Hot: writes + heavy queries
- Warm: fewer queries
- Cold: rarely queried, lowest cost

**Searchable Snapshots**

- Cold-tier indices are snapshotted to object storage (S3 / Azure Blob)
- At query time the snapshot is mounted; no full restore is needed

**Query path**

- Kibana can query Hot / Warm / Cold directly (Cold hits the searchable snapshot)

# Fix an Incomplete Fleet Setup (only part of Fleet was visible)

```bash
PASS=$(kubectl -n elastic get secret es-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
echo $PASS

KB_POD=$(kubectl -n elastic get pod -l kibana.k8s.elastic.co/name=kb -o jsonpath='{.items[0].metadata.name}')

# Re-run Fleet setup
kubectl -n elastic exec -it $KB_POD -- bash -lc \
  'curl -sk -u "elastic:'"$PASS"'" -H "kbn-xsrf: true" https://localhost:5601/api/fleet/setup | head -c 2000; echo'

# Create the Fleet Server agent policy
kubectl -n elastic exec -it $KB_POD -- bash -lc \
  'curl -sk -u "elastic:'"$PASS"'" \
   -H "kbn-xsrf: true" -H "Content-Type: application/json" \
   -X POST https://localhost:5601/api/fleet/agent_policies \
   -d '"'"'{
     "name": "fleet-server-policy",
     "namespace": "default",
     "description": "Fleet Server policy",
     "has_fleet_server": true,
     "monitoring_enabled": ["logs","metrics"]
   }'"'"' | head -c 1200; echo'

# Create the K8s agent policy
kubectl -n elastic exec -it $KB_POD -- bash -lc \
  'curl -sk -u "elastic:'"$PASS"'" \
   -H "kbn-xsrf: true" -H "Content-Type: application/json" \
   -X POST https://localhost:5601/api/fleet/agent_policies \
   -d '"'"'{
     "name": "k8s-pdng-logs",
     "namespace": "pdng",
     "description": "Collect Kubernetes logs/metrics for pdng namespace",
     "monitoring_enabled": ["logs","metrics"]
   }'"'"' | head -c 1200; echo'

# Install the kubernetes integration package
kubectl -n elastic exec -it $KB_POD -- bash -lc \
  'curl -sk -u "elastic:'"$PASS"'" \
   -H "kbn-xsrf: true" -H "Content-Type: application/json" \
   -X POST "https://localhost:5601/api/fleet/epm/packages/kubernetes" \
   -d '"'"'{"force":true}'"'"' | head -c 1200; echo'

# Install the system integration package
kubectl -n elastic exec -it $KB_POD -- bash -lc \
  'curl -sk -u "elastic:'"$PASS"'" \
   -H "kbn-xsrf: true" -H "Content-Type: application/json" \
   -X POST "https://localhost:5601/api/fleet/epm/packages/system" \
   -d '"'"'{"force":true}'"'"' | head -c 1200; echo'
```

# Data Retention Safety Net

```bash
curl -sk -u "elastic:${PASS}" -H 'Content-Type: application/json' \
  -X PUT "https://es-es-http.elastic.svc:9200/_data_stream/logs-kubernetes.container_logs-pdng/_lifecycle" \
  -d '{ "data_retention": "6h" }'

curl -sk -u "elastic:${PASS}" -H 'Content-Type: application/json' \
  -X PUT "https://es-es-http.elastic.svc:9200/_data_stream/metrics-kubernetes.container_logs-pdng/_lifecycle" \
  -d '{ "data_retention": "6h" }'

curl -sk -u "elastic:${PASS}" -H 'Content-Type: application/json' \
  -X PUT "https://es-es-http.elastic.svc:9200/_cluster/settings" \
  -d '{
    "persistent": {
      "cluster.routing.allocation.disk.watermark.low": "75%",
      "cluster.routing.allocation.disk.watermark.high": "85%",
      "cluster.routing.allocation.disk.watermark.flood_stage": "90%"
    }
  }'
```
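To confirm the retention and watermark changes took effect, the corresponding GET endpoints can be queried. A verification sketch using the same `PASS` variable and in-cluster service URL as above:

```bash
# Show the effective lifecycle (data_retention) on the logs data stream
curl -sk -u "elastic:${PASS}" \
  "https://es-es-http.elastic.svc:9200/_data_stream/logs-kubernetes.container_logs-pdng/_lifecycle?pretty"

# Show the persistent cluster settings, including the disk watermarks just applied
curl -sk -u "elastic:${PASS}" \
  "https://es-es-http.elastic.svc:9200/_cluster/settings?pretty&flat_settings=true"
```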
# Elastic Fleet + Kubernetes (PDNG) Setup – Full Troubleshooting & Resolution Guide

> This document captures the actual troubleshooting and the final working configuration for the **Elastic Stack (ECK) + Fleet Server + Kubernetes Elastic Agent (PDNG namespace)** setup. It can serve as the standard README for **rebuilds / GitOps / onboarding**.

---

## Goal

* Deploy on AKS / Kubernetes:
  * Elasticsearch (ECK)
  * Kibana (ECK)
  * Fleet Server (Elastic Agent – Deployment)
  * Elastic Agent (DaemonSet)
* Use **Fleet mode** to collect:
  * **Container logs** from the `pdng` namespace
  * Kubernetes **metrics / events**
* Ensure the data is queryable in **Discover / Dashboards**

---

## 1. Final Architecture

```
Kibana (ECK)
 └── Fleet
      ├── Fleet Server (Agent / Deployment)
      └── Elastic Agent (DaemonSet)
           ├── logs-kubernetes.container_logs
           ├── metrics-kubernetes.*
           └── namespace = pdng
```

---

## 2. Prerequisites

* Kubernetes cluster (AKS)
* ECK Operator installed
* Namespace: `elastic`
* Ability to `kubectl exec` into the Kibana / ES Pods

---

## 3. Deployment Order (**very important**)

### 1️⃣ Elasticsearch (ECK)

```bash
kubectl -n elastic apply -f elasticsearch.yaml
```

Confirm:

```bash
kubectl -n elastic get es
```

---

### 2️⃣ Kibana (ECK)

**Note: `xpack.fleet.fleetServerHosts` must be an array of strings**

```yaml
xpack.fleet.enabled: true
xpack.fleet.agents.enabled: true
xpack.fleet.fleetServerHosts:
  - https://fleet-server-agent-http.elastic.svc:8220
```

After deploying, confirm Kibana is healthy:

```bash
kubectl -n elastic get pod -l kibana.k8s.elastic.co/name=kb
```

---

### 3️⃣ Initialize Fleet

```bash
PASS=$(kubectl -n elastic get secret es-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
KB_POD=$(kubectl -n elastic get pod -l kibana.k8s.elastic.co/name=kb -o jsonpath='{.items[0].metadata.name}')

kubectl -n elastic exec -it $KB_POD -- bash -lc \
  'curl -sk -u "elastic:'"$PASS"'" -H "kbn-xsrf: true" -X POST https://localhost:5601/api/fleet/setup'
```

Expected result:

```json
{"isInitialized": true}
```

---

## 4. Fleet Server

### 4️⃣ Create the Fleet Server Policy

```bash
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
  -X POST https://localhost:5601/api/fleet/agent_policies \
  -H "Content-Type: application/json" \
  -d '{
    "name": "fleet-server-policy",
    "namespace": "default",
    "has_fleet_server": true,
    "monitoring_enabled": ["logs","metrics"]
  }'
```

---

### 5️⃣ Deploy Fleet Server (Agent – Deployment)

```yaml
spec:
  mode: fleet
  fleetServerEnabled: true
  policyID: <FLEET_SERVER_POLICY_ID>
```

Confirm:

```bash
kubectl -n elastic get agent fleet-server
```

The status should be `green`.

---

## 5. PDNG Elastic Agent (DaemonSet)

### 6️⃣ Create the PDNG Agent Policy

```bash
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
  -X POST https://localhost:5601/api/fleet/agent_policies \
  -H "Content-Type: application/json" \
  -d '{
    "name": "k8s-pdng-logs",
    "namespace": "pdng",
    "monitoring_enabled": ["logs","metrics"]
  }'
```

---

### 7️⃣ Install Integration Packages

```bash
# Kubernetes
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
  -X POST https://localhost:5601/api/fleet/epm/packages/kubernetes \
  -H "Content-Type: application/json" -d '{"force":true}'

# System
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
  -X POST https://localhost:5601/api/fleet/epm/packages/system \
  -H "Content-Type: application/json" -d '{"force":true}'
```
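Before moving on, it is useful to confirm the packages were actually installed. A hedged check against the Fleet packages API; the exact response shape may vary by stack version:

```bash
# Expect "status": "installed" in the response for both packages
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
  "https://localhost:5601/api/fleet/epm/packages/kubernetes" | head -c 400; echo
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
  "https://localhost:5601/api/fleet/epm/packages/system" | head -c 400; echo
```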
---

### 8️⃣ Create the Kubernetes Package Policy (**critical**)

⚠️ The Fleet API **does not accept simplified inputs**; the full vars are required.

👉 **Practical recommendation: create it through the Kibana UI the first time, then export the JSON.**

Minimal working logic:

* `filestream` → container logs
* `kubernetes/metrics` → state / events

(Use the JSON exported from the UI as the source of truth.)

---

### 9️⃣ Deploy the Elastic Agent (DaemonSet)

```yaml
spec:
  mode: fleet
  policyID: <PDNG_POLICY_ID>
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        hostNetwork: true
```

Confirm:

```bash
kubectl -n elastic get agent
```

```text
elastic-agent-pdng   green   3 / 3
```

---

## 6. Verification

### Agent Policies

```bash
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
  https://localhost:5601/api/fleet/agent_policies | jq '.items[] | {name,namespace,agents}'
```

### Indices

```bash
curl -sku "elastic:$PASS" https://localhost:9200/_cat/indices/logs-kubernetes*?v
curl -sku "elastic:$PASS" https://localhost:9200/_cat/indices/metrics-kubernetes*?v
```

### Discover

* Index patterns: `logs-*`, `metrics-*`
* Filter: `kubernetes.namespace : pdng`

---

## 7. Common Errors Summary

| Error                                 | Cause                                                         |
| ------------------------------------- | ------------------------------------------------------------- |
| `fleetServerHosts parse error`        | Wrong structure in the Kibana config                           |
| `Unsupported saved object type`       | Fleet objects cannot be removed by deleting ES indices directly |
| `FLEET_URL is required`               | Agent is not a Fleet Server but no Fleet URL was specified      |
| `inputs.undefined.vars.* is required` | Package policy created via the API without the required vars   |

---

## 8. Best Practices (key points)

✅ Always create the Fleet integration **through the UI the first time**
✅ Then codify it via the API / GitOps
❌ Do not try to hand-write minimal inputs (it will fail validation)

---

## Done

🎉 At this point:

* Fleet Server: `green`
* Elastic Agent (pdng): `green`
* pdng logs & metrics are visible in Discover / Dashboards

---

*This README was generated from a real-world production troubleshooting session.*

# Elastic Agent + Fleet on AKS — Kubernetes Logs & Metrics Troubleshooting Guide

> **Status: ✅ Resolved**
>
> This document summarizes the full end-to-end troubleshooting process that led to successfully collecting
> `logs-kubernetes*` and `metrics-kubernetes*` data streams using **Elastic Agent (Fleet mode)** on **AKS**.

---

## 🧩 Problem Statement

Even though:

- Elastic Agent was **HEALTHY**
- Fleet showed the Kubernetes integration **enabled**
- The Elasticsearch output was reachable
- No fatal errors appeared in the Kibana UI

❌ **No `logs-kubernetes*` or `metrics-kubernetes*` data streams were created**
❌ The `_data_stream` API returned nothing
❌ Discover showed no Kubernetes data

---

## 🧠 Root Causes (Multiple, Layered)

This issue was **not a single misconfiguration**, but a combination of Fleet, Kubernetes, and RBAC behaviors.

### 1. Fleet Integration UI ≠ Active Policy Revision

- Even if all toggles are ON, Fleet does **NOT re-render inputs** unless:
  - The integration is explicitly **Saved**
  - A new **policy revision** is generated

➡️ Result: the agent was running, but **no Kubernetes inputs were actually applied**

---

### 2. Missing ServiceAccount Token in Agent Pods

Errors observed:

```
reading bearer token file: /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
```

Cause:

- `automountServiceAccountToken` was disabled (the default in some hardened setups)

Impact:

- Kubernetes metrics inputs failed immediately
- No data streams were created
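Both symptoms are easy to confirm from the outside before patching anything. A check sketch; the DaemonSet name matches the one restarted in Step 1 below, and `<agent-pod>` is a placeholder:

```bash
# Is the ServiceAccount token auto-mounted in the agent pod template? (empty or "false" means no)
kubectl -n elastic get ds elastic-agent-pdng-agent \
  -o jsonpath='{.spec.template.spec.automountServiceAccountToken}'; echo

# Does the token file actually exist inside a running agent pod?
kubectl -n elastic exec -it <agent-pod> -- \
  ls /var/run/secrets/kubernetes.io/serviceaccount/
```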
---

### 3. Insufficient RBAC for Cluster-Scoped Resources

Errors observed:

```
cannot list resource "nodes"
cannot get resource "leases.coordination.k8s.io"
```

Cause:

- Elastic Agent requires **cluster-level RBAC**
- Especially for:
  - `nodes`
  - `leases` (leader election)

---

## ✅ Final Working Architecture

```
Elastic Agent (DaemonSet, Fleet mode)
 ├─ ServiceAccount token mounted
 ├─ ClusterRole + ClusterRoleBinding
 ├─ Kubernetes integration saved (policy revision bumped)
 └─ Data streams auto-created:
     - logs-kubernetes.container_logs-pdng
     - metrics-kubernetes.node-pdng
     - metrics-kubernetes.pod-pdng
     - metrics-kubernetes.state_*
```

---

## 🛠️ Step-by-Step Fix (Authoritative)

### ✅ Step 1 — Fix the ServiceAccount Token Mount

```bash
kubectl -n elastic patch agent elastic-agent-pdng --type=merge -p '{
  "spec": {
    "daemonSet": {
      "podTemplate": {
        "spec": {
          "automountServiceAccountToken": true
        }
      }
    }
  }
}'

kubectl -n elastic patch sa elastic-agent --type=merge -p '{
  "automountServiceAccountToken": true
}'
```

Restart the agent:

```bash
kubectl -n elastic rollout restart ds/elastic-agent-pdng-agent
```

---

### ✅ Step 2 — Verify RBAC (Must Be YES)

```bash
kubectl auth can-i list nodes --as=system:serviceaccount:elastic:elastic-agent
kubectl auth can-i get leases.coordination.k8s.io -n elastic --as=system:serviceaccount:elastic:elastic-agent
```

If the answer is **no**, fix the ClusterRole / ClusterRoleBinding.

---

### ✅ Step 3 — **Critical Step: Re-Save the Fleet Integration**

> Even if everything looks enabled — **SAVE IT AGAIN**

In Kibana:

```
Fleet → Agent Policies → Kubernetes integration → Save integration
```

Why:

- Forces a new policy revision
- Re-renders the inputs
- Triggers an agent reload

⚠️ This step is **mandatory**.

---

### ✅ Step 4 — Verify the Agent Runtime Config

```bash
kubectl -n elastic exec -it <agent-pod> -- sh -lc '
  elastic-agent status
  elastic-agent inspect components --show-config | egrep "dataset: kubernetes|data_stream" | head -n 50
'
```

Expected:

```
dataset: kubernetes.container_logs
dataset: kubernetes.node
dataset: kubernetes.pod
```

---

### ✅ Step 5 — Confirm the Data Streams Exist

```bash
curl -sk -u "elastic:${PASS}" "https://es-es-http.elastic.svc:9200/_data_stream" \
  | jq -r '.data_streams[].name' \
  | egrep "logs-kubernetes|metrics-kubernetes"
```

✅ If present → the pipeline is fully functional.

---

## 🔍 Validation Signals (Green Flags)

- Discover shows Kubernetes logs
- `_data_stream` lists logs/metrics
- Agent logs show `acked` events
- No more `403`, `forbidden`, or `no configuration provided`

---

## ❗ Common Pitfalls (Avoid These)

| Mistake                       | Result                 |
| ----------------------------- | ---------------------- |
| Not re-saving the integration | No inputs applied      |
| No SA token                   | Metrics always fail    |
| Namespace-only RBAC           | Node metrics forbidden |
| Assuming UI = runtime         | Silent failure         |

---

## 🏁 Final Verdict

✅ Configuration: **Correct**
✅ RBAC: **Correct**
✅ Fleet Policy: **Applied**
✅ Elastic Agent: **Healthy & Publishing**

🎉 **Kubernetes logs and metrics are fully operational**

---

## 📎 Notes

- Tested on:
  - Elastic Stack 8.13.4
  - AKS
  - ECK + Fleet Server
- Applies to:
  - Logs
  - Metrics
  - Future Kubernetes integrations

---

**Author:** Troubleshooting session distilled from live production debugging
**Status:** Battle-tested ✔️