# ECK on AKS – Deployment & Troubleshooting Guide
> This README summarizes the **end‑to‑end setup, validation, and troubleshooting** for deploying **Elastic Cloud on Kubernetes (ECK)** with **Elasticsearch + Kibana** on **AKS**, based on the current implementation and issues resolved so far.
---
## 1. Architecture Overview
### Components
| Component | Namespace | Deployment Method | Purpose |
| -------------- | ------------------- | ----------------- | ---------------------------------- |
| ECK Operator | `elastic-system` | Helm | Manages lifecycle of Elastic Stack |
| Elasticsearch | `elastic` | ECK CRD (YAML) | Search / storage engine |
| Kibana | `elastic` | ECK CRD (YAML) | UI / visualization |
| Ingress (PDNG) | `pdng` or `default` | YAML / Kustomize | API routing |
### Key Design Decisions
* **Helm is used only for the ECK Operator**
* **Elasticsearch / Kibana are managed via CRDs (not Helm)**
* **Kustomize** is used to separate `base` and `env (dev/prod)`
* **Elasticsearch is NOT exposed publicly**
* **Kibana should use a dedicated host (recommended)**
---
## 2. Repository Structure
```
AKS/
├─ helm/
│  └─ eck-operator/
│     ├─ Chart.yaml
│     └─ values.yaml
│
└─ yaml/
   ├─ base/
   │  ├─ kustomization.yaml
   │  ├─ elastic/
   │  │  ├─ namespace.yaml
   │  │  ├─ elasticsearch.yaml
   │  │  └─ kibana.yaml
   │  ├─ pdng-ingress.yaml
   │  ├─ pdng-ingress-agw.yaml
   │  └─ other-secrets.yaml
   │
   ├─ dev/
   │  └─ kustomization.yaml
   └─ prod/
      └─ kustomization.yaml
```
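For reference, a minimal pair of Kustomize files matching this layout could look like the sketch below (file contents are assumptions inferred from the tree above, not copied from the repo):

```yaml
# yaml/base/kustomization.yaml — sketch; adjust the resource list to the repo
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - elastic/namespace.yaml
  - elastic/elasticsearch.yaml
  - elastic/kibana.yaml
  - pdng-ingress.yaml
```

```yaml
# yaml/dev/kustomization.yaml — sketch; the dev overlay just points at base
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base
```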
---
## 3. Step‑by‑Step Deployment
### 3.1 Install ECK Operator (Helm)
```bash
helm repo add elastic https://helm.elastic.co
helm repo update
cd AKS/helm/eck-operator
helm dependency build
helm upgrade --install eck-operator . \
-n elastic-system --create-namespace
```
### Verify Operator
```bash
kubectl -n elastic-system get pods
kubectl get crd | grep elastic   # PowerShell: kubectl get crd | Select-String elastic
```
Expected CRDs:
* `elasticsearches.elasticsearch.k8s.elastic.co`
* `kibanas.kibana.k8s.elastic.co`
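If the CRDs are missing, the operator logs usually explain why (a quick sanity check; the Helm chart names the StatefulSet `elastic-operator` by default, so adjust if your release overrides it):

```bash
kubectl -n elastic-system logs statefulset/elastic-operator --tail=50
```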
---
### 3.2 Deploy Elastic Stack (Base)
```bash
kubectl apply -k AKS/yaml/base   # or an env overlay, e.g. kubectl apply -k AKS/yaml/dev
```
This creates:
* `Namespace elastic`
* `Elasticsearch es`
* `Kibana kb`
---
## 4. Validation Checklist
### 4.1 Elasticsearch
```bash
kubectl -n elastic get elasticsearch
```
Expected:
```
NAME   HEALTH   NODES   VERSION   PHASE
es     green    1       8.13.x    Ready
```
### 4.2 Kibana
```bash
kubectl -n elastic get kibana
kubectl -n elastic get pods
```
Expected:
* Pod: `Running 1/1`
* Health: `green`
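It also helps to list the Services ECK generated, since later steps reference them by name (ECK derives `es-es-http` and `kb-kb-http` from the CR names `es` and `kb`):

```bash
kubectl -n elastic get svc
# expect es-es-http (9200) and kb-kb-http (5601) among the ECK-managed services
```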
---
## 5. Kibana Readiness Issue (Root Cause & Fix)
### ❌ Observed Issue
* Kibana pod stuck at `Ready: False`
* `HEALTH = red`
* Readiness probe fails on:
```
https://<pod-ip>:5601/login → 404
```
### 🔍 Root Cause
Kibana was configured with:
```yaml
server.basePath: /kibana
server.rewriteBasePath: true
```
But **ECK readiness probe is hardcoded to `/login`**, not `/kibana/login`.
Result:
* Kibana is actually running
* Readiness probe never succeeds
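You can reproduce what the probe sees with a port-forward (a quick sketch; with the basePath config in place, expect `/login` → 404 and `/kibana/login` → 200):

```bash
kubectl -n elastic port-forward svc/kb-kb-http 5601:5601 &
curl -k -s -o /dev/null -w '%{http_code}\n' https://localhost:5601/login          # 404 while basePath is set
curl -k -s -o /dev/null -w '%{http_code}\n' https://localhost:5601/kibana/login   # 200
```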
---
## 6. Recommended Fix (Production‑Grade)
### ✅ Use Dedicated Kibana Host (NO basePath)
#### Kibana CRD (Fixed)
```yaml
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kb
  namespace: elastic
spec:
  version: 8.13.4
  count: 1
  elasticsearchRef:
    name: es
```
#### Kibana Ingress (Dedicated Host)
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kibana-ingress
  namespace: elastic
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  ingressClassName: pdng-nginx-ingress-class-cus-dev
  rules:
    - host: kibana-dev.corp.hpicloud.net
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kb-kb-http
                port:
                  number: 5601
```
### Result
* Readiness probe `/login` → ✅ 200
* Kibana Pod → `Ready`
* Kibana Health → `green`
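Once DNS for the dedicated host resolves to the ingress controller, a quick external check (host taken from the Ingress above):

```bash
curl -kI https://kibana-dev.corp.hpicloud.net/login   # expect HTTP 200 (or a 302 to /login)
```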
---
## 7. Access Credentials
### Elasticsearch / Kibana Default User
```bash
kubectl -n elastic get secret es-es-elastic-user -o jsonpath='{.data.elastic}' | base64 --decode
```
User:
```
elastic
```
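With the password in a shell variable you can query Elasticsearch directly (a sketch via port-forward; `es-es-http` is the HTTP Service ECK creates for the `es` cluster):

```bash
kubectl -n elastic port-forward svc/es-es-http 9200:9200 &
PASS=$(kubectl -n elastic get secret es-es-elastic-user -o jsonpath='{.data.elastic}' | base64 --decode)
curl -sk -u "elastic:${PASS}" https://localhost:9200/_cluster/health?pretty
```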
---
## 8. Key Lessons Learned
* ❌ Do NOT use path‑based ingress (`/kibana`) with ECK Kibana
* ✅ Use **dedicated host** for Kibana
* ✅ Keep Operator (Helm) and Stack (CRD) responsibilities separated
* ✅ Always include Elastic resources in `base/kustomization.yaml`
* ⚠️ Avoid deploying duplicate ingress rules across namespaces
---
## 9. Current Status
| Component | Status |
| ------------- | ------------------------- |
| ECK Operator | ✅ Running |
| Elasticsearch | ✅ Green |
| Kibana | ⚠️ Needs basePath removal |
| Ingress | ⚠️ Needs consolidation |
---
# Kibana (ECK) + NGINX Ingress (Helm) Deployment Notes (Copy‑Paste Ready)
> Timezone: Asia/Taipei
> Example environment: AKS + ECK (Elasticsearch/Kibana 8.13.4)
> Goal: expose Kibana in the `elastic` namespace at `https://pdng-dev.corp.hpicloud.net/kibana` via **NGINX Ingress**
> Note: this document captures the steps and commands we actually ran and verified.
---
## 0. Prerequisites & Key Concepts
### 0.1 An Ingress cannot point at a Service in another namespace
- The Kibana Service is `kb-kb-http`, in the `elastic` namespace
- So the Kibana Ingress **must** be created in the `elastic` namespace (otherwise you get a guaranteed 502 / "backend not found").
### 0.2 Kibana Service port 5601 is HTTPS (ECK self‑signed TLS)
The Service fragment confirmed earlier:
- `ports[].name: https`
- `port: 5601`
The NGINX Ingress therefore needs:
- `nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"`
### 0.3 The `/kibana` sub‑path commonly causes 404s / broken asset paths
If the Ingress forwards correctly but you still see 404s or broken page assets, the usual cause is that **Kibana has no basePath configured**:
- `server.basePath: "/kibana"`
- `server.rewriteBasePath: true`
- `server.publicBaseUrl: "https://<host>/kibana"`
Recommendation: first confirm Kibana itself is healthy via port-forward, then deal with basePath.
---
## 1. Quick Inventory of the Current State (ingress classes / namespaces)
### 1.1 List the existing Ingresses (with namespaces)
```bash
kubectl get ingress --all-namespaces | grep pdng-ingress
```
Key findings (example):
- `pdng` namespace: `pdng-ingress` (nginx class) & `pdng-ingress-agw` (AGW)
- `elastic` namespace: Kibana / Elasticsearch (ECK CRDs)
---
## 2. Kibana / Elasticsearch (ECK) CRDs (current state)
### 2.1 Kibana (ECK)
```yaml
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kb
  namespace: elastic
spec:
  version: 8.13.4
  count: 1
  elasticsearchRef:
    name: es
  http:
    tls:
      selfSignedCertificate:
        disabled: false
```
### 2.2 Elasticsearch (ECK)
```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es
  namespace: elastic
spec:
  version: 8.13.4
  http:
    tls:
      selfSignedCertificate:
        disabled: false
  nodeSets:
    - name: default
      count: 1
      config:
        node.store.allow_mmap: false
```
---
## 3. Confirm the Kibana Service (critical: Service name + HTTPS)
```bash
kubectl -n elastic get svc kb-kb-http -o yaml | sed -n '1,120p'
```
Confirmed:
- Service name: `kb-kb-http`
- Port: `5601`
- `ports[].name: https` → the upstream is HTTPS
---
## 4. Generate the Kibana Ingress with Helm (core step)
> The repo currently uses a single chart (`Chart.yaml` sits in `INFRASTRUCTURE/AKS/helm/`),
> with `kibana-ingress.yaml` added under `templates/`.
> The chart path is therefore `.` (the same directory as `Chart.yaml`).
### 4.1 Template: kibana-ingress.yaml (wrapped in an `enabled` flag)
Wrap the entire resource:
```yaml
{{- if .Values.kibanaIngress.enabled }}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ .Values.kibanaIngress.name | default "kibana-ingress" }}
  {{- if .Values.kibanaIngress.annotations }}
  annotations:
    {{- toYaml .Values.kibanaIngress.annotations | nindent 4 }}
  {{- end }}
spec:
  {{- if .Values.kibanaIngress.ingressClassName }}
  ingressClassName: {{ .Values.kibanaIngress.ingressClassName }}
  {{- end }}
  rules:
    - {{- if .Values.kibanaIngress.host }}
      host: {{ .Values.kibanaIngress.host }}
      {{- end }}
      http:
        paths:
          - path: {{ .Values.kibanaIngress.path | default "/kibana(/|$)(.*)" }}
            pathType: {{ .Values.kibanaIngress.pathType | default "ImplementationSpecific" }}
            backend:
              service:
                name: {{ .Values.kibanaIngress.backend.service.name | default "kb-kb-http" }}
                port:
                  number: {{ .Values.kibanaIngress.backend.service.port.number | default 5601 }}
  {{- if .Values.kibanaIngress.tlsEnabled }}
  tls:
    - hosts:
        - {{ .Values.kibanaIngress.host }}
      secretName: {{ .Values.kibanaIngress.tlsSecretName | default "kibana-tls" }}
  {{- end }}
{{- end }}
```
### 4.2 values/dev-values.yaml: add the kibanaIngress block
> Key point: the backend is `kb-kb-http` and the upstream is HTTPS, so `backend-protocol: "HTTPS"` is mandatory.
```yaml
kibanaIngress:
  enabled: true
  name: kibana-ingress
  host: pdng-dev.corp.hpicloud.net
  ingressClassName: pdng-nginx-ingress-class-cus-dev
  path: /kibana(/|$)(.*)
  pathType: ImplementationSpecific
  backend:
    service:
      name: kb-kb-http
      port:
        number: 5601
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
  # Let the existing entry point / platform terminate TLS first; set this to true
  # only if this Ingress should declare its own tls secret
  tlsEnabled: false
```
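Before installing, you can render the template locally to confirm the values wire up as expected (assumes the template file lives at `templates/kibana-ingress.yaml`, per Section 4.1):

```bash
helm template kibana-ingress . -f values/dev-values.yaml -s templates/kibana-ingress.yaml
```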
---
## 5. Helm Deployment Commands (one-liners)
### 5.1 Change into the directory containing Chart.yaml
```bash
cd INFRASTRUCTURE/AKS/helm
```
### 5.2 Install into the elastic namespace with Helm (critical)
```bash
helm upgrade --install kibana-ingress . -n elastic -f values/dev-values.yaml
```
> Why `-n elastic` is mandatory: the Ingress backend Service `kb-kb-http` lives in `elastic`.
---
## 6. Verification (kubectl is used read-only here; it does not interfere with Helm)
### 6.1 Check the Ingress was created
```bash
kubectl -n elastic get ingress
```
### 6.2 Check the Ingress rules / backends / events
```bash
kubectl -n elastic describe ingress kibana-ingress | sed -n '1,200p'
```
Expect to see:
- Path: `/kibana(/|$)(.*)`
- Backend: `kb-kb-http:5601`
- Events: only Sync, no Error
### 6.3 Confirm the endpoints exist (no endpoints = guaranteed 502)
```bash
kubectl -n elastic get endpoints kb-kb-http
```
---
## 7. If the Ingress returns 404 on `/kibana`: confirm Kibana itself via port forwarding first
### 7.1 Port-forward the Service (recommended)
```bash
kubectl -n elastic port-forward svc/kb-kb-http 5601:5601
```
Open in a browser:
- `https://localhost:5601`
> Note: the self-signed TLS certificate triggers a browser warning; choose "Advanced / Continue".
### 7.2 Verify with curl (fastest)
```bash
curl -kI https://localhost:5601 | head
```
Expected:
- `HTTP/1.1 200` or `302`
---
## 8. Default Kibana Login Credentials (ECK)
### 8.1 Default username
- Username: `elastic`
### 8.2 Where is the password? (auto-generated by ECK, stored in a Secret)
With the Elasticsearch named `es` in the `elastic` namespace, the secret is typically:
- `es-es-elastic-user`
Read the password:
```bash
kubectl -n elastic get secret es-es-elastic-user \
  -o jsonpath='{.data.elastic}' | base64 --decode
echo
```
---
## 9. Recommended basePath settings for the `/kibana` sub-path (avoids 404s / broken assets)
If `https://localhost:5601` works but the Ingress path `/kibana` still returns 404 or the UI is broken, add this to the Kibana CR:
```yaml
spec:
  config:
    server.basePath: "/kibana"
    server.rewriteBasePath: true
    server.publicBaseUrl: "https://pdng-dev.corp.hpicloud.net/kibana"
```
Apply:
```bash
kubectl apply -f kibana.yaml
```
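After the Kibana pod restarts, verify the sub-path through the Ingress (host/path from Section 4):

```bash
curl -kI https://pdng-dev.corp.hpicloud.net/kibana/login   # expect 200/302 instead of 404
```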
---
## 10. Common Troubleshooting (Quick Reference)
### 10.1 502 Bad Gateway
Most common causes:
- Missing `nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"`
- Empty endpoints
- Ingress not in the `elastic` namespace (cross-namespace backends fail)
Quick checks:
```bash
kubectl -n elastic get endpoints kb-kb-http
kubectl -n elastic describe ingress kibana-ingress | sed -n '1,220p'
```
### 10.2 Only port 80, no 443
- This ingress resource declares no `tls:` (or `tlsEnabled=false`)
- The platform may still terminate TLS centrally (depends on your ingress controller / LB design)
### 10.3 `/kibana` returns 404 but port-forward works
- Most likely the Kibana basePath is not configured (see Section 9).
---
## Appendix A: One-Liners (Frequently Used)
```bash
helm upgrade --install kibana-ingress . -n elastic -f values/dev-values.yaml
kubectl -n elastic get ingress
kubectl -n elastic describe ingress kibana-ingress | sed -n '1,200p'
kubectl -n elastic get endpoints kb-kb-http
kubectl -n elastic port-forward svc/kb-kb-http 5601:5601
```
---
# PDNG ELK (ECK + Elastic Agent) Deployment & Operations Guide
> This document is a **complete deployment and troubleshooting runbook** distilled from the full troubleshooting session.
> It is written so the next engineer can **reproduce the setup from scratch** and **understand why each step exists**.
---
## Architecture Overview
- **AKS (Azure Kubernetes Service)**
- **ECK (Elastic Cloud on Kubernetes)**
- Elasticsearch
- Kibana
- **Elastic Agent (DaemonSet)**
- Collects Kubernetes container logs from `/var/log/containers`
- Collects Kubernetes & system metrics
- **Namespaces**
- `elastic` – Elasticsearch & Kibana
- `kube-system` – Elastic Agent (DaemonSet)
- `pdng` – Application workloads
---
## Key Design Decisions (Important Context)
- Elastic Agent configuration **originates from Kibana (Fleet / Integration UI)**
- The generated agent policy YAML is:
1. Copied from Kibana
2. Adapted for **AKS DaemonSet usage**
3. Stored in a **ConfigMap (`agent-node-datastreams`)**
- Certificates are generated by **ECK**, then **copied across namespaces**
- Logs are collected via **filestream input**, not sidecars
---
## 0. Prerequisites
```bash
kubectl version
kubectl config current-context
kubectl auth can-i "*" "*" --all-namespaces
```
Namespaces used:
```bash
kubectl create ns elastic
kubectl create ns elastic-system
```
---
## 1. Install ECK Operator
```bash
kubectl apply -f https://download.elastic.co/downloads/eck/2.13.0/crds.yaml
kubectl apply -f https://download.elastic.co/downloads/eck/2.13.0/operator.yaml
```
Verify:
```bash
kubectl -n elastic-system get pods
```
---
## 2. Deploy Elasticsearch (ECK)
```bash
kubectl -n elastic apply -f es.yaml
kubectl -n elastic get elasticsearch
kubectl -n elastic get pods
```
### Get elastic password
```bash
kubectl -n elastic get secret es-es-elastic-user -o jsonpath='{.data.elastic}' | base64 -d
```
---
## 3. Deploy Kibana
```bash
kubectl -n elastic apply -f kb.yaml
kubectl -n elastic get kibana
```
Port forward:
```bash
kubectl -n elastic port-forward svc/kb-kb-http 5601:5601
```
Login:
- user: `elastic`
- password: from secret above
---
## 4. Generate Elastic Agent Policy (Kibana UI)
1. Kibana → Fleet / Integrations
2. Add **Kubernetes integration**
3. Enable:
- Container logs
- Kubernetes metrics
4. Save policy
5. Copy generated **agent.yml**
> This YAML becomes the content of `agent-node-datastreams` ConfigMap
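A minimal sketch of how that exported policy is typically wrapped into the ConfigMap (the name matches this runbook; the `agent.yml` body is whatever Kibana generated):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: kube-system
data:
  agent.yml: |
    # paste the policy YAML exported from Kibana here
```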
---
## 5. Prepare Secrets
### 5.1 Copy ES CA cert to kube-system
```bash
kubectl -n elastic get secret es-es-http-certs-public -o yaml \
| sed 's/namespace: elastic/namespace: kube-system/' \
| kubectl apply -f -
```
### 5.2 Create ES credential secret
```bash
kubectl -n kube-system create secret generic es-cred \
--from-literal=ES_USERNAME=elastic \
--from-literal=ES_PASSWORD=<PASSWORD> \
--dry-run=client -o yaml | kubectl apply -f -
```
---
## 6. Deploy Elastic Agent DaemonSet
```bash
kubectl -n kube-system apply -f agent-node-datastreams.yaml
kubectl -n kube-system apply -f elastic-agent-daemonset.yaml
```
Verify:
```bash
kubectl -n kube-system get ds elastic-agent
kubectl -n kube-system get pods -l app=elastic-agent -o wide
```
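If pods are not `Running`, the agent logs usually say why (DaemonSet name as used above):

```bash
kubectl -n kube-system logs ds/elastic-agent --tail=100 | grep -iE "error|fail"
```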
---
## 7. Verify Log Collection on Node
```bash
kubectl -n kube-system exec -it <agent-pod> -c elastic-agent -- sh
```
```bash
ls /var/log/containers | grep pdng | head
```
Expected:
```
pdng-cosmosrestapi-xxxxx.log
pdngauthentication-xxxxx.log
...
```
---
## 8. Verify Logs in Elasticsearch
```bash
GET _cat/indices/logs-kubernetes*?v
```
Sample index:
```
.ds-logs-kubernetes.container_logs-pdng-YYYY.MM.DD-000001
```
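The same check works from a shell outside Dev Tools (assumes a port-forward to `es-es-http` on 9200 and the `elastic` password from Section 2 in `$PASS`):

```bash
curl -sk -u "elastic:${PASS}" "https://localhost:9200/_cat/indices/logs-kubernetes*?v"
```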
Query example:
```json
GET .ds-logs-kubernetes.container_logs-pdng-*/_search
{
"size": 10,
"sort": [{"@timestamp": "desc"}],
"query": {
"term": {
"kubernetes.container.name": "pdng-cosmosrestapi"
}
}
}
```
---
## 9. Kibana Discover Usage
### Data View
- Use: `logs-*-*`
### KQL examples
```kql
data_stream.dataset: "kubernetes.container_logs"
```
```kql
kubernetes.namespace: "pdng"
```
```kql
kubernetes.container.name: "pdng-cosmosrestapi"
```
### Add fields in Discover
1. Left panel → Available fields
2. Search `message`, `kubernetes.pod.name`, `log.file.path`
3. Click **Add to table**
---
## 10. Common Issues & Fixes
### OOMKilled (CrashLoopBackOff)
Symptoms:
```text
Last State: Terminated
Reason: OOMKilled
```
Fix:
- Increase memory limits
- Reduce log volume / harvesters
### Logs exist but Discover empty
Cause:
- Missing columns
- Wrong Data View
- Wrong KQL
Always validate via **Elasticsearch API first**.
---
## 11. Dev Environment Reset (Nuclear Option)
```bash
kubectl -n elastic delete kibana kb
kubectl -n elastic delete elasticsearch es
kubectl -n elastic delete pvc -l elasticsearch.k8s.elastic.co/cluster-name=es
```
---
## 12. Remove Logs (Delete a Data Stream)
```bash
# replace <ELASTIC_PASSWORD> with the password from the es-es-elastic-user secret
kubectl -n elastic exec -it es-es-default-0 -- bash -lc \
  'PASS="<ELASTIC_PASSWORD>"; \
   curl -sku "elastic:${PASS}" -X DELETE "https://localhost:9200/_data_stream/logs-kubernetes.container_logs-pdng?pretty"'
```



## 13. Fleet Server Token Rotation
```bash
# 1. Create a new service token
bin/elasticsearch-service-tokens create elastic/fleet-server fleet-server-rotated
# 2. Update the K8s secret
kubectl -n elastic create secret generic fleet-server-service-token \
  --from-literal=token="NEW_TOKEN" \
  --dry-run=client -o yaml | kubectl apply -f -
# 3. Roll the Fleet Server
kubectl -n elastic rollout restart deploy/fleet-server
```
```mermaid
flowchart TB
  subgraph Users[Users / Ops]
    K[Kibana]
  end
  subgraph Ingress[Ingress / LB]
    IG[Ingress / LB]
  end
  subgraph ElasticStack[Elastic Stack]
    subgraph ControlPlane[Control Plane]
      FS[Fleet Server]
      KBN[Kibana]
    end
    subgraph HotTier[Hot Tier]
      EH[Elasticsearch Hot Nodes]
    end
    subgraph WarmTier[Warm Tier]
      EW[Elasticsearch Warm Nodes]
    end
    subgraph ColdTier[Cold Tier]
      EC[Elasticsearch Cold Nodes]
    end
  end
  subgraph Snapshot[Snapshot Storage]
    S3[(S3 / Azure Blob / GCS)]
  end
  subgraph K8s[Kubernetes Cluster]
    EA[Elastic Agent DaemonSet]
    KP[Kubelet / Metrics]
    LOGS[Container Logs]
  end
  %% User access
  Users --> IG --> KBN
  %% Fleet control
  KBN --> FS
  FS --> EA
  %% Data ingest
  EA --> EH
  KP --> EA
  LOGS --> EA
  %% ILM flow
  EH -->|ILM rollover| EW
  EW -->|ILM migrate| EC
  %% Searchable snapshots
  EC -->|snapshot| S3
  S3 -->|mount searchable snapshot| EC
  %% Queries
  KBN --> EH
  KBN --> EW
  KBN --> EC
```
**Fleet Server / Kibana**
- Kibana acts as the control-plane UI
- Fleet Server pushes policies down to the Elastic Agents (DaemonSet)

**Elastic Agent (K8s)**
- Collects: container logs, kubelet metrics
- Ships data to the Hot tier

**Hot → Warm → Cold**
- ILM controls rollover / migration
- Hot: writes + heavy queries
- Warm: fewer queries
- Cold: rarely queried, lowest cost

**Searchable Snapshots**
- Cold-tier indices are snapshotted to object storage (S3 / Azure Blob)
- Queries mount the snapshot; no full restore needed

**Query path**
- Kibana can query Hot / Warm / Cold directly (Cold hits the searchable snapshot)
# Fixing an Incompletely Initialized Fleet (policies & integration packages)
```bash
PASS=$(kubectl -n elastic get secret es-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
echo $PASS
KB_POD=$(kubectl -n elastic get pod -l kibana.k8s.elastic.co/name=kb -o jsonpath='{.items[0].metadata.name}')
kubectl -n elastic exec -it $KB_POD -- bash -lc \
'curl -sk -u "elastic:'"$PASS"'" -H "kbn-xsrf: true" https://localhost:5601/api/fleet/setup | head -c 2000; echo'
# Create the Fleet Server agent policy
kubectl -n elastic exec -it $KB_POD -- bash -lc \
'curl -sk -u "elastic:'"$PASS"'" \
-H "kbn-xsrf: true" -H "Content-Type: application/json" \
-X POST https://localhost:5601/api/fleet/agent_policies \
-d '"'"'{
"name": "fleet-server-policy",
"namespace": "default",
"description": "Fleet Server policy",
"has_fleet_server": true,
"monitoring_enabled": ["logs","metrics"]
}'"'"' | head -c 1200; echo'
# Create the K8s agent policy
kubectl -n elastic exec -it $KB_POD -- bash -lc \
'curl -sk -u "elastic:'"$PASS"'" \
-H "kbn-xsrf: true" -H "Content-Type: application/json" \
-X POST https://localhost:5601/api/fleet/agent_policies \
-d '"'"'{
"name": "k8s-pdng-logs",
"namespace": "pdng",
"description": "Collect Kubernetes logs/metrics for pdng namespace",
"monitoring_enabled": ["logs","metrics"]
}'"'"' | head -c 1200; echo'
# Install the Kubernetes integration package
kubectl -n elastic exec -it $KB_POD -- bash -lc \
'curl -sk -u "elastic:'"$PASS"'" \
-H "kbn-xsrf: true" -H "Content-Type: application/json" \
-X POST "https://localhost:5601/api/fleet/epm/packages/kubernetes" \
-d '"'"'{"force":true}'"'"' | head -c 1200; echo'
# Install the System integration package
kubectl -n elastic exec -it $KB_POD -- bash -lc \
'curl -sk -u "elastic:'"$PASS"'" \
-H "kbn-xsrf: true" -H "Content-Type: application/json" \
-X POST "https://localhost:5601/api/fleet/epm/packages/system" \
-d '"'"'{"force":true}'"'"' | head -c 1200; echo'
```
# Data Retention Safety Net
```bash
curl -sk -u "elastic:${PASS}" -H 'Content-Type: application/json' \
  -X PUT "https://es-es-http.elastic.svc:9200/_data_stream/logs-kubernetes.container_logs-pdng/_lifecycle" \
  -d '{
    "data_retention": "6h"
  }'
curl -sk -u "elastic:${PASS}" -H 'Content-Type: application/json' \
  -X PUT "https://es-es-http.elastic.svc:9200/_data_stream/metrics-kubernetes.container_logs-pdng/_lifecycle" \
  -d '{
    "data_retention": "6h"
  }'
curl -sk -u "elastic:${PASS}" -H 'Content-Type: application/json' \
  -X PUT "https://es-es-http.elastic.svc:9200/_cluster/settings" \
  -d '{
    "persistent": {
      "cluster.routing.allocation.disk.watermark.low": "75%",
      "cluster.routing.allocation.disk.watermark.high": "85%",
      "cluster.routing.allocation.disk.watermark.flood_stage": "90%"
    }
  }'
```
# Elastic Fleet + Kubernetes (PDNG) Setup – Full Troubleshooting & Resolution Guide
> This document captures the full troubleshooting process and the final working configuration for **Elastic Stack (ECK) + Fleet Server + Kubernetes Elastic Agent (PDNG namespace)**; it can serve as the standard README for **rebuilds / GitOps / onboarding**.
---
## Goals
* Deploy on AKS / Kubernetes:
  * Elasticsearch (ECK)
  * Kibana (ECK)
  * Fleet Server (Elastic Agent – Deployment)
  * Elastic Agent (DaemonSet)
* Use **Fleet mode** to collect:
  * **container logs** from the `pdng` namespace
  * Kubernetes **metrics / events**
* Ensure the data is queryable in **Discover / Dashboards**
---
## I. Final Architecture
```
Kibana (ECK)
 └── Fleet
      ├── Fleet Server (Agent / Deployment)
      └── Elastic Agent (DaemonSet)
           ├── logs-kubernetes.container_logs
           ├── metrics-kubernetes.*
           └── namespace = pdng
```
---
## II. Prerequisites
* Kubernetes cluster (AKS)
* ECK Operator installed
* Namespace: `elastic`
* Ability to `kubectl exec` into the Kibana / ES pods
## III. Deployment Order (**critical**)
### 1️⃣ Elasticsearch (ECK)
```bash
kubectl -n elastic apply -f elasticsearch.yaml
```
Verify:
```bash
kubectl -n elastic get es
```
---
### 2️⃣ Kibana (ECK)
**Note: `xpack.fleet.fleetServerHosts` must be an array of strings**
```yaml
xpack.fleet.enabled: true
xpack.fleet.agents.enabled: true
xpack.fleet.fleetServerHosts:
  - https://fleet-server-agent-http.elastic.svc:8220
```
After deploying, confirm Kibana is healthy:
```bash
kubectl -n elastic get pod -l kibana.k8s.elastic.co/name=kb
```
---
### 3️⃣ Initialize Fleet
```bash
PASS=$(kubectl -n elastic get secret es-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
KB_POD=$(kubectl -n elastic get pod -l kibana.k8s.elastic.co/name=kb -o jsonpath='{.items[0].metadata.name}')
kubectl -n elastic exec -it $KB_POD -- bash -lc \
'curl -sk -u "elastic:'"$PASS"'" -H "kbn-xsrf: true" -X POST https://localhost:5601/api/fleet/setup'
```
Expected result:
```json
{"isInitialized": true}
```
---
## IV. Fleet Server
### 4️⃣ Create the Fleet Server Policy
```bash
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
-X POST https://localhost:5601/api/fleet/agent_policies \
-H "Content-Type: application/json" \
-d '{
"name": "fleet-server-policy",
"namespace": "default",
"has_fleet_server": true,
"monitoring_enabled": ["logs","metrics"]
}'
```
---
### 5️⃣ Deploy the Fleet Server (Agent – Deployment)
```yaml
spec:
  mode: fleet
  fleetServerEnabled: true
  policyID: <FLEET_SERVER_POLICY_ID>
```
Verify:
```bash
kubectl -n elastic get agent fleet-server
```
Status should be: `green`
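For context, a fuller Agent CR for the Fleet Server could look like the sketch below, assembled from the fragment above and ECK's Agent CRD (the ServiceAccount name and policy ID are assumptions; adjust to your cluster):

```yaml
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
  namespace: elastic
spec:
  version: 8.13.4
  kibanaRef:
    name: kb
  elasticsearchRefs:
    - name: es
  mode: fleet
  fleetServerEnabled: true
  policyID: <FLEET_SERVER_POLICY_ID>
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: fleet-server   # assumed SA with the RBAC Fleet Server needs
        automountServiceAccountToken: true
```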
---
## V. PDNG Elastic Agent (DaemonSet)
### 6️⃣ Create the PDNG Agent Policy
```bash
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
-X POST https://localhost:5601/api/fleet/agent_policies \
-H "Content-Type: application/json" \
-d '{
"name": "k8s-pdng-logs",
"namespace": "pdng",
"monitoring_enabled": ["logs","metrics"]
}'
```
---
### 7️⃣ Install the Integration Packages
```bash
# Kubernetes
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
-X POST https://localhost:5601/api/fleet/epm/packages/kubernetes \
-H "Content-Type: application/json" -d '{"force":true}'
# System
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
-X POST https://localhost:5601/api/fleet/epm/packages/system \
-H "Content-Type: application/json" -d '{"force":true}'
```
---
### 8️⃣ Create the Kubernetes Package Policy (**critical**)
⚠️ The Fleet API does **not** accept simplified `inputs`; the full `vars` are required
👉 **Practical advice: create it via the Kibana UI the first time, then export the JSON**
Minimal working logic:
* `filestream` → container logs
* `kubernetes/metrics` → state / events
(Use the UI-exported JSON as the source of truth.)
---
### 9️⃣ Deploy the Elastic Agent (DaemonSet)
```yaml
spec:
  mode: fleet
  policyID: <PDNG_POLICY_ID>
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        hostNetwork: true
```
Verify:
```bash
kubectl -n elastic get agent
```
```text
elastic-agent-pdng green 3 / 3
```
---
## VI. Verification
### Agent Policies
```bash
curl -sk -u "elastic:$PASS" -H "kbn-xsrf: true" \
https://localhost:5601/api/fleet/agent_policies | jq '.items[] | {name,namespace,agents}'
```
### Indices
```bash
curl -sku "elastic:$PASS" https://localhost:9200/_cat/indices/logs-kubernetes*?v
curl -sku "elastic:$PASS" https://localhost:9200/_cat/indices/metrics-kubernetes*?v
```
### Discover
* Index pattern: `logs-*`, `metrics-*`
* Filter: `kubernetes.namespace : pdng`
---
## VII. Common Errors Summary
| Error | Cause |
| ------------------------------------- | ------------------------------------------------------ |
| `fleetServerHosts parse error` | Malformed Kibana config structure |
| `Unsupported saved object type` | Fleet objects cannot be removed by deleting ES indices |
| `FLEET_URL is required` | Agent is not a Fleet Server but no Fleet URL was set |
| `inputs.undefined.vars.* is required` | Package policy created via API without the required vars |
---
## VIII. Best Practices (Key Points)
✅ Always create a Fleet integration **via the UI the first time**
✅ Then codify it with the API / GitOps
❌ Do not hand-write minimal `inputs` (it will fail validation)
---
## Done 🎉
At this point:
* Fleet Server: `green`
* Elastic Agent (pdng): `green`
* pdng logs & metrics are visible in Discover / Dashboards
---
*This README was generated from a real-world production troubleshooting session.*
# Elastic Agent + Fleet on AKS — Kubernetes Logs & Metrics Troubleshooting Guide
> **Status: ✅ Resolved**
>
> This document summarizes the full end‑to‑end troubleshooting process that led to successfully collecting
> `logs-kubernetes*` and `metrics-kubernetes*` data streams using **Elastic Agent (Fleet mode)** on **AKS**.
---
## 🧩 Problem Statement
Even though:
- Elastic Agent was **HEALTHY**
- Fleet showed the Kubernetes integration **enabled**
- Elasticsearch output was reachable
- No fatal errors appeared in Kibana UI
❌ **No `logs-kubernetes*` or `metrics-kubernetes*` data streams were created**
❌ `_data_stream` API returned nothing
❌ Discover showed no Kubernetes data
---
## 🧠 Root Causes (Multiple, Layered)
This issue was **not a single misconfiguration**, but a combination of Fleet, Kubernetes, and RBAC behaviors.
### 1. Fleet Integration UI ≠ Active Policy Revision
- Even if all toggles are ON, Fleet **does NOT re-render inputs** unless:
- The integration is explicitly **Saved**
- A new **policy revision** is generated
➡️ Result: Agent was running, but **no Kubernetes inputs were actually applied**
---
### 2. Missing ServiceAccount Token in Agent Pods
Errors observed:
```
reading bearer token file: /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
```
Cause:
- `automountServiceAccountToken` was disabled (default in some hardened setups)
Impact:
- Kubernetes metrics inputs failed immediately
- No data streams were created
---
### 3. Insufficient RBAC for Cluster‑Scoped Resources
Errors observed:
```
cannot list resource "nodes"
cannot get resource "leases.coordination.k8s.io"
```
Cause:
- Elastic Agent requires **cluster‑level RBAC**
- Especially for:
- `nodes`
- `leases` (leader election)
---
## ✅ Final Working Architecture
```
Elastic Agent (DaemonSet, Fleet mode)
 ├─ ServiceAccount token mounted
 ├─ ClusterRole + ClusterRoleBinding
 ├─ Kubernetes integration saved (policy revision bumped)
 └─ Data streams auto‑created:
      - logs-kubernetes.container_logs-pdng
      - metrics-kubernetes.node-pdng
      - metrics-kubernetes.pod-pdng
      - metrics-kubernetes.state_*
```
---
## 🛠️ Step‑by‑Step Fix (Authoritative)
### ✅ Step 1 — Fix ServiceAccount Token Mount
```bash
kubectl -n elastic patch agent elastic-agent-pdng --type=merge -p '{
  "spec": {
    "daemonSet": {
      "podTemplate": {
        "spec": {
          "automountServiceAccountToken": true
        }
      }
    }
  }
}'
kubectl -n elastic patch sa elastic-agent --type=merge -p '{
  "automountServiceAccountToken": true
}'
```
Restart agent:
```bash
kubectl -n elastic rollout restart ds/elastic-agent-pdng-agent
```
---
### ✅ Step 2 — Verify RBAC (Must Be YES)
```bash
kubectl auth can-i list nodes --as=system:serviceaccount:elastic:elastic-agent
kubectl auth can-i get leases.coordination.k8s.io -n elastic --as=system:serviceaccount:elastic:elastic-agent
```
If **NO**, fix ClusterRole / ClusterRoleBinding.
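A minimal sketch of the missing cluster-scoped RBAC (names assume the `elastic-agent` ServiceAccount in the `elastic` namespace used above; extend the resource list as your integrations require):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods", "namespaces", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: elastic-agent
subjects:
  - kind: ServiceAccount
    name: elastic-agent
    namespace: elastic
```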
---
### ✅ Step 3 — **Critical Step: Re‑Save Fleet Integration**
> Even if everything looks enabled — **SAVE IT AGAIN**
In Kibana:
```
Fleet → Agent Policies → Kubernetes integration → Save integration
```
Why:
- Forces new policy revision
- Re-renders inputs
- Triggers agent reload
⚠️ This step is **mandatory**.
---
### ✅ Step 4 — Verify Agent Runtime Config
```bash
kubectl -n elastic exec -it <agent-pod> -- sh -lc '
elastic-agent status
elastic-agent inspect components --show-config | egrep "dataset: kubernetes|data_stream" | head -n 50
'
```
Expected:
```
dataset: kubernetes.container_logs
dataset: kubernetes.node
dataset: kubernetes.pod
```
---
### ✅ Step 5 — Confirm Data Streams Exist
```bash
curl -sk -u "elastic:${PASS}" "https://es-es-http.elastic.svc:9200/_data_stream" | jq -r '.data_streams[].name' | egrep "logs-kubernetes|metrics-kubernetes"
```
✅ If present → pipeline is fully functional.
---
## 🔍 Validation Signals (Green Flags)
- Discover shows Kubernetes logs
- `_data_stream` lists logs/metrics
- Agent logs show `acked` events
- No more `403`, `forbidden`, `no configuration provided`
---
## ❗ Common Pitfalls (Avoid These)
| Mistake | Result |
|------|------|
| Not re‑saving integration | No inputs applied |
| No SA token | Metrics always fail |
| Namespace‑only RBAC | Node metrics forbidden |
| Assuming UI = runtime | Silent failure |
---
## 🏁 Final Verdict
✅ Configuration: **Correct**
✅ RBAC: **Correct**
✅ Fleet Policy: **Applied**
✅ Elastic Agent: **Healthy & Publishing**
🎉 **Kubernetes logs and metrics fully operational**
---
## 📎 Notes
- Tested on:
- Elastic Stack 8.13.4
- AKS
- ECK + Fleet Server
- Applies to:
- Logs
- Metrics
- Future Kubernetes integrations
---
**Author:** Troubleshooting session distilled from live production debugging
**Status:** Battle‑tested ✔️