# Kubernetes Container Logs – ILM Tiering + SLM Snapshots (ECK on AKS)

This README documents what was configured for the **Fleet-managed Kubernetes container logs** data stream:

- **Data stream:** `logs-kubernetes.container_logs-pdng`
- **Goal:** keep hot/warm/cold tiering for performance & cost control, automatically delete old indices, and still keep a longer **backup** in Azure Blob via **SLM snapshots**.
- **Important constraint:** the cluster license is **not compliant for searchable snapshots**, so we **do not** use the ILM `searchable_snapshot` action.

Instead we use:

- **ILM** for hot → warm → cold → delete
- **SLM** to snapshot indices to **Azure Blob** (`azure_repo`)
- **Restore-on-demand** when you need to query data older than what ILM keeps online

---

## 1. Architecture

```mermaid
flowchart TB
  subgraph Users["Users / Ops"]
    U["Kibana Discover / Dashboards"]
  end

  subgraph Ingress["Ingress / LB"]
    IG["Ingress / LB"]
  end

  subgraph ElasticStack[Elastic Stack on ECK]
    subgraph ControlPlane[Control Plane]
      FS[Fleet Server]
      KBN[Kibana]
    end
    subgraph HotTier[Hot Tier]
      EH[Elasticsearch Hot Nodes]
    end
    subgraph WarmTier[Warm Tier]
      EW[Elasticsearch Warm Nodes]
    end
    subgraph ColdTier[Cold Tier]
      EC[Elasticsearch Cold Nodes]
    end
  end

  subgraph Snapshot[Snapshot Storage]
    AZ[("Azure Blob: azure_repo / es-snapshots")]
  end

  subgraph K8s[Kubernetes Cluster]
    EA["Elastic Agent (DaemonSet)"]
    KP["Kubelet / Metrics"]
    LOGS[Container Logs]
  end

  %% User access
  U --> IG --> KBN

  %% Fleet control
  KBN --> FS
  FS --> EA

  %% Data ingest
  KP --> EA
  LOGS --> EA
  EA --> EH

  %% ILM tiering (rollover happens on hot; later phases move the index)
  EH -->|"ILM warm @ 7d"| EW
  EW -->|"ILM cold @ 30d"| EC
  EC -->|"ILM delete @ 90d"| X[(deleted)]

  %% Snapshots
  EH -.->|SLM snapshot schedule| AZ
  EW -.->|SLM snapshot schedule| AZ
  EC -.->|SLM snapshot schedule| AZ

  %% Restore
  AZ -->|Restore| R["restore-* indices (temporary)"]
  KBN --> R
```

**Key point:** without searchable snapshots, **Azure snapshots are not directly queryable**. To query historical data, **restore** a snapshot into temporary `restore-*` indices, query, then delete them.

---

## 2. What we changed

### 2.1 ILM: hot 7 / warm 30 / cold 90 / delete 90 (no searchable snapshot)

The name encodes the policy thresholds: rollover at 7 d (or 50 GB per primary shard), warm at `min_age` 7 d, cold at 30 d, delete at 90 d.

- New ILM policy: `pdng-logs-hot7-warm30-cold90-del90`
- Applied only to `logs-kubernetes.container_logs-pdng` via a **high-priority override index template**
- Triggered a rollover so the next backing index uses the new ILM policy

Result observed:

- `.ds-...-000001` stayed on the default `logs` policy (expected)
- `.ds-...-000002` uses `pdng-logs-hot7-warm30-cold90-del90` (correct)

### 2.2 SLM: snapshots to Azure Blob (backup / longer retention)

- Snapshot repository exists: `azure_repo` (Azure container `es-snapshots`)
- SLM policy snapshots `.ds-logs-kubernetes.container_logs-pdng-*`
- Snapshots are retained (example: 120 days) so you can restore data after ILM deletes it

---

## 3. Step-by-step Setup

> Run commands in **Kibana → Dev Tools**, or with `curl` against Elasticsearch.

### 3.1 Confirm data stream and snapshot repo

```http
GET _data_stream/logs-kubernetes.container_logs-pdng

GET _snapshot/azure_repo
```

### 3.2 Create ILM policy (no searchable snapshot)

```http
PUT _ilm/policy/pdng-logs-hot7-warm30-cold90-del90
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_age": "7d", "max_primary_shard_size": "50gb" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

### 3.3 Create an override index template for the data stream

This does **not** modify Fleet-managed templates/pipelines; the intent is to override only ILM for this specific data stream. Be aware that Elasticsearch applies only the single highest-priority matching index template and does not merge index templates, so the override must itself reference (via `composed_of`) any Fleet component templates whose mappings, settings, and default pipeline should be kept. Step 3.4 is where to confirm this.
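Before creating the override, it may be worth capturing what currently resolves for the data stream, so that the simulated result in step 3.4 can be compared against a baseline. A minimal sketch, assuming the Fleet-managed index template name starts with `logs-kubernetes.container_logs` (an assumption; adjust the wildcard if your template is named differently):

```http
# Baseline: settings, mappings and pipeline resolved before the override exists
POST _index_template/_simulate_index/logs-kubernetes.container_logs-pdng

# Existing template(s) and their composed_of lists (name pattern is an assumption)
GET _index_template/logs-kubernetes.container_logs*
```

Keep the output at hand: if mappings or `index.default_pipeline` disappear once the override is in place, the `composed_of` list returned here is what the override would need to carry. The override template itself is: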
```http
PUT _index_template/logs-kubernetes.container_logs-pdng-ilm
{
  "index_patterns": ["logs-kubernetes.container_logs-pdng"],
  "priority": 500,
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "pdng-logs-hot7-warm30-cold90-del90"
    }
  },
  "_meta": {
    "description": "Override ILM policy for logs-kubernetes.container_logs-pdng only (keep Fleet pipelines/mappings)"
  }
}
```

### 3.4 Validate the template override (simulate)

```http
POST _index_template/_simulate_index/logs-kubernetes.container_logs-pdng
```

Expected:

- `index.lifecycle.name` is `pdng-logs-hot7-warm30-cold90-del90`
- Other matching templates are listed under `overlapping` (normal), but the final lifecycle name is ours
- The Fleet mappings and `index.default_pipeline` still appear in the resolved template; if they do not, add the Fleet-managed template's `composed_of` list to the override before rolling over

### 3.5 Rollover once to apply the new policy to a fresh backing index

```http
POST logs-kubernetes.container_logs-pdng/_rollover
```

### 3.6 Verify ILM applied to the new generation

```http
GET _data_stream/logs-kubernetes.container_logs-pdng

GET .ds-logs-kubernetes.container_logs-pdng-*/_ilm/explain
```

Expected:

- `...-000001` policy: `logs`
- `...-000002` policy: `pdng-logs-hot7-warm30-cold90-del90`

---

## 4. SLM (Snapshots) Setup

### 4.1 Create an SLM policy (daily snapshots)

Example: daily at 01:30 (SLM cron schedules are evaluated in UTC), keep snapshots for 120 days.

```http
PUT _slm/policy/slm-pdng-container-logs-daily
{
  "schedule": "0 30 1 * * ?",
  "name": "<slm-pdng-container-logs-{now/d}>",
  "repository": "azure_repo",
  "config": {
    "indices": [".ds-logs-kubernetes.container_logs-pdng-*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "120d",
    "min_count": 7,
    "max_count": 200
  }
}
```

### 4.2 Trigger a snapshot immediately (test)

```http
POST _slm/policy/slm-pdng-container-logs-daily/_execute
```

Check status:

```http
GET _slm/status

GET _slm/policy/slm-pdng-container-logs-daily

GET _snapshot/azure_repo/_all
```

---

## 5. How to query data after ILM deletes it (Restore-on-demand)

### 5.1 List available snapshots

```http
GET _snapshot/azure_repo/_all
```

Pick the snapshot you want. Names start with the pattern from the policy (e.g., `slm-pdng-container-logs-2026.02.01`), and SLM appends a unique suffix, so copy the full name from the listing.

### 5.2 Restore into temporary indices (`restore-*`) and allocate to cold tier

Replace `<SNAPSHOT_NAME>`.

```http
POST _snapshot/azure_repo/<SNAPSHOT_NAME>/_restore
{
  "indices": ".ds-logs-kubernetes.container_logs-pdng-*",
  "rename_pattern": ".ds-logs-kubernetes.container_logs-pdng-(.+)",
  "rename_replacement": "restore-logs-k8s-pdng-$1",
  "include_global_state": false,
  "index_settings": {
    "index.routing.allocation.include._tier_preference": "data_cold"
  }
}
```

Restored indices keep the settings they had when the snapshot was taken, which can include `index.lifecycle.name`. Check them with `_ilm/explain` after the restore; if ILM is still managing them, the restore API's `ignore_index_settings` parameter (for example `["index.lifecycle.name"]`) can be used to drop the policy so the restored copies are not aged out during the investigation.

### 5.3 Query in Kibana

In **Discover**:

- Use index pattern: `restore-*` (or filter `_index: restore-*`)
- Queries work the same as live logs (e.g., `kubernetes.container.name : "pdng-cosmosrestapi"`)

### 5.4 Cleanup after investigation

```http
DELETE restore-logs-k8s-pdng-*
```

If the cluster has `action.destructive_requires_name` set to `true` (the default on Elasticsearch 8.x), wildcard deletes are rejected; list the indices with `GET _cat/indices/restore-logs-k8s-pdng-*` and delete them by name instead.

---

## 6. Notes / Troubleshooting

### Searchable snapshot is blocked by license

If you see an error like:

- `non-compliant for [searchable-snapshots]`

Then you **cannot** use ILM `searchable_snapshot`. Use SLM + restore instead (this README’s approach).

### ILM tier migration

New indices are created on `data_hot` (expected). ILM migrates older indices to warm/cold when `min_age` is reached, assuming the cluster has those tiers available. For an index that has rolled over, `min_age` is measured from the rollover time rather than from index creation, so a backing index can stay on hot nodes for up to about 14 days after creation (up to 7 days until rollover, plus the 7-day warm `min_age`).

### Old backing indices

Templates only apply to newly created backing indices. To change the ILM policy of an existing backing index, update its `index.lifecycle.name` setting manually, and validate the result with `_ilm/explain`.

---

## 7. Quick verification commands

```http
POST _index_template/_simulate_index/logs-kubernetes.container_logs-pdng

GET _data_stream/logs-kubernetes.container_logs-pdng

GET .ds-logs-kubernetes.container_logs-pdng-*/_ilm/explain

GET _slm/status

GET _snapshot/azure_repo/_all
```

---

**End.**