# 2025 TAM Day | OpenShift EDA + OCP MCP server

## 0. Environment overview

1. Go to the info page, click the VScode server link, and log in with the password provided in the table.
![](https://hackmd.io/_uploads/H13pTi4zT.png)
2. Go to the info page, click the Automation controller link, and log in with the username and password provided in the table.
![](https://hackmd.io/_uploads/HyqR6s4GT.png)
3. In the Automation controller console, click Jobs under "Automation Execution". The Jobs section shows the status of every job template triggered by EDA.
![image](https://hackmd.io/_uploads/BkhcJ0fOgg.png)
4. Go back to the VScode server console and open a terminal.
![](https://hackmd.io/_uploads/SyfeRiVfT.png)
5. Split the terminal so that you can run commands and watch their output side by side.
![](https://hackmd.io/_uploads/rJzR0i4Mp.png)

## 1. Automatically add a Resource Quota

## 2. Automatically back up PVCs

## 3. Automatically configure Ingress encryption

## 4. Automatically collect debug logs (OCP EDA for auto oc adm inspect)

### Add bastion info in AAP

1. Create an inventory named Bastion.
![Screenshot 2025-08-08 at 9.37.56 AM](https://hackmd.io/_uploads/B1vvr0Mdlx.jpg)
![](https://hackmd.io/_uploads/ry9mzwY1T.png)
2. Add a host to the inventory.
![Screenshot 2025-08-08 at 9.40.50 AM](https://hackmd.io/_uploads/Hyh6HRf_ee.jpg)
![](https://hackmd.io/_uploads/B1WhGvtJT.png)
![](https://hackmd.io/_uploads/H1pIbPY1a.png)
3. Create a credential for the bastion.
![Screenshot 2025-08-08 at 9.41.49 AM](https://hackmd.io/_uploads/BJnZUCzOee.jpg)
![Screenshot 2025-08-08 at 9.44.10 AM](https://hackmd.io/_uploads/SyojI0M_ge.jpg)
4. Create a host group in the bastion inventory and add the bastion to the host group.
![image](https://hackmd.io/_uploads/BJ175QXdel.png)
![image](https://hackmd.io/_uploads/BJBRDX7Oel.png)
![image](https://hackmd.io/_uploads/BkEm_7m_ex.png)
![image](https://hackmd.io/_uploads/r1SsdX7_ge.png)
![image](https://hackmd.io/_uploads/BkVxtmm_xe.png)
![image](https://hackmd.io/_uploads/HJNWFQQOgg.png)

### Create new playbook for oc adm inspect

![](https://hackmd.io/_uploads/HJkjidtJT.png)
![](https://hackmd.io/_uploads/Bk6siOt1T.png)

1. Clone the event-driven-ansible repo in your VScode terminal.
```
cd ~
git clone https://gitea.apps.cluster-7f74v.7f74v.sandbox734.opentlc.com/lab-user/event-driven-ansible.git
```
![image](https://hackmd.io/_uploads/B1TgHeXuxx.png)
2. Under event-driven-ansible/automation_controller, create the playbook `oc-inspect.yml`.
![image](https://hackmd.io/_uploads/ry3mHx7_eg.png)

oc-inspect.yml
```yaml=
- name: oc adm inspect
  hosts: bastion
  gather_facts: no
  vars:
    ns: "{{ ansible_eda.event.resource.metadata.namespace }}"
  tasks:
    - name: Create inspect file
      shell: "oc adm inspect ns/{{ ns }} --kubeconfig /home/lab-user/.kube/config"
      register: lsout
```
3. Push the Git project.
```
cd ~/event-driven-ansible
git add *
git commit -am "playbook for oc adm inspect"
git push
```
VScode will ask you to log in to Gitea.
![image](https://hackmd.io/_uploads/SkdewgXdlg.png)

After the push completes, check your commit ID.
![Screenshot 2025-08-08 at 9.48.39 AM](https://hackmd.io/_uploads/BJ65w0Gdeg.jpg)

### Create job template in AAP

1. Sync the AAP project from the Git server.
![Screenshot 2025-08-08 at 9.47.25 AM](https://hackmd.io/_uploads/BkoswCzOlg.jpg)

Create a new template named `oc-inspect`.
![Screenshot 2025-08-08 at 9.49.37 AM](https://hackmd.io/_uploads/S11ydRfOlg.jpg)
![Screenshot 2025-08-08 at 9.51.14 AM](https://hackmd.io/_uploads/rkOO_CfOee.jpg)

### Create new rulebook

In the VScode console, switch to eda-rulebooks/rulebooks and create a file named `oc-inspect.yml`.
![image](https://hackmd.io/_uploads/rJhIIeQ_ee.png)

Edit oc-inspect.yml so that it contains the following:
```yaml=
---
- name: Listen for unhealthy+warning event
  hosts: all
  sources:
    - sabre1041.eda.k8s:
        api_version: v1
        kind: Event
        namespace: jace  # replace with the namespace you plan to use
  rules:
    - name: Debug
      condition: event.resource.reason == "Unhealthy" and event.resource.type == "Warning"
      throttle:
        once_within: 5 minutes
        group_by_attributes:
          - event.resource.metadata.namespace
          - event.resource.involvedObject.name
      action:
        run_job_template:
          name: oc-inspect  # must match the template name in AAP
          organization: Default
```
![](https://hackmd.io/_uploads/rkEmD6xzT.png)

#### Commit and push your rulebook

In the VScode terminal:
```bash=
cd ~/eda-rulebooks
git add *
git commit -am "rulebook for oc adm inspect"
git push
```
VScode will ask you to log in to Gitea.
![image](https://hackmd.io/_uploads/SkdewgXdlg.png)

### Create Rulebook Activation in AAP

#### Sync the rulebook project again

![image](https://hackmd.io/_uploads/rkQ4uemdlg.png)

Click Create `rulebook activation`.
![image](https://hackmd.io/_uploads/BJjWOgQ_ll.png)

Select and fill in the fields as in the following example.
![image](https://hackmd.io/_uploads/Bk7qOxmulg.png)

After that, you can see that your rule is activated.
![image](https://hackmd.io/_uploads/r1kndxmOel.png)

### Test (create a pod whose probe fails automatically)

In your VScode terminal:
```bash=
oc new-project jace
cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    securityContext:
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
      capabilities:
        drop:
        - ALL
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
EOF
```
![](https://hackmd.io/_uploads/ryGBC_typ.png)
![](https://hackmd.io/_uploads/rkbtCOKy6.png)

### Verify that the inspect succeeded

![](https://hackmd.io/_uploads/HkHPeYY16.png)
![](https://hackmd.io/_uploads/SJBulFYJ6.png)

## 5. [Optional] Automatically package the inspect file and create a support case

0. Generate your Red Hat token on the Red Hat website: https://access.redhat.com/management/api
![](https://hackmd.io/_uploads/ryQsZyBMT.png)
1. Continuing from lab 4, replace the contents of `oc-inspect.yml` with the following:
```yaml=
- name: oc adm inspect
  hosts: bastion
  gather_facts: no
  vars:
    ns: "{{ ansible_eda.event.resource.metadata.namespace }}"
  tasks:
    - name: Create inspect file
      shell: "rm -rf ./inspect.local.{{ ns }} && oc adm inspect ns/{{ ns }} --dest-dir=./inspect.local.{{ ns }} --kubeconfig /home/lab-user/.kube/config"
      register: lsout

    - name: Compress with tar
      command: "tar -czf inspect.local.{{ ns }}.tar.gz inspect.local.{{ ns }}"
      when: lsout.rc == 0

    - name: Create support case
      shell: |
        RH_PORTAL_TOKEN=$(<rh-customer-portal-token)  # replace with your token
        TOKEN=$(curl https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token -d grant_type=refresh_token -d client_id=rhsm-api -d refresh_token=$RH_PORTAL_TOKEN | jq --raw-output .access_token)
        response=$(curl -sS -X POST -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" --data '{
          "product": "OpenShift Container Platform",
          "version": "4.12",
          "caseType": "RCA Only",
          "description": "My pod crashed last night, I was wondering about RCA",
          "environment": "staging",
          "caseLanguage": "zh_TW",
          "severity": 3,
          "summary": "Summary message here."
        }' "https://api.access.redhat.com/support/v1/cases")
        echo $response | jq -r '.location[0] | capture("/cases/(?<case_no>[0-9]+)") | .case_no'
      register: case_number
      when: lsout.rc == 0

    - name: Upload logs to support case
      shell: |
        RH_PORTAL_TOKEN=$(<rh-customer-portal-token)
        TOKEN=$(curl https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token -d grant_type=refresh_token -d client_id=rhsm-api -d refresh_token=$RH_PORTAL_TOKEN | jq --raw-output .access_token)
        CASE_NO={{ case_number.stdout }}
        curl -X POST -F "file=@inspect.local.{{ ns }}.tar.gz" -H "Authorization: Bearer $TOKEN" https://api.access.redhat.com/support/v1/cases/${CASE_NO}/attachments
      when: lsout.rc == 0 and case_number.stdout is defined
```
2. Replace the original example pod with a Deployment and a ConfigMap.

**html-configmap.yaml**
```bash
cat << EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: html-content
data:
  index.html: |
    <h1>Hello world! Welcome to K8s Summit 2023</h1>
EOF
```

```bash=
# nginx-hello-world.yaml
cat << EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-hello-world
  labels:
    app: nginx-hello-world
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-hello-world
  template:
    metadata:
      labels:
        app: nginx-hello-world
    spec:
      volumes:
        - name: html-volume
          configMap:
            name: html-content
      containers:
        - name: nginx
          image: "quay.io/redhattraining/hello-nginx:v1.0"
          securityContext:
            allowPrivilegeEscalation: false
            runAsNonRoot: true
            seccompProfile:
              type: RuntimeDefault
            capabilities:
              drop:
              - ALL
          volumeMounts:
            - name: html-volume
              mountPath: /usr/share/nginx/html
          resources:
            requests:
              memory: "64Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "500m"
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - curl -s http://localhost:8080 | grep -q "world"
            initialDelaySeconds: 5
            periodSeconds: 5
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          tty: true
          stdin: true
      serviceAccount: default
      terminationGracePeriodSeconds: 5
EOF
```

```bash=
oc expose deploy/nginx-hello-world --port 8080
oc expose svc nginx-hello-world
```

## 6. EDA automatically calls the OCP MCP server for error analysis

### [On Bastion or your VScode terminal] Install Gemini-CLI

1. Install Node.js
```
# Download and install nvm:
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash

# in lieu of restarting the shell
\. "$HOME/.nvm/nvm.sh"

# Download and install Node.js:
nvm install 22

# Verify the Node.js version:
node -v # Should print "v22.17.1".
nvm current # Should print "v22.17.1".

# Verify npm version:
npm -v # Should print "10.9.2".
```
2. Install Gemini-CLI
```
npm install -g @google/gemini-cli
```
3. Get your API key: https://aistudio.google.com/app/apikey
![image](https://hackmd.io/_uploads/Hk-hX7kPgg.png)
```
export GEMINI_API_KEY="<your-API-KEY>"
echo 'export GEMINI_API_KEY="<your-API-KEY>"' >> ~/.bashrc
```
4. Edit the Gemini settings
```
mkdir -p ~/.gemini
vi ~/.gemini/settings.json
# Put the following JSON into settings.json:
{
  "theme": "GitHub",
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": [
        "-y",
        "rh-tam-kubernetes-mcp-server@latest"
      ]
    }
  }
}
```

### Write the playbook (let Gemini automatically analyze problems in a given namespace)

In the VScode project, create `gemini-analyze.yml` under event-driven-ansible/automation_controller.
![image](https://hackmd.io/_uploads/HkH_J-QOeg.png)
```yaml=
- name: Run gemini to troubleshoot given NS
  hosts: bastion
  gather_facts: no
  vars:
    ns: "{{ ansible_eda.event.resource.metadata.namespace }}"
  tasks:
    - name: Gemini analyze NS
      shell: 'gemini -p "Analyze what is abnormal in namespace {{ ns }} of the current OpenShift cluster (key points only, do not show the reasoning process)"'
      register: lsout

    - name: Show command output
      debug:
        var: lsout.stdout_lines
```

In the VScode terminal:
```bash=
cd ~/event-driven-ansible
git add *
git commit -am "Add gemini_analyze playbook"
git push
```

Resync the template project.
![image](https://hackmd.io/_uploads/rJy1Rg7dgl.png)

Create a new template called gemini-analyze.
![image](https://hackmd.io/_uploads/SkO-AeXdxg.png)
![Screenshot 2025-08-08 at 12.41.57 PM](https://hackmd.io/_uploads/SkoBlZ7_lx.jpg)

### Write the rulebook (on trigger, collect oc inspect output and let Gemini analyze the problem)

In the VScode project, create `oc-inspect-analyze.yml` under eda-rulebooks/rulebooks.
![image](https://hackmd.io/_uploads/rkpchxXuel.png)
```yaml=
---
- name: Listen for unhealthy+warning event
  hosts: all
  sources:
    - sabre1041.eda.k8s:
        api_version: v1
        kind: Event
        namespace: jace  # replace with the namespace you plan to use
  rules:
    - name: Debug
      condition: event.resource.reason == "Unhealthy" and event.resource.type == "Warning"
      throttle:
        once_within: 5 minutes
        group_by_attributes:
          - event.resource.metadata.namespace
          - event.resource.involvedObject.name
      actions:
        - run_job_template:
            name: oc-inspect  # must match the template name in AAP
            organization: Default
        - run_job_template:
            name: gemini-analyze
            organization: Default
```
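The `condition` and `throttle` settings in the rulebook above decide which events start the job templates. The following is a minimal sketch of that behaviour in plain shell and awk, not ansible-rulebook code. It assumes that `once_within` lets the first matching event of each group through and suppresses later ones until the window has passed. The function name `simulate_rule` and the one-line event format are hypothetical and exist only for this illustration.

```bash
#!/usr/bin/env bash
# Hypothetical simulation of the rule above; this is not ansible-rulebook code.
# Input: one event per line as "epoch_seconds type reason namespace object_name".
# Output: "run", "skip", or "ignore", followed by the original event line.
simulate_rule() {
  awk -v window=300 '
    # Mirrors the condition: only Warning events with reason Unhealthy match.
    $2 != "Warning" || $3 != "Unhealthy" { print "ignore " $0; next }
    {
      key = $4 "/" $5                  # mirrors group_by_attributes
      if (!(key in last) || $1 - last[key] >= window) {
        last[key] = $1                 # first event in the window: action runs
        print "run " $0
      } else {
        print "skip " $0               # same group inside the window: throttled
      }
    }'
}

# Example: two probe failures for the same pod 60 s apart,
# one unrelated Normal event, and one failure for a different pod.
printf '%s\n' \
  "1000 Warning Unhealthy jace liveness-exec" \
  "1060 Warning Unhealthy jace liveness-exec" \
  "1070 Normal Pulled jace liveness-exec" \
  "1080 Warning Unhealthy jace other-pod" |
  simulate_rule
```

In this sketch the second failure of `liveness-exec` is throttled, while `other-pod` forms its own group and still triggers the action. This is why the lab groups by both namespace and involved object name: a pod that keeps failing its probe does not start a new inspect job every five seconds.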
```bash=
cd ~/eda-rulebooks
git add *
git commit -am "add inspect_analyze rulebook"
git push
```

#### Sync the rulebook project again

![image](https://hackmd.io/_uploads/rkQ4uemdlg.png)

#### Create a new rulebook activation

![image](https://hackmd.io/_uploads/HJt_b-7uxg.png)
![Screenshot 2025-08-08 at 12.48.26 PM](https://hackmd.io/_uploads/SkZ0WW7dll.jpg)

You should see that the new rule is activated.
![image](https://hackmd.io/_uploads/rJs-fWQugg.png)

Disable the previous rule, oc-inspect.
![image](https://hackmd.io/_uploads/rJn_XbmOgl.png)
![Screenshot 2025-08-08 at 12.56.23 PM](https://hackmd.io/_uploads/Bky5XbX_eg.jpg)

Let's deploy the pod again to trigger the event.
```
oc delete pod liveness-exec --force
cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    securityContext:
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
      capabilities:
        drop:
        - ALL
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
EOF
```

After a while, you should see that the oc-inspect and gemini-analyze jobs were triggered in a row.
![image](https://hackmd.io/_uploads/Sk1pmbQdex.png)
![messageImage_1754637213635](https://hackmd.io/_uploads/BkRzVXQdlg.jpg)

## 7. Gemini summarizes the problem and creates a support case

### Set up the environment

- Obtain your Red Hat API token
- Set up the bastion environment
```
echo 'export RH_PORTAL_TOKEN="<your-RH-API-TOKEN>"' >> ~/.bashrc
```

### Create the playbook

- Create automation_controller/oc-inspect-create-case.yml
```
- name: Run gemini to troubleshoot given NS
  hosts: bastion
  gather_facts: no
  vars:
    ns: "{{ ansible_eda.event.resource.metadata.namespace }}"
  tasks:
    - name: Create inspect file
      shell: "oc adm inspect ns/{{ ns }} --dest-dir=./inspect.local.{{ ns }} --kubeconfig /home/lab-user/.kube/config"
      register: inspect_file

    - name: Compress with tar
      command: "tar -czf inspect.local.{{ ns }}.tar.gz inspect.local.{{ ns }}"
      when: inspect_file.rc == 0

    - name: Gemini analyze NS
      shell: 'gemini -p "Analyze what is abnormal in namespace {{ ns }} of the current OpenShift cluster (single-line key summary only, do not show the reasoning process)"'
      register: lsout

    - name: Show command output
      debug:
        var: lsout.stdout_lines

    - name: create support case
      shell: 'gemini -p "Open an OCP support case with the title [test case] My pod crashed last night, I was wondering about RCA, the description {{ lsout.stdout_lines }}, and upload the file inspect.local.{{ ns }}.tar.gz as an attachment"'
      register: case_output

    - name: show create case result
      debug:
        var: case_output
```
- Sync the project
![image](https://hackmd.io/_uploads/BJYDwHmOll.png)
- Create the job template
![image](https://hackmd.io/_uploads/SJqEPB7_le.png)

### Create the rulebook

- Create the rulebook file
```
vi ~/eda-rulebooks/rulebooks/oc_inspect_support_case.yml
# File contents:
---
- name: Listen for unhealthy+warning event
  hosts: all
  sources:
    - sabre1041.eda.k8s:
        api_version: v1
        kind: Event
        namespace: jace  # replace with the namespace you plan to use
  rules:
    - name: Debug
      condition: event.resource.reason == "Unhealthy" and event.resource.type == "Warning"
      throttle:
        once_within: 5 minutes
        group_by_attributes:
          - event.resource.metadata.namespace
          - event.resource.involvedObject.name
      actions:
        - run_job_template:
            name: oc-inspect-create-case
            organization: Default
```
- Push
```
cd ~/eda-rulebooks/
git add .
git commit -m "add oc inspect create support case rulebook"
git push
```
- Sync the project
![image](https://hackmd.io/_uploads/Sk29wBmOge.png)
- Create the rulebook activation
![image](https://hackmd.io/_uploads/B13kdBmOgg.png)

### Test

```
oc delete pod liveness-exec --force
cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    securityContext:
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
      capabilities:
        drop:
        - ALL
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
EOF
```
![image](https://hackmd.io/_uploads/H1xpbI7dxx.png)

## References

Rulebook syntax:
- https://www.redhat.com/en/topics/automation/what-is-an-ansible-rulebook
- https://ansible.readthedocs.io/projects/rulebook/en/stable/rules.html