
Streamline Incident Management - 蕭兆洋 Charles Hsiao

Welcome to the DevOpsDays Taipei 2024 shared notes :mega:
Shared notes entry point: https://hackmd.io/@DevOpsDay/2024

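Session introduction

Fill in the session satisfaction survey | Send feedback to the hard-working speaker

》Reference material prepared by Charles Hsiao

Shared notes start here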

Interests: diving and mountaineering

Case study

Noon: HTTP 502 ~~~~

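  1. "RD, RD, what happened?" Answer: check k8s yourself
    a. Me: k8s is normal
    b. RD: k8s is normal
  2. The max website has a problem
  3. VIP: problems placing orders on the max website
  4. Support: cannot connect to the max website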

Dinner time: HTTP 502 still there

  1. Rollback: no effect
  2. DBA redid the DB operation

Coin price up 20%, customers furious

Unmanaged Incident

  • Information is scattered
  • Lack of communication
  • No SOP (DB restart, backend rollback, etc.)

Managed Incident

Incident bot

  • Can provide an incident management SOP
    • Initial incident
    • Investigation and fixing
      • Severity level
        • Level 1 to 4
        • Assign to other departments' leads
    • Verification
      • QA regression test
      • If OK, close the incident
      • Get feedback from various departments
    • Close incident
      • Assign an engineer to propose an RCA and follow up
      • The RCA must cover at least these key points: timeline, impact, findings, root cause
    • Postmortem
      • Bad apple theory vs. just culture
        • Just culture: blameless culture (constructive communication, not targeting individuals, no scapegoats, focused on solving the problem)
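
The SOP stages above can be read as a small state machine. The following is a minimal sketch, not the speaker's incident bot: the class and field names (`Incident`, `Stage`, `advance`, `rca`) are hypothetical, the fallback from verification to investigation when the regression test fails is an assumption, and the notes do not say which end of the 1 to 4 severity scale is the most severe.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    INITIAL = "initial incident"
    INVESTIGATING = "investigation and fixing"
    VERIFICATION = "verification"
    CLOSED = "close incident"
    POSTMORTEM = "postmortem"


# Allowed transitions, following the order of the SOP in the notes.
# Assumption: a failed QA regression test sends the incident back to
# investigation (the notes only describe the "if OK" branch).
TRANSITIONS = {
    Stage.INITIAL: {Stage.INVESTIGATING},
    Stage.INVESTIGATING: {Stage.VERIFICATION},
    Stage.VERIFICATION: {Stage.CLOSED, Stage.INVESTIGATING},
    Stage.CLOSED: {Stage.POSTMORTEM},
    Stage.POSTMORTEM: set(),
}

# Key points the notes require in every RCA.
RCA_KEY_POINTS = ("timeline", "impact", "findings", "root_cause")


@dataclass
class Incident:
    title: str
    severity: int  # level 1 to 4, as in the notes
    stage: Stage = Stage.INITIAL
    rca: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.severity not in (1, 2, 3, 4):
            raise ValueError("severity must be between 1 and 4")

    def missing_rca_points(self) -> list:
        """Return the RCA key points that are absent or empty."""
        return [key for key in RCA_KEY_POINTS if not self.rca.get(key)]

    def advance(self, target: Stage) -> None:
        """Move to the next stage, refusing steps the SOP does not allow."""
        if target not in TRANSITIONS[self.stage]:
            raise ValueError(
                f"cannot move from '{self.stage.value}' to '{target.value}'"
            )
        # Hypothetical gate: no postmortem until the RCA is complete.
        if target is Stage.POSTMORTEM and self.missing_rca_points():
            raise ValueError(f"RCA is missing: {self.missing_rca_points()}")
        self.stage = target


if __name__ == "__main__":
    incident = Incident(title="HTTP 502 on the website", severity=1)
    for stage in (Stage.INVESTIGATING, Stage.VERIFICATION, Stage.CLOSED):
        incident.advance(stage)
    print(incident.stage.value, incident.missing_rca_points())
```

Encoding the transitions as data keeps the SOP visible in one place, so changing the process means editing the table rather than the control flow.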

Take away

  1. Incident management SOP
  2. Incident SOP
  3. Customer Communication (a dedicated external-facing contact)
  4. Postmortem

Incident Middleware

  • Alert Manager
  • Slack Interactivity Webhook
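
The notes do not describe how the middleware connects these two pieces, so the following is only a sketch of one plausible shape. It assumes the middleware receives an Alertmanager webhook payload, turns it into a Slack Block Kit message with a button, and later verifies the signature on the interactivity request that Slack sends when the button is clicked. The function names, the `declare_incident` action id, and the button label are hypothetical.

```python
import hashlib
import hmac


def alert_to_slack_message(payload: dict) -> dict:
    """Build a Slack Block Kit message from an Alertmanager webhook payload."""
    labels = payload.get("commonLabels", {})
    annotations = payload.get("commonAnnotations", {})
    name = labels.get("alertname", "unknown alert")
    summary = annotations.get("summary", "(no summary)")
    status = payload.get("status", "unknown")
    count = len(payload.get("alerts", []))
    return {
        # Plain-text fallback shown in notifications.
        "text": f"[{status}] {name}: {summary}",
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*{name}* is {status} ({count} alert(s))\n{summary}",
                },
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Declare incident"},
                        "action_id": "declare_incident",  # hypothetical id
                        "value": payload.get("groupKey", ""),
                    }
                ],
            },
        ],
    }


def verify_slack_signature(
    signing_secret: str, timestamp: str, body: bytes, signature: str
) -> bool:
    """Check the X-Slack-Signature header of an interactivity request.

    Slack signs the string "v0:<timestamp>:<raw body>" with HMAC-SHA256
    using the app's signing secret and sends "v0=<hex digest>".
    """
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return hmac.compare_digest("v0=" + digest, signature)


if __name__ == "__main__":
    sample = {
        "status": "firing",
        "groupKey": "example-group",
        "commonLabels": {"alertname": "Http502Rate"},
        "commonAnnotations": {"summary": "502 responses above threshold"},
        "alerts": [{"status": "firing"}],
    }
    print(alert_to_slack_message(sample)["text"])
```

A production handler would also reject requests whose timestamp is too old, to limit replay of captured requests.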

What's next

  • Harden Incident Middleware
  • Incident Reporting
  • Incident Management Workshop
tags: DevOpsDays Taipei 2024