How to Build a Healthy On-Call Culture

Lead note-taker: Brian (16:00)

###### tags: `SRE2022`

---

**Brian**

## agenda

- monitor
- organization type (practices differ by team size)
- on call
- incident response
- cause analysis

## monitor

why it is needed
- without it, you don't know the system's health
- no notification when something bad happens
- no way to investigate an issue after an incident
- no monitoring -> no measurement -> no quality

common types
- end-to-end
- functionality
- performance
- error tracking => network => ?

external & internal monitoring
- external checks are a must, to avoid blind spots
- monitoring as code
    - internal: Argo CD + Prometheus
    - external: Terraform + Datadog, Pingdom

## org size

MaiCoin: under 100 people

org geo distribution (use geographic spread to cover the service 24x7 without special shift scheduling)

traditional => dev | operations

now: SRE + EM + SE => DevOps

MaiCoin: SRE lead
- SRE + EM + SE
- SRE + EM + SE

(the advantage of this model is a shared SRE tech stack)

### on call duty

- routine ops jobs
- handle incidents
- refine / write runbooks
- weekly on-call report

The one who tanks the hits -> mr.yoga

> Argh

### oncall model

error
- -> (application) -> SE -> SE -> EM
- -> (infra) -> SRE -> SRE -> EM

Two tracks: because configuration is fully automated, initial detection can already tell whether the problem is in the application or in the infra, and the matching on-call responder is assigned based on that analysis.

## notification problem

1. get notified
2. arrange the job
3. ?

### notify sys arch

(diagram)

### notify sys - rotation

- on-call model
- rotation frequency
- failure to answer a page

Integrated with the Google Calendar API: error -> primary contact on the calendar event -> secondary contact on the calendar event

### notify sys - communication

- Slack user groups bridge the communication gap
- check document -> check calendar -> tag group -> ?

### notify sys - effectiveness

1. contact phone book -> Slack command w/ user group
2. short phone call notice -> phone call w/ validation
3. notification visibility -> detailed status in Slack

## incident response

> be aware of the incident before the customer

(It is broken either way, but) telling users first makes their experience much better.

### incident handle terminology

roles
- incident commander (main caller)
- tech lead
- communication lead (external communication)
- engineering manager

incident level (severity): define the scope of user impact for each level
- S3
- S2
- S1
- S0 (most severe)

### incident handle process

(diagram) incident starts -> decide the incident level -> decide whether to turn on the maintenance page -> arrange the external statement -> resolve, identify the main cause, confirm the fix

### incident handle - internal

(diagram)

### incident handle - external

- service health page
- mobile app push notification
- ?

## root cause analysis (RCA)

The point is not to find someone to blame -> it is to keep the incident from happening again.

### how to RCA

- incident timeline
- find the root cause
- follow-up / corrective action

### blameless culture

> Example: "Today we focus on the issue, not the person, but every problem was caused by yoga."

---

## Chris

### Monitoring

monitoring -> measuring data -> analysis

#### External monitoring

- DNS
- Certificate
- Global latency
- CDN

### Organization architecture

Every team has an SRE.
Each team's SRE is paired with an Engineering Manager.
The on-call model is chosen according to scale.

### On call

- Routine jobs
- Handle incidents
- Improve SOPs
- Weekly report

#### Notification system

A monitoring-system event triggers a webhook -> a Lambda forwards the event to Google Calendar, Slack, or Twilio -> the on-call person is notified

### incident response

Core idea: know the system is broken before the users do

- incident level
- handling process
- visibility - internal
- visibility - external

### cause analysis

- timeline
- root cause
- improving action

---
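
The two-track on-call model in Brian's notes (error -> application -> SE -> SE -> EM, or infra -> SRE -> SRE -> EM) can be sketched in a few lines of Python. This is only an illustration, not the speaker's implementation: the responder labels, the `classify` rule and its `infra_sources` set, and the `acknowledges` callback are all hypothetical names introduced here. In the setup the notes describe, the paging step would go through Slack or a phone call instead of a local callback.

```python
from typing import Callable, Optional

# Escalation order per track, following the chains in the notes.
# The labels are placeholders (assumptions), not real contacts.
ESCALATION = {
    "application": ["se-primary", "se-secondary", "em"],
    "infra": ["sre-primary", "sre-secondary", "em"],
}


def classify(alert_source: str) -> str:
    """Pick a track from the alert's source label.

    Hypothetical rule: the set below is an assumption for illustration;
    the notes only say that initial detection separates the two tracks.
    """
    infra_sources = {"node", "network", "database", "kubernetes"}
    return "infra" if alert_source in infra_sources else "application"


def page(track: str, acknowledges: Callable[[str], bool]) -> Optional[str]:
    """Page responders in order and return the first one who acknowledges.

    Returns None when nobody on the chain answers the page.
    """
    for responder in ESCALATION[track]:
        if acknowledges(responder):
            return responder
    return None


if __name__ == "__main__":
    track = classify("network")
    # Simulate the primary SRE missing the page and the secondary answering.
    owner = page(track, lambda responder: responder == "sre-secondary")
    print(track, owner)
```

Keeping the chains in a plain dictionary means the rotation can be changed without touching the paging logic, which fits the "monitoring as code" approach mentioned in the notes.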