Intelligent SRE Service - Issue Diagnosis System - 徐永吉(Eric Hsu)

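# Intelligent SRE Service - Issue Diagnosis System - 徐永吉(Eric Hsu)

{%hackmd @HWDC/BJOE4qInR %}

>#### 》[Session introduction](https://hwdc.ithome.com.tw/2024/session-page/3325)
>#### 》[Fill out the session satisfaction survey | Give feedback to the hard-working speaker](https://forms.gle/17cKgMqRsRjk2wAy7)

# Agenda

Why is SRE important to 17LIVE?

User experience = Stability and Trustworthiness
* A service should perform very stably on the platform

Difficulties SRE faced in 17LIVE
* Diverse peak times
    * Events from different regions come in at night, roughly 8 p.m. to 1 a.m.
    * COVID-19 changed user behavior: working from home (WFH) added a wave of traffic at noon
* Special events
    * A sudden burst of 200,000 visits cannot be predicted, so the system must be prescaled in advance; auto scaling cannot cover it
    * Not just celebrities: any streamer holding an event can take the system down. The prerequisite is understanding users. Knowing in advance whether a customer is holding an event is essential so the system can be scaled ahead of time, which makes communication especially important
    * Users keep tapping when they see a fun image, and QPS keeps spiking
    * Real-time streaming became CDN-based streaming

# Why is SRE important to 17LIVE?

Experience from the dark period: workaround and rollback cure everything
* Restarting cures everything
* Following blameless -> conflicts with responsibility
* Are we dropping a few sesame seeds while eating the flatbread, or are we left holding only the sesame seeds with the flatbread on the floor?
* In 80% of cases the traffic is abnormal, not normal, and should be dealt with
    * Abnormal calls from our own services, or traffic caused by hacker attacks
* SLAs are averages, yet users still complain, and after service recovers they may not come back

## The culture of 17LIVE SRE

* However you look at it, live streaming is not a basic human need
* SRE does not know what is happening to users
* No complete retrospective
* OKRs alignment
* Sync between users and the SRE/development teams. Even when the number shows 99.99%, a bad user experience is actually a very serious problem.

**Opposition of blameless**
- How can we quickly roll back when having crash issues?
- How can the scalability meet the QPS? The legitimacy of the traffic should be checked while QPS is rising
- How can the newest technology be integrated into our services?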
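- How can we release quickly to meet users' needs? This is a double-edged sword: users may keep suffering

# The evolution of 17LIVE SRE

- Connection
    - Critical path protection
        - First define what the critical path is
        - When an incident happens, make sure the critical path stays protected
    - Customer understanding
- Process
    - Efficient CI/CD
    - Complete SOP
        - With a good SOP, we can be confident that service will be back to normal in XXX minutes. That gives users a reason to stay on the system, or to "wait" for it to recover. (This lowers revenue loss when an incident happens.)
    - Smart incident prediction
        - If recovery is always the goal, how can we recover without users noticing?
        - Fix things before users notice
- Culture
    - Responsibility
        - How to tell everyone that blameless and responsibility are different
    - Requirement
    - Expectation
    - Transparency
        - Covered up by misunderstanding: SRE works to resolve the problem quickly, but people outside do not know the progress or status, and neither do users
        - Manage user expectations: communicate how long recovery will take
- Vision
    - Prevention
    - Prediction
    - Perception
        - Perception goes beyond prediction: fix the issue one step earlier, before users notice anything at all
        - Investigate before users feel it

# Connection - From Chaotic to Stabilized

- Communication mainly went through Slack: "Hey! system down", "Ok, checking..."
- keep checking, checking...
- The extent of the outage could not be determined
- Users reported that the system was down, but in fact gifts could not be received inside the live room (the system was not down)
- The critical path needs to be reviewed; it may differ from what users imagine
    - Verification matters!!!
    - Users can enter a live room, send gifts, and chat; nothing else matters
    - Starting a stream is "relatively" unimportant when something breaks during peak hours
    - Find the features that matter most to your own company's business
- Engineers do not normally use their own system; developers logged in and found themselves being scolded nonstop in the chat room
- Letting users know what is happening is important; it speeds up feedback

# Process - Efficient CI/CD

- The order from management is "it must not go down", but "must not go down" is not clearly defined
- Run load tests on PreProd; load testing is very expensive, so do it only when needed
- Introduced SonarCloud: static analysis runs on every code push

# Process : Complete Incident SOP

- No more "checking": create a new SOP system to report the current status
    - What are you doing, and how long will it take?
- Don't break the stakeholders' expectations
    - 15 min, 30 min, and so on: users keep waiting and are disappointed again and again
    - "checking, checking" turned into "it will be fixed in 15 minutes", "it will be fixed in 15 minutes" => it would be better to just say 6 hours, so we can post an announcement right away. By the second rollback, the situation has to be stated plainly.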
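    - A PM is needed to make the hard cut-off decisions, including how long recovery will take

## Smart issue reporting service

- Three teams worked on the same issue and pushed conflicting fixes at the same time
- Because everyone was tagged in Slack and the priority was set to the highest
- AI assigns issues based on historical records and replies to stakeholders

## Process : Smart Incident Prediction Service

- Under high QPS, predict and block traffic in advance, somewhat like making the system "partially available"

## Culture: Promise to Our Customers

- From blamelessness to promise: we need to know what stakeholders care about
- Incidents will always happen; the question is how to reduce the damage to the business
- Transparency is very important, because users hate seeing "checking"
- Blamelessness = Stakeholders' Expectations + Transparency + Users' Needs + Responsibility
    - So we can promise
- Engineers are always responsible, because in the end engineers are the ones who solve the problem

## Vision: From Passive to Proactive

- Prevention
    - BigQuery
- Prediction
    - Google Vertex AI
    - Smart Incident Reporting (JIRA Reference)
    - Potential Incident Detection
    - Partial Impact Controller
- Perception
    - Google Gemini
    - Customer Perception
    - Feedback
    - Service Info
- The system protects us (the developers)
- Move from nodes (critical bottlenecks) that collapse entirely to partial impact

# What's next for 17LIVE SRE?

17LIVE Stability Diagnosis Service

Borrows the four diagnostic methods of traditional Chinese medicine, 望聞問切 (inspection, listening, inquiry, palpation):
- 望 Inspection Engine
    - Collect and input data from user feedback channels
- 聞 Listening Engine (observation)
    - Understand users' issues/needs
    - Use Gemini to go straight into live rooms and watch. Streamers never speak in complete sentences, because they are talking to N people at once and everything comes out in fragments. This was not feasible before but is now, thanks to progress in AI tools, so Gemini is used to summarize the information
    - Include SNS and social channels
        - If someone complains on LINE or in a live stream, the system will know
    - Feed in comments, audio, and video all together
- 問 Inquiring Engine (investigation)
    - Actively recognize system issues
    - Include recent product releases (changes), which are the most likely to be related
- 切 Palpation Engine (treatment)
    - Auto-alert
    - Make decisions
    - Deep dive into issues
    - Critical path recovery
    - Communicate in a way that is less likely to anger users ("My device isn't broken, grr!")

# Main contributions of 17LIVE SRE evolution

- Users
    - They find that what they say in live rooms is heard and acted on
    - Increased trust
    - Less opposition to and distrust of engineers
    - Increased loyalty
- Stakeholders
    - Improved transparency
- Engineers
    - Lower budget
    - Fewer bugs
- Company
    - Budget
    - Revenue
    - Customer-oriented development

# From Data Driven to Data-informed SRE Services

- Awareness
- Insight
- Vision

==Chat area below==

Does 17LIVE still do live streaming now?
> Yes, in Taiwan, Japan, and other countries.
>> We rarely see 17LIVE streamers in the news these days.

Incidents will always happen.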
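
**Sketch: partial availability under high QPS**

The "partially available" idea from the Smart Incident Prediction Service and Partial Impact Controller notes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not 17LIVE's implementation: the class name, the path names (`enter_room`, `send_gift`, `chat`, `start_stream`), and both limits are hypothetical. The only parts taken from the talk are that entering a room, gifting, and chatting are the critical path, and that non-critical traffic is blocked early instead of letting everything fail.

```python
from dataclasses import dataclass, field

# Hypothetical path names. The notes only say that entering a room, gifting,
# and chatting matter most during an incident.
CRITICAL_PATHS = frozenset({"enter_room", "send_gift", "chat"})


@dataclass
class PartialImpactController:
    """Sheds non-critical traffic first so the service degrades partially."""

    soft_limit: int  # assumed per-second budget before non-critical paths are shed
    hard_limit: int  # assumed per-second ceiling for all traffic
    critical_paths: frozenset = CRITICAL_PATHS
    _window: int = field(default=-1, init=False)
    _seen: int = field(default=0, init=False)

    def allow(self, path: str, now: float) -> bool:
        """Return True if a request for `path` arriving at `now` (seconds) is served."""
        window = int(now)
        if window != self._window:  # a new one-second window starts
            self._window, self._seen = window, 0
        self._seen += 1  # every arrival counts, including ones that get rejected
        if self._seen <= self.soft_limit:
            return True  # normal load: serve everything
        if self._seen <= self.hard_limit:
            return path in self.critical_paths  # high load: critical paths only
        return False  # beyond the ceiling: shed everything
```

With `soft_limit=2` and `hard_limit=4`, the third request within the same second is served only if it targets a critical path, so a hypothetical `start_stream` call is rejected while `send_gift` still goes through. A fixed one-second counter keeps the sketch short; a production limiter would more likely use a sliding window or token bucket.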
