Hello World Dev Conference
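# Intelligent SRE Service - Issue Diagnosis System - 徐永吉 (Eric Hsu)

{%hackmd @HWDC/BJOE4qInR %}

>#### 》[Session introduction](https://hwdc.ithome.com.tw/2024/session-page/3325)
>#### 》[Fill in the session satisfaction survey | Send feedback to the hard-working speaker](https://forms.gle/17cKgMqRsRjk2wAy7)

# Agenda

Why is SRE important to 17LIVE?

User experience = Stability and Trustworthiness
* The service should perform very stably on the platform.

Difficulties SRE faced in 17LIVE
* Diverse peak times
  * Events from each region come in at night, roughly from 8 p.m. to 1 a.m.
  * COVID-19 changed user behavior: working from home (WFH) added a wave of traffic at midday as well.
* Special events cannot be predicted.
  * Traffic can reach 200,000 visits in an instant, so capacity must be prescaled in advance; dynamic scaling (auto scale) does not work here. It is not only celebrities: any streamer who hosts an event can take the system down. The prerequisite is understanding the users.
  * Knowing in advance whether customers are holding an event is important so that the system can be scaled ahead of time. Communication becomes especially important.
* When users see a fun image they keep tapping it, and QPS keeps shooting up.
* Real-time streaming became CDN-based streaming.

# Why is SRE important to 17LIVE?

Experience from the dark period: workaround and rollback were the cure for everything.
* Restarting cures everything.
* Following blameless -> conflicts with responsibility.
* Are we dropping a few sesame seeds while eating the flatbread, or are we left holding only sesame seeds while the flatbread is on the floor?
* In 80% of cases the traffic is abnormal rather than normal, and it should be dealt with.
* The traffic comes from abnormal calls by our own services, or from hacker attacks.
* SLAs are averages, but users still complain, and they may not come back after the service recovers.

## The culture of 17LIVE SRE

* No matter what, live streaming will never be a basic human need.
* SRE does not know what is happening to users.
* There is no complete retrospective.
* OKRs alignment
* Sync between users and the SRE/development teams. Even if we see 99.99%, a poor user experience is actually a very serious matter.

**Opposition of blameless**
- How can we quickly roll back when having crash issues?
- How can the scalability meet the QPS? The legitimacy of the traffic should be checked while QPS is rising.
- How can the newest technology be integrated into our services?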
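- How can we release quickly to meet users' needs? It is a double-edged sword: users may keep suffering.

# The evolution of 17LIVE SRE

- Connection
  - Critical path protection
    - First define what the critical path is.
    - When an incident occurs, make sure the critical path can be protected.
  - Customer understanding
- Process
  - Efficient CI/CD
  - Complete SOP
    - With a good SOP, we can say with confidence that service will return to normal in XXX minutes. That gives users a chance to stay on the system, or to "wait" for it to return to normal (which reduces revenue loss during an incident).
  - Smart incident prediction
    - If we always aim for recovery, how can we recover without users noticing?
    - Fix it before users notice.
- Culture
  - Responsibility
    - How to tell everyone that blameless and responsibility are different things.
  - Requirement
  - Expectation
  - Transparency
    - Covered up by misunderstanding: SRE works to resolve the problem quickly, but people outside do not know the progress or status, and users do not know the situation.
    - Manage user expectations: communicate how long recovery will take.
- Vision
  - Prevention
  - Prediction (predictive management)
  - Perception
    - Perception goes beyond prediction: fix things one step earlier, while users have not noticed anything at all.
    - Investigate before users feel it.

# Connection - From Chaotic to Stabilized

- Mainly through Slack: "Hey! system down", "Ok, checking..."
- Keep checking, checking...
- Unable to determine how badly the system was broken.
- Users reported that the system was down, but in fact gifts could not be received in the live room (the system was not down).
- The critical path needs to be reviewed; it may differ from what users have in mind.
- Verification is very important!!!
- Users can enter a live room, send gifts, and chat; nothing else matters.
- When something goes wrong during peak hours, starting a stream is 'relatively' unimportant.
- Find the features most important to your own company's business.
- Engineers do not normally use their own system; developers logged in and found themselves being scolded nonstop in the chat room.
- Letting users know what is happening is important; feedback gets faster.

# Process - Efficient CI/CD

- The order from management is that the service must not go down, but the definition of "must not go down" is unclear.
- Use PreProd for traffic testing. Traffic tests burn a lot of money, so run them only when needed.
- Introduced SonarCloud: static analysis on every code push.

# Process : Complete Incident SOP

- No more 'checking'; create a new SOP system to report the current status.
- What are you doing, and how long will it take?
- Don't break the stakeholders' expectations.
- 15 min, 30 min, and so on: the user keeps waiting and is disappointed again and again.
- Going from "checking, checking" to "it will be fixed in 15 minutes", "it will be fixed in 15 minutes" => you might as well just tell me 6 hours, so that we can post an announcement right away. On the second rollback, explain it directly.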
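- A PM is needed to make the hard trade-off decisions, including how much time is needed.

## Smart issue reporting service

- Three teams work on the same issue and push fixes at the same time, which then conflict.
- Because everyone gets tagged in Slack and the priority is set to the highest.
- AI assigns the issue based on historical records and replies to stakeholders.

## Process : Smart Incident Prediction Service

- Block early at high QPS and predict; it is somewhat like making the system 'partially available'.

## Culture: Promise to Our Customers

- Blamelessness to Promise: we need to know what stakeholders care about.
- Incidents will always happen, but how do we reduce the harm to the business?
- Transparency is very important, because users hate seeing "checking".
- Blamelessness = Stakeholders' Expectation + Transparency + Users' Needs + Responsibility
- So we can promise
- Engineers are always responsible, because in the end it is engineers who solve the problem.

## Vision: From Passive to Proactive

- Prevention
  - BigQuery
- Prediction
  - Google Vertex AI
  - Smart Incident Reporting (JIRA Reference)
  - Potential Incident Detection
  - Partial Impact Controller
- Perception
  - Google Gemini
  - Customer Perception
  - Feedback
  - Service Info
- The system protects us (the developers).
- Change from nodes that cause total collapse (critical bottlenecks) to partial degradation (partial impact).

# What's next for 17LIVE SRE?

17LIVE Stability Diagnosis Service

Borrowing the four diagnostic methods of traditional Chinese medicine (望聞問切): inspection, listening, inquiry, and palpation.

- Inspection (望) Engine
  - Collect and input data from user feedback channels
- Listening (聞) Engine (observation)
  - Understand users' issues and needs
  - Use Gemini directly to watch the live rooms. Streamers never speak in complete, coherent sentences, because a streamer talks to N people at once and everything comes out in fragments. This was not possible before, but it is now thanks to progress in AI tools, so the information is summarized through Gemini.
  - Include SNS and social channels
  - When someone complains on LINE or in a live stream, the system knows.
  - Feed in everything: comments, audio, and video.
- Inquiring (問) Engine (investigation)
  - Actively recognize system issues
  - Including recent product releases (changes), which are the most likely to be related
- Palpation (切) Engine (treatment)
  - Auto-alert
  - Make decisions
  - Deep dive into issues
  - Critical paths recovery
  - Communicate in a way that makes users less likely to get angry ("My device isn't broken. So angry!")

# Main contributions of 17LIVE SRE evolution

- Users
  - They find that what they say in live rooms is heard and acted on
  - Increased trust
  - Less opposition to and distrust of engineers
  - Increased loyalty
- Stakeholders
  - Improved transparency
- Engineers
  - Lower budget
  - Fewer bugs
- Company
  - Budget
  - Revenue
  - Customer-oriented development

# From Data-Driven to Data-Informed SRE Services

- Awareness
- Insight
- Vision

==Chat area below==

Is 17live still doing live streaming?
> Yes, in Taiwan, Japan, and other countries.
>> You rarely see 17live streamers in the news these days.

Incidents will always happen.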
