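---
title: Talking About DevOps & Observability
tags: Sinyi, Observability, Monitoring, Log
description: Retrospect and Prospect 2022
slideOptions:
  controls: true
  allottedMinutes: 5
---

# Talking About DevOps & Observability

> Nathan Lu

<div style="text-align: right"> 02/23/2022 </div>

---

# Outline
- Operational Blindness
- Monitoring
- Logging
- Metrics
- Observability

---

# Preface
This talk mainly covers the current situation inside the company.
It stands on the right-hand side of DevOps and looks at things from an operations perspective.
The Observability part is only an introduction.

---

## Operational Blindness
![](https://i.imgur.com/1dnit6x.jpg =600x400)

----

### War Story
![](https://i.imgur.com/POdkXcJ.png)

Note:
One day at noon, alert notifications erupted in the operations team and messages flooded in. Everyone jumped up and found that the website was down: the last several health checks from external monitoring had failed, which triggered the alerts.

----

![](https://i.imgur.com/k8DCN9x.png =700x500)

Note:
The website turned out to be down.

----

### Black-Box Monitoring
![](https://i.imgur.com/CknOiks.png)

Note:
After logging in to the host monitoring system, the website's system metrics looked normal: memory was still available, CPU was not at 100%, disk and I/O were within acceptable ranges, and the database was alive.

----

![](https://i.imgur.com/G82g0nD.png)

Note:
With no known cause, the issue had to be escalated to the development team, because the hosts and the web service processes themselves looked fine.

----

![](https://i.imgur.com/Vm783kK.png =600x400)

Note:
The development team was not sure what to look at either, and they had no permission to log in to production hosts directly. Not everyone is given host access, since that carries high management cost and high security risk.

----

![](https://i.imgur.com/uep3689.png)

Note:
The development team had to work alongside the operations team and could only give instructions.
Eventually, on the database, they found a pile of slow queries and many connections in the CLOSE_WAIT state.
The web service showed almost the same number of processes.
The likely explanation was that the web service had reached its maximum processing capacity and stopped responding to new requests (such as the health check).
Finally, all DB query sessions were killed and the request queue on the web service was cleared, and service returned to normal.

----

![](https://i.imgur.com/wI4sTlx.png)

Note:
Finding the root cause wasted a lot of time. When the issue was escalated, the people pulled in did not have enough permission to view the host and service information useful for troubleshooting, so in the end everything still depended on the operations team. In this situation, every minute of unavailability or downtime costs money.
The operations team could only escalate because almost all they could see was host-level monitoring metrics.

----

### Question
- Separating the Dev and Ops teams: this way of organizing and communicating incurs indirect costs

Note:
In the traditional structure, a clear red line divides the responsibilities of development and operations. The operations team is responsible for hosts and infrastructure, which makes application delivery possible.

----

![](https://i.imgur.com/X3kUYGy.png)

Note:
If the organizational structure and collaboration model split Developer and Operations responsibilities too far apart, the two sides often end up in a tug-of-war that dissipates effort. While they pull against each other, recovery time slips away.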
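
----

![](https://i.imgur.com/EW4950D.png)

Note:
Worst of all, users lose confidence in the platform's service quality and move to a competitor.

----

## DevOps
![](https://i.imgur.com/B4a94yh.png)
###### [What is SRE | Tasks and Responsibilities of an SRE | SRE vs DevOps](https://youtu.be/OnK4IKgLl24)

Note:
Developers and Operations should not be pulling against each other in a tug-of-war that dissipates everyone's effort.

----

![](https://i.imgur.com/mQNibxr.png)

----

![](https://i.imgur.com/IuNaQsC.png)
###### [非同步系統的服務水準保證 淺談非同步系統的 SLO 設計-91APP](https://www.91app.tech/static/97979df17dd25408c24c20ae74e27155/SLO-design-of-the-asynchronous-system-Andrew.pdf)

Note:
SLI (service level indicator) refers to indicators, for example QPS, TPS, duration, accuracy, latency, and performance.
Not every metric counts as an SLI. Choose as few SLIs as possible, as long as they can show whether the service is stable and reliable.

SLO (service level objective) refers to a target over a period of time, for example QPS at 99.99% over one month, response time < 10 ms, and so on.
An SLO is a range of values, where the value is the service level measured by an SLI. A natural SLO definition is therefore that some SLI, under normal conditions, must stay below a certain value or within a certain range.
Choosing a suitable SLO is not easy, and the range does not have to be fixed at the very start.

SLA (service level agreement) refers to the whole agreement, whose contents include the SLIs, the SLOs, and the recovery method and time, among other items.

----

- Time-based availability
> Time-based availability = uptime / (uptime + downtime)
> SLO 99.95%: over one year, about 4 hours 22 minutes of unavailability
> SLO 99.99%: over one year, about 52 minutes of unavailability

----

- Count-based (aggregate) availability
> Aggregate availability = successful requests / total requests
> SLO 99.99%: with 2.5M requests per day, errors should not exceed 250 per day

###### [事件處理案例](https://study-area.sre.tw/Incidents/)

----

- Latency-based
> SLO 99%: front-end user access latency per second < 300 ms

###### [[架構師的修練] #2, SLO - 如何確保服務水準?](https://columns.chicken-house.net/2021/06/04/slo/)

---

## Monitoring
![](https://i.imgur.com/6rgwWMz.png)

----

## Types of Monitoring
![](https://i.imgur.com/YUJ8QG7.png)
###### [Multi-Cloud Monitoring](https://www.meshcloud.io/2020/08/28/multi-cloud-monitoring-a-cloud-security-essential/)

----

## Monitoring Layers
![](https://i.imgur.com/vImhE4B.png =500x450)

---

## Logging
Deal with discrete events
- Application debug or error messages
- Audit-trail events
- Request-specific metadata
- Specific events

Note:
Application error messages, audit events, HTTP request events.

----

## Log Monitoring
![](https://i.imgur.com/J9PjBD0.png)
- Troubleshooting
- Monitor

----

- Clear log levels
- Good log messages ([Structured Log](https://ithelp.ithome.com.tw/articles/10277678))
- Log aggregation

Note:
Show Loki.
Single-line logs in JSON format, with a fixed set of metadata keys defined for each type of service.

----

## Drawbacks of Log Monitoring
- Low value density
```
[INFO] .... initializing...
[INFO] ... request from xxx.xxx.xxx.xxx user_id xxxxxx
```

Note:
The density of valuable information is too low. Much of it is worthless, like the first line above.

----

This video covers many of the benefits of log centralization and monitoring, and how to make logs express things clearly.
[Logging in the age of Microservices and the Cloud](https://www.youtube.com/watch?v=zdfhgtcm4uk)

----

- Writing logs into an RDBMS
    - When log writes are highly concurrent and the target database is an RDBMS, the concurrent transaction volume it can sustain is not high
    - Enlarging the buffer pool has its limits too, and is not recommended
    - Logs have many kinds of fields, so choosing indexes is hard
    - If the DB really goes down, you cannot log in to read the logs, and they may not even have been written
    - Logging can easily become the last straw that crushes the DB
    - Let the DB go back to storing and serving business state and data

---

## Metrics
4 Golden Signals
- Latency: time to service a request
- Traffic: requests/second
- Errors: error rate of requests
- Saturation: fullness of a service

###### [SRE-book](https://sre.google/sre-book/monitoring-distributed-systems/)
###### [Observability: Metric, Logging, and Tracing, Oh My!](https://www.youtube.com/watch?v=ZVKrN1RLetI)

Note:
Latency reflects user experience and measures the system's core performance, e.g. system processing time, or job completion time in a job-computing system.
Traffic reflects the system's service volume, e.g. number of requests, or the size of network packets sent and received.
Errors help discover and locate failures and problems, e.g. total error count, or the failure rate of service calls.
Saturation reflects how full and how loaded the system is, e.g. memory used by the system, or the length of the job queue.

----

- [Metrics](https://prometheus.io/docs/concepts/data_model/#notation) for Prometheus
    - metric name and label sets
```
<metric name>{<label name>=<label value>, ...}
```
```
# TYPE http_requests_total counter
|--------------------------- Metric ----------------------------|-timestamp -|-value-|
|--- metric name --|------------------ labelsets ---------------|
http_requests_total{code="200",handler="prometheus",method="get"} 721
```

Note:
A label set represents one dimension under the metric name. A metric can have several dimensions, which makes aggregation easier.

----

## Metric Types
- Counters (rate)
- Gauges (value)
- Distribution
    - Histogram (heatmap)
    - Summary

###### [Observability of Distributed Systems](https://www.youtube.com/watch?v=SoZZzB-yTOk&list=WL&index=115&t=77s)

Note:
A counter only increases and never decreases. It is typically used for the total number of requests, the number of completed tasks, or the number of errors, or to compute the rate of change over a period of time. Stress and load testing beforehand can establish an expected upper bound to use for monitoring and alerting. It can also answer queries such as the top 10 most-visited HTTP URLs in the current system.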
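
As a rough illustration of the counter rate calculation described above, the sketch below derives a per-second rate from counter samples in plain Python. The function name `counter_rate`, the `(timestamp, value)` sample format, and the sample numbers are all assumptions made for illustration. This is not how Prometheus implements `rate()`; it only shows the general idea of dividing the total increase by the elapsed time while treating a drop in value as a counter reset.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter over the sampled window.

    A drop between two consecutive samples is treated as a counter reset
    (for example, a process restart), so the new value counts as the increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed


# Hypothetical http_requests_total samples taken every 15 seconds;
# the drop from 760 to 10 simulates a restart of the instrumented process.
samples = [(0.0, 700.0), (15.0, 721.0), (30.0, 760.0), (45.0, 10.0), (60.0, 40.0)]
print(counter_rate(samples))
```

Because the raw counter value keeps growing, dashboards and alerts usually work with a derived rate like this one instead of the counter itself.
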

A gauge reflects a value that changes in real time, moving up and down over time. It is typically used to record CPU and memory usage, coroutine count, pool usage, number of concurrent requests, and so on. Fitting a linear regression model to the samples makes it possible to predict the trend of the data.

A histogram samples observations and places them into buckets with numeric upper bounds, recording the count in each bucket, the total count of observations, and the sum of observed values. It suits request latency and other sampled data. It helps tell whether slowness is in the average or in the long tail, and gives a quick view of how the monitored samples are distributed.

A summary is similar to a histogram: it samples observations and records their count and sum. In addition, it takes a sliding window and computes quantiles of the samples within that window.

----

## Architecture
![](https://i.imgur.com/MHtKRgQ.png =x550)

---

## Observability
![](https://i.imgur.com/Z0x2oFn.jpg)
###### [The Observability Pipeline](https://www.slideshare.net/TylerTreat/the-observability-pipeline)

Note:
Monitoring tells you whether the system works; observability lets you ask why it is not working.

----

![](https://i.imgur.com/dNaiuoB.png)

----

## Pillars of Observability
![](https://i.imgur.com/kf6Xd1i.jpg)

---

# News
[【企業SRE實例:新加坡星展集團】頂尖數位銀行如何再進化,SRE轉型是變身科技公司的關鍵](https://www.ithome.com.tw/news/144120)
[【臺灣SRE實例:17Live集團】多功能型SRE化身內部信心來源,天天成為開發團隊後盾](https://www.ithome.com.tw/news/144122)
[【臺灣SRE實例:Line臺灣】如何確保Line服務天天不中斷,專責SRE扮演開發與維運的橋樑](https://www.ithome.com.tw/news/144121)
[Line臺灣百億筆遙測數據的可觀察性平臺架構大公開](https://www.ithome.com.tw/news/149317)
[臺灣大型企業如何上手SRE,Google建議先做這4件事](https://www.ithome.com.tw/news/144119)

---

## Reference
[SRE-BOOK](https://sre.google/sre-book/table-of-contents/)
[Operations Anti-Patterns, DevOps Solutions](https://books.google.com.tw/books?id=g3kFEAAAQBAJ&dq=Operations+Anti-Patterns,+DevOps+Solutions&hl=zh-TW&source=gbs_navlinks_s)
[Logging and Log Management](https://books.google.com.tw/books?id=Rf8M_X_YTUoC&dq=logging+and+log+management&hl=zh-TW&source=gbs_navlinks_s)
[阿里雲-日誌服務](https://help.aliyun.com/document_detail/48869.html)
[Grafana Documentation](https://grafana.com/docs/grafana/latest/)
[Prometheus](https://prometheus.io/docs/prometheus/latest/getting_started/)
[Loki Documentation](https://grafana.com/docs/loki/latest/)
[FluentBit Documentation](https://docs.fluentbit.io/manual/)

---

### Thank you!
You can find me on
![](https://member.ithome.com.tw/avatars/120846?s=ithelp)
- [Blog](https://tedmax100.github.io/)
- [IT邦](https://ithelp.ithome.com.tw/users/20104930/ironman)
