From Observability to Observability Driven Development - 董淳吉 (Marcus Tung)

Welcome to the DevOpsDays Taipei 2024 shared notes

Shared notes index: https://hackmd.io/@DevOpsDay/2024
On mobile, tap the button at the top to expand the session list.

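Session introduction

Fill out the session satisfaction survey | Give feedback to the hard-working speaker

Additional information from the author

Shared notes start here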

tags: DevOpsDays Taipei 2024

The service suddenly crashes over the weekend, outside working hours. What do you do?

People * Architecture * Infrastructure * Process * Cloud = System complexity

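Among more than 300 senior executives surveyed: 50% have observability in place, 60% reported shorter downtime, and 90% see strategic value in it.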

History of the evolution of observability

Observability has evolved over time, and it has meant different things at different points.

o11y 1.0 to 2.0

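o11y 1.0: when a request hits the server, collect telemetry along the path it takes. Is it slow? Is it broken?

The goal is to understand the state of the system and shorten MTTR.

Three basic types of telemetry data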

  • Metrics, Traces, Logs
  • Is there a problem? Where is it? What is it?

Metrics

  • External: SLA
  • Internal: SLO, SLI, SLS, error budget

The purpose of establishing metrics is to meet the commitments made to external parties.
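
To make the link between an internal SLO and its error budget concrete, here is a minimal sketch. The 99.9% target, the 30-day window, and the function names are illustrative assumptions, not figures or code from the talk.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is missed)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO over 30 days with 10.8 bad minutes so far.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, bad_minutes=10.8), 2))
```

The SLI is the measured value (here, bad minutes), the SLO is the internal target, and the SLA is the external promise that the SLO is meant to protect.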

Distributed Trace

The purpose of tracing is to know the path of nodes that a request passes through.
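
As a rough illustration of how a node path can be recovered from trace data, the sketch below rebuilds the call order from spans that carry a parent ID. The `Span` fields and the service names are hypothetical and do not follow any specific tracing backend's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str


def request_path(spans: list[Span]) -> list[str]:
    """Return service names in depth-first call order, starting from the root span."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    path: list[str] = []

    def visit(span: Span) -> None:
        path.append(span.service)
        for child in children.get(span.span_id, []):
            visit(child)

    for root in children.get(None, []):
        visit(root)
    return path


if __name__ == "__main__":
    trace = [
        Span("a", None, "gateway"),
        Span("b", "a", "orders"),
        Span("c", "b", "db"),
        Span("d", "b", "payments"),
    ]
    print(" -> ".join(request_path(trace)))
```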

Structured Log

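  • Structuring logs makes them machine-readable, which yields more detail and statistics
  • Record data in a standardized structure so that it is easy to parse
  • Structured logs
  • Application logs
  • Security logs
  • System logs
  • Audit logs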
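
A minimal sketch of machine-readable logging with Python's standard `logging` module, emitting one JSON object per line. The `fields` key passed through `extra` and the `category` value are conventions invented for this sketch, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}} (sketch convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, ensure_ascii=False)


def make_logger(name: str, stream=None) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("audit")
    log.info("login failed", extra={"fields": {"user_id": "u-123", "category": "security"}})
```

Because every record has the same keys, a log pipeline can filter by `category` or count failures per `user_id` without regex parsing.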

Troubleshooting workflow

Alert -> check the dashboard -> run ad hoc queries -> ??? -> check the trace data -> ???
(Argh, the slide went by too fast.)

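Problems that follow from o11y 1.0:

  • The technology stack becomes more complex
  • Telemetry data volume explodes
  • Costs become a problem

Collecting this much data does not necessarily deliver value or pay off in cost-effectiveness.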

o11y 2.0

The o11y signals evolve, adding the following telemetry data types:

  • profiles
  • dumps

The CNCF (Cloud Native Computing Foundation) published a whitepaper on this.

Profiling

Dig into the application to learn what is happening with CPU, memory, the stack, GC, and so on.
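
As a small illustration of what a profile captures, the sketch below uses Python's standard `cProfile` and `tracemalloc` modules to collect CPU time per function and peak memory for one call. `build_squares` is a hypothetical workload, and production continuous profilers work differently (typically by sampling).

```python
import cProfile
import io
import pstats
import tracemalloc


def build_squares(n: int) -> list[int]:
    """Hypothetical workload that burns some CPU and allocates memory."""
    return [i * i for i in range(n)]


def profile_call(func, *args):
    """Run func under cProfile and tracemalloc; return (result, cpu_report, peak_bytes)."""
    tracemalloc.start()
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Format the five most expensive entries by cumulative time.
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return result, out.getvalue(), peak_bytes


if __name__ == "__main__":
    _, report, peak = profile_call(build_squares, 100_000)
    print(report)
    print(f"peak memory: {peak} bytes")
```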

Shift Left

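In the past, the focus was on operations (Operate) after release.

The development process is the focus of shifting left.

Shift left to where? To CI/CD, and to coding.

  1. Develop by forming hypotheses about how the code will behave
  2. Record the behavior with signals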
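
The two steps above can be sketched as follows: state a hypothesis about the code's behavior, emit a signal from the code, then check the hypothesis against what was recorded. Everything here is hypothetical: the in-memory `SIGNALS` store stands in for a real metrics backend, and `lookup_price` and the 50% cache-hit hypothesis are invented for illustration.

```python
import time
from collections import defaultdict

# In-memory stand-in for a metrics backend (an assumption of this sketch).
SIGNALS: dict[str, list[float]] = defaultdict(list)


def record(name: str, value: float) -> None:
    """Append one observation to the named signal."""
    SIGNALS[name].append(value)


def lookup_price(sku: str, cache: dict[str, float]) -> float:
    """Hypothetical function under development, instrumented while it is written."""
    start = time.perf_counter()
    hit = sku in cache
    price = cache[sku] if hit else 0.0
    record("lookup_price.cache_hit", 1.0 if hit else 0.0)
    record("lookup_price.duration_ms", (time.perf_counter() - start) * 1000)
    return price


def hypothesis_holds(name: str, minimum_mean: float) -> bool:
    """True when the signal has data and its mean is at least minimum_mean."""
    values = SIGNALS[name]
    return bool(values) and sum(values) / len(values) >= minimum_mean


if __name__ == "__main__":
    cache = {"a": 1.0, "b": 2.0}
    for sku in ["a", "b", "c", "a"]:
        lookup_price(sku, cache)
    # Hypothesis: at least half of the lookups are served from the cache.
    print(hypothesis_holds("lookup_price.cache_hit", 0.5))
```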

Deploy stage

CI/CD

  • The pipeline must be stable
  • Use tooling to check the health of each pipeline stage
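
One possible way to check stage health is to aggregate past pipeline runs into a success rate and mean duration per stage. This is a sketch only: the `StageRun` record, the stage names, and the 95% threshold are assumptions, and a real CI system would supply this data through its own API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StageRun:
    stage: str
    succeeded: bool
    duration_s: float


def stage_health(runs: list[StageRun]) -> dict[str, dict[str, float]]:
    """Per-stage success rate and mean duration across the given runs."""
    grouped: dict[str, list[StageRun]] = defaultdict(list)
    for run in runs:
        grouped[run.stage].append(run)
    return {
        stage: {
            "success_rate": sum(r.succeeded for r in items) / len(items),
            "mean_duration_s": sum(r.duration_s for r in items) / len(items),
        }
        for stage, items in grouped.items()
    }


def unstable_stages(runs: list[StageRun], min_success_rate: float = 0.95) -> list[str]:
    """Names of stages whose success rate falls below the threshold, sorted."""
    health = stage_health(runs)
    return sorted(s for s, h in health.items() if h["success_rate"] < min_success_rate)


if __name__ == "__main__":
    history = [
        StageRun("build", True, 60.0),
        StageRun("build", True, 80.0),
        StageRun("test", True, 120.0),
        StageRun("test", False, 100.0),
        StageRun("deploy", True, 30.0),
    ]
    print(stage_health(history))
    print(unstable_stages(history))
```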

Observability-Driven Development (ODD)

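Plan -> design -> develop -> test
Goals:

  1. Establish an effective feedback mechanism
  2. Break down the barrier between development and operations
  3. Foster a culture of data-driven decision making

Developers (RD) mainly deal with known unknowns.
Developers (RD) should build the habit of making decisions from data.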

ODD framework

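Define KPIs -> consider observability at design time -> collect telemetry data -> build a real-time feedback mechanism -> keep optimizing and adjusting

Goal:
Ensure that good observability is already in place during the development cycle.

ODD measurement indicators (a bit abstract)

Monitoring metrics (system health)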

  • R.E.D.
  • U.S.E.
  • Latency, traffic, errors, saturation (Google's four golden signals)
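
As one concrete reading of R.E.D. (Rate, Errors, Duration), the sketch below derives the three values from a batch of request records. The `Request` fields, the choice of HTTP 5xx as the error condition, and the nearest-rank p95 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_ms: float


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Rate (req/s), error ratio (5xx share), and p95 duration for one time window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_duration_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: rank = ceil(0.95 * n), computed with integer math.
    rank = max(1, (95 * len(durations) + 99) // 100)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": sum(r.status >= 500 for r in requests) / len(requests),
        "p95_duration_ms": durations[rank - 1],
    }


if __name__ == "__main__":
    # Assumed sample: 20 requests in a 10-second window, two of them failing.
    sample = [Request(200, 10.0 * i) for i in range(1, 19)]
    sample += [Request(500, 190.0), Request(503, 200.0)]
    print(red_metrics(sample, window_s=10.0))
```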

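System architecture: Which external services are there? Which services are critical? How do the services relate to each other?
Identify critical services and external services, clarify the dependencies, and identify the important scenarios.

System framework: MBMP (metrics-based process management)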

Goal : SLA, SLOs, SLIs

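ODD measurement indicators, using an AI customer service assistant as the example

Monitoring metrics:
System architecture:
System framework: (see the slides)
Goal: the answer is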

o11y 3.0

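3.0: A.B.C + OTel
AI: a major challenge
Big Data:
OpenTelemetry: defined jointly by the major vendors; four signals have been specified so far
Cloud: where the framework actually runs

Summary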


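Shared notes chat:

Excel 數據師

Are the slides available?

The image links are still being added. Once they are sorted out, they will be posted here. Thanks!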
