Talking About DevOps & Observability

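---
title: Talking About DevOps & Observability
tags: Sinyi, Observability, Monitoring, Log
description: Retrospect and Prospect 2022
slideOptions:
  controls: true
  allottedMinutes: 5
---

# Talking About DevOps & Observability

> Nathan Lu

<div style="text-align: right"> 02/23/2022 </div>

---

# Outline

- Operational Blindness
- Monitoring
- Logging
- Metrics
- Observability

---

# Preface

This talk mainly covers the current situation inside our company.

It stands on the right side of DevOps, leaning toward the operations point of view.

The Observability part is only an introduction.

---

## Operational Blindness

![](https://i.imgur.com/1dnit6x.jpg =600x400)

----

### War Story

![](https://i.imgur.com/POdkXcJ.png)

Note:
One day at noon, alert notifications erupted in the operations team and messages came flooding in. Everyone jumped up and found that the website was down: the last several health checks from external monitoring had failed, which triggered the alerts.

----

![](https://i.imgur.com/k8DCN9x.png =700x500)

Note:
The website turned out to be down.

----

### Black-Box Monitoring

![](https://i.imgur.com/CknOiks.png)

Note:
We logged in to the host monitoring system and found that the website's system metrics were normal: memory was still available, CPU was not at 100%, disk and I/O were within tolerance, and the database was alive.

----

![](https://i.imgur.com/G82g0nD.png)

Note:
Not knowing the cause, we had to escalate the problem to the development team, because the hosts and the web service processes themselves looked fairly normal.

----

![](https://i.imgur.com/Vm783kK.png =600x400)

Note:
The development team was not sure what to look at either, and they had no permission to log in to production hosts directly. We would not give everyone host access anyway: both the management cost and the security risk are high.

----

![](https://i.imgur.com/uep3689.png)

Note:
The developers had to work alongside the operations team and could only give instructions. Eventually, on the database, we found a pile of slow queries and a pile of connections in the CLOSE_WAIT state. On the web service we saw almost the same number of processes. Our guess was that the web service had probably reached its maximum processing capacity and stopped responding to new requests (such as the health check). In the end we killed all the DB query sessions and cleared the request queue on the web service, and things returned to normal.

----

![](https://i.imgur.com/wI4sTlx.png)

Note:
Finding the root cause wasted a lot of time. When the problem was escalated, the people pulled in did not have enough permission to see the host and service information that would help with diagnosis, so in the end they still had to rely on the operations team. In a situation like this, every minute of unavailability or downtime costs money. The operations team could only escalate because almost all they could see was host-level monitoring metrics.

----

### Question

- Separating the Dev and Ops teams: this way of organizing and communicating incurs indirect costs

Note:
In the traditional structure, a clear red line divides the responsibilities of development and operations. The operations team is responsible for hosts and infrastructure, which makes delivering applications possible.

----

![](https://i.imgur.com/X3kUYGy.png)

Note:
When the responsibilities of developers and operations are split too far apart in organizational structure and ways of working, it often turns into a tug of war that scatters everyone's effort. While the two sides pull against each other, recovery time slips away.
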

----

![](https://i.imgur.com/EW4950D.png)

Note:
Worst of all, users lose confidence in the quality of service the platform provides and jump to a competitor's platform.

----

## DevOps

![](https://i.imgur.com/B4a94yh.png)

###### [What is SRE | Tasks and Responsibilities of an SRE | SRE vs DevOps](https://youtu.be/OnK4IKgLl24)

Note:
Developers and operations should not be pulling against each other in a tug of war and scattering everyone's effort.

----

![](https://i.imgur.com/mQNibxr.png)

----

![](https://i.imgur.com/IuNaQsC.png)

###### [非同步系統的服務水準保證淺談非同步系統的 SLO 設計-91APP](https://www.91app.tech/static/97979df17dd25408c24c20ae74e27155/SLO-design-of-the-asynchronous-system-Andrew.pdf)

Note:
SLI (service level indicator) refers to an indicator, for example QPS, TPS, duration, accuracy, latency, or performance. Not every metric should be treated as an SLI: choose as few SLIs as possible, as long as they can still show whether the service is stable and reliable.

SLO (service level objective) refers to a target over a period of time, for example QPS at 99.99% over one month, or response time < 10 ms. An SLO is a range of values, where the values are service-level measurements defined by an SLI. A natural SLO definition is therefore that some SLI should normally stay below a given value or between two bounds. Choosing a suitable SLO is not easy, and the range does not need to be settled at the very start.

SLA (service level agreement) refers to the whole agreement, whose contents include the SLIs, the SLOs, and the means and time of recovery, among other things.

----

- Time-based availability

> Time-based availability = uptime / (uptime + downtime)
>
> SLO 99.95%: over one year, about 4 hours 22 minutes of unavailability
>
> SLO 99.99%: over one year, about 52 minutes of unavailability

----

- Count-based (aggregate) availability

> Aggregate availability = successful requests / total requests
>
> SLO 99.99%: with 2.5M requests per day, failed requests should stay at or below 250 per day

###### [事件處理案例](https://study-area.sre.tw/Incidents/)

----

- Latency-based

> SLO 99%: front-end user access latency, measured per second, < 300 ms

###### [[架構師的修練] #2, SLO - 如何確保服務水準?](https://columns.chicken-house.net/2021/06/04/slo/)

---

## Monitoring

![](https://i.imgur.com/6rgwWMz.png)

----

## Types of Monitoring

![](https://i.imgur.com/YUJ8QG7.png)

###### [Multi-Cloud Monitoring](https://www.meshcloud.io/2020/08/28/multi-cloud-monitoring-a-cloud-security-essential/)

----

## Monitoring Layers

![](https://i.imgur.com/vImhE4B.png =500x450)

---

## Logging

Deal with discrete events

- Application debug or error messages
- Audit-trail events
- Request-specific metadata
- Specific events

Note:
Application error messages, audit events, HTTP request events.

----

## Log Monitoring

![](https://i.imgur.com/J9PjBD0.png)

- Troubleshooting
- Monitor

----

- Clear log levels
- Good log message ([Structured Log](https://ithelp.ithome.com.tw/articles/10277678))
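- Log aggregation

Note:
Show Loki. Single-line logs, in JSON format, with a fixed set of metadata keys defined for each type of service.

----

## Drawbacks of Log Monitoring

- Low value density

```
[INFO] .... initializing...
[INFO] ... request from xxx.xxx.xxx.xxx user_id xxxxxx
```

Note:
The density of valuable information is too low; much of the log is as worthless as the first line above.

----

This video covers many of the benefits of centralizing and monitoring logs, and how to make log messages express things clearly.

[Logging in the age of Microservices and the Cloud](https://www.youtube.com/watch?v=zdfhgtcm4uk)

----

- Writing logs into an RDBMS
  - When log writes are highly concurrent and the target is an RDBMS, the concurrent transaction throughput it can sustain will not be high
  - Enlarging the buffer pool only goes so far, and is not recommended either
  - Logs have many kinds of fields, so choosing indexes is hard
  - If the DB really goes down, you cannot log in to read the logs, and they may not even have been written
  - It can easily be the last straw that breaks the DB
  - Let the DB go back to storing and serving business state and data

---

## Metrics

4 Golden Signals

- Latency : time to service a request
- Traffic : requests/second
- Errors : error rate of requests
- Saturation : fullness of a service

###### [SRE-book](https://sre.google/sre-book/monitoring-distributed-systems/)

###### [Observability: Metric, Logging, and Tracing, Oh My!](https://www.youtube.com/watch?v=ZVKrN1RLetI)

Note:
Latency reflects user experience and measures core system performance, e.g. the system's processing time, or job completion time in a job-computing system.
Traffic reflects how much service the system delivers, e.g. the number of requests, or the size of network packets sent and received.
Errors help detect and locate faults and problems, e.g. total error count, or the failure rate of service calls.
Saturation reflects the system's saturation and load, e.g. memory used by the system, or the length of the job queue.

----

- [Metrics](https://prometheus.io/docs/concepts/data_model/#notation) for Prometheus
  - metric name and label sets

```
<metric name>{<label name>=<label value>, ...}
```

```
# TYPE http_requests_total counter
|--------------------------- Metric ----------------------------|-timestamp -|-value-|
|--- metric name --|------------------ labelsets ---------------|
http_requests_total{code="200",handler="prometheus",method="get"} 721
```

Note:
A label set represents one dimension under this metric name. There can be multiple dimensions, which makes aggregation convenient.

----

## Metric Types

- Counters (rate)
- Gauges (value)
- Distribution
  - Histogram (heatmap)
  - Summary

###### [Observability of Distributed Systems](https://www.youtube.com/watch?v=SoZZzB-yTOk&list=WL&index=115&t=77s)

Note:
A counter only increases and never decreases. It is typically used for the total number of requests, the number of completed tasks, or the number of errors, or to compute the rate of change over a period of time. An expected upper bound can be obtained in advance through stress and load testing and then used for monitoring and alerting. Another use is querying the top 10 most-visited HTTP URLs in the current system.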
A gauge reflects the instantaneous value, which keeps changing over time. It is typically used to record CPU and memory usage, the number of coroutines, pool usage, the number of concurrent requests, and so on. Fitting a linear regression model to the samples makes it possible to predict the trend of the data.

A Histogram samples observations and places them into buckets with numeric upper bounds, recording the count in each bucket, the total count of observations, and the sum of their values. It suits request latency and other sampled data. It helps distinguish whether things are slow on average or slow in the long tail, and gives a quick view of how the monitored samples are distributed.

A Summary is similar to a Histogram: it samples observations and yields their count and sum. In addition, it keeps a sliding window and computes quantiles of the samples within that window.

----

## Architecture

![](https://i.imgur.com/MHtKRgQ.png =x550)

---

## Observability

![](https://i.imgur.com/Z0x2oFn.jpg)

###### [The Observability Pipeline](https://www.slideshare.net/TylerTreat/the-observability-pipeline)

Note:
Monitoring tells you whether the system works; observability lets you ask why it is not working.

----

![](https://i.imgur.com/dNaiuoB.png)

----

## Pillars of Observability

![](https://i.imgur.com/kf6Xd1i.jpg)

---

# News

[【企業SRE實例：新加坡星展集團】頂尖數位銀行如何再進化，SRE轉型是變身科技公司的關鍵](https://www.ithome.com.tw/news/144120)

[【臺灣SRE實例：17Live集團】多功能型SRE化身內部信心來源，天天成為開發團隊後盾](https://www.ithome.com.tw/news/144122)

[【臺灣SRE實例：Line臺灣】如何確保Line服務天天不中斷，專責SRE扮演開發與維運的橋樑](https://www.ithome.com.tw/news/144121)

[Line臺灣百億筆遙測數據的可觀察性平臺架構大公開](https://www.ithome.com.tw/news/149317)

[臺灣大型企業如何上手SRE，Google建議先做這4件事](https://www.ithome.com.tw/news/144119)

---

## Reference

[SRE-BOOK](https://sre.google/sre-book/table-of-contents/)

[Operations Anti-Patterns, DevOps Solutions](https://books.google.com.tw/books?id=g3kFEAAAQBAJ&dq=Operations+Anti-Patterns,+DevOps+Solutions&hl=zh-TW&source=gbs_navlinks_s)

[Logging and Log Management](https://books.google.com.tw/books?id=Rf8M_X_YTUoC&dq=logging+and+log+management&hl=zh-TW&source=gbs_navlinks_s)

[阿里雲-日誌服務](https://help.aliyun.com/document_detail/48869.html)

[Grafana Documentation](https://grafana.com/docs/grafana/latest/)

[Prometheus](https://prometheus.io/docs/prometheus/latest/getting_started/)

[Loki Documentation](https://grafana.com/docs/loki/latest/)

[FluentBit Documentation](https://docs.fluentbit.io/manual/)

---

### Thank you!

You can find me on

![](https://member.ithome.com.tw/avatars/120846?s=ithelp)

- [Blog](https://tedmax100.github.io/)
- [IT邦](https://ithelp.ithome.com.tw/users/20104930/ironman)
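
---

## Appendix: SLO Arithmetic

The availability figures on the SLO slides (time-based and count-based) can be reproduced with a few lines of arithmetic. The sketch below is only an illustration, not part of the original talk: it assumes a 365-day year, and the function names `error_budget`, `allowed_downtime`, and `allowed_errors` are made up for this example rather than taken from any library.

```python
from datetime import timedelta
from fractions import Fraction


def error_budget(slo_percent: str) -> Fraction:
    """Share of time (or requests) allowed to fail, e.g. "99.95" -> 1/2000.

    The SLO is passed as a string so Fraction can parse it exactly,
    avoiding floating-point noise such as 1 - 0.9995.
    """
    return 1 - Fraction(slo_percent) / 100


def allowed_downtime(slo_percent: str,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Time-based availability: downtime the SLO tolerates within `period`.

    The 365-day default is an assumption made for this sketch.
    """
    seconds = Fraction(int(period.total_seconds())) * error_budget(slo_percent)
    return timedelta(seconds=float(seconds))


def allowed_errors(slo_percent: str, total_requests: int) -> int:
    """Count-based availability: failed requests the SLO tolerates (rounded down)."""
    return int(total_requests * error_budget(slo_percent))


if __name__ == "__main__":
    print(allowed_downtime("99.95"))            # 4:22:48
    print(allowed_downtime("99.99"))            # roughly 52 and a half minutes
    print(allowed_errors("99.99", 2_500_000))   # 250
```

Under these assumptions, a 99.95% SLO leaves 4 hours 22 minutes 48 seconds of downtime per year, and a 99.99% SLO on 2.5M daily requests tolerates 250 failed requests, in line with the numbers on the slides.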