Cheng Lin Yu
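# SRE Book Reading 11/16 (Tue)

Slide: [SRE Study Notes - Opening, CH2,3,4](https://www.slideshare.net/rickhwang/sre-study-notes-opening-ch234)

# Ch2 Data Center

## Data center

- Hardware
    - The switch (Jupiter) is built by Google itself
    - The network is connected in a Clos topology

## Borg: system software for the physical servers

- Controls the machine lifecycle
- Predecessor of k8s
- Responsible for running jobs
- Borg's own DNS: BNS
- Repairs its own machines

## Storage

- Provides a simple storage cluster service
- Conceptually similar to HDFS (Hadoop Distributed File System)

Distance and traffic between data centers are both managed by software.

## Other system software in the data center

- Lock service: Chubby
- Borgmon: Google's own monitoring system. Chapter 8 or 9 covers it in depth, and it is fairly heavy going

# Ch3 Embracing Risk

Management concepts

- How to measure a service's risk
- The infrastructure's risk tolerance
- The core of the chapter is the error budget, which gives SRE an allowance for making mistakes
- The cost of reliability does not grow linearly; it may grow 100x
    - Opportunity cost
    - Cost of computing resources
- Clearly separate operational risk from business risk

## Measuring Service Risk

- Hard to measure: user dissatisfaction, trust, the organization's image and reputation, media coverage
    - These fairly abstract things still have to be brought into the conversation with the business side
- What counts as good? What counts as bad? How do you qualify it?
- Risk tolerance metric: in plain language, how much downtime is acceptable?
    - It differs between businesses and between service products
    - e.g. a business unit insists that not a single SMS may be dropped
- Availability is tricky: what is your definition?
    - 99.99% availability over a year allows at most about 52 minutes of downtime
    - At 2.5 M requests per day, there must be fewer than 250 errors per day
- Not every request is equal
    - Some API requests are heavy, others are lightweight

### SLA service commitments

- 99.5% allows at most about 1.825 days of downtime per year
- 99.999% allows only about five minutes of downtime per year; if your boss calls out this number, evaluate it carefully

## Service risk tolerance

- What is the tolerance of your customers (consumers)?
- The tolerance of the infrastructure (mostly handled by the cloud infra)

## Identifying the risk tolerance of consumer services

- The "expectations" of consumers and customers
- If no revenue is involved, it is actually negotiable
- Depending on a third-party service can easily cause problems
- Paid service? Free service?
- What level are the competitors in the market at?

B2B or B2C ... approach different industries and audiences from different angles.

### Types of failure

- Case A: some images on the site fail to display
- Case B: user A of the site sees user B's data
- The Gmail VP said: if a Case B problem happens, we stop operating the service immediately, because the impact is too large
    * iThome news: [Google儲存SRE團隊負責人第一手經驗大公開](https://www.ithome.com.tw/news/105366)
- Service interruptions for regular maintenance are accepted

### Cost

- How much does it cost to build and operate one more digit of availability while keeping the service commitment?
- How much does each extra 9 cost?
    - Going from 99.9% to 99.99% is a 10x difference: the yearly downtime ceiling drops from about 525 min to about 52 min
- Can the extra revenue offset the cost?
    - e.g. in the book's example, the extra 0.09% of availability is worth only $900 on $1M of revenue, so the improvement pays off only if it costs less than that

### Infrastructure risk tolerance

- Google uses Bigtable as the example
- Anything data-related is usually especially important; the focus is on throughput rather than reliability

### How an error budget is built

- Example
    - The quarterly operations target is 99.999%
    - Error budget: 0.001%
    - A problem that caused x% of failures uses up n% of the error budget

## Key points

- Managing risk is expensive
- 100% is not the right target
- Adjusting the error budget can motivate the team; it stresses shared responsibility, reviewed together with the development team rather than by SRE alone

# Ch4 SLO

## SLI (indicator): a concrete, quantified service metric

- e.g. CPU usage is an indicator
- latency
- error rate
- QPS and requests per second
- AWS S3 durability is eleven 9s, i.e. the likelihood of losing data over time

## SLO (objective): the service target

- e.g. what the range is: CPU > 80% needs an alert
- What the numeric target is and where the limits are
    - latency < 500ms
    - error rate < 1 RPS
    - throughput > 100 RPS
- e.g. AWS EBS availability is 99.95%

## SLA (agreement): the service agreement

- Puts the SLI and SLO above together
- Describes more of the legal terms
- Consequences: e.g. penalties, refunds
- If no consequence / causal relationship is defined, you are discussing an SLO, not an SLA
- SRE usually does not take part in planning the SLA, but does take part in defining SLIs and measuring SLOs

The enterprise world has another vocabulary for this: ITIL

### Story: Chubby planned outage

- Failures were so rare that everyone else assumed it would never go down
- But SRE knew that an outage would affect many services
- Solution:
    - SRE guarantees that a given SLO is met, but does not greatly exceed it
    - Force every service owner/developer to know how to cope with the inherent flaws of distributed systems; they will never be perfect

Distributed architectures are imperfect.

## SLIs in practice

- Not every metric needs to be defined as an SLI
- Four or five representative health indicators are enough
- Common groupings:
    - What everyone cares about, e.g. availability, latency, throughput
    - Correctness and so on

## Collecting metrics

- Borgmon
- Prometheus
- ELK, Grafana, CloudWatch Logs

## Aggregation

- Collecting metrics requires the raw measurement data
- Before measuring, you need to know what there is to measure
- Sample every second, or every minute?
    - AWS is always per minute
    - My own sample rate is ten seconds, sent to CloudWatch
    - Some problems happen in an instant; to chase them you may have to tighten sampling to every second
- SRE prefers analyzing percentile data rather than averages
    - The long tail is more telling than the mean
    - A preference for statistics .__.

## Standardizing indicators

- Standardize the common SLIs to avoid re-evaluating them every time
    - e.g. measurement frequency: once every ten seconds

## SLO

- If you serve both batch and interactive users, the objectives can be:
    - Batch: 95% of Set RPCs complete in < 1 s
    - Interactive: 99% of Set RPCs with a payload < 1 kB complete in < 10 ms (compound condition)
        - If the payload is larger, the request is not held to the < 10 ms requirement
        - In other words, only 1% may take > 10 ms
    - These are conditional definitions
- Don't only look at what is in front of you; start from the whole picture
- Once more: a 100% SLO is unreasonable and expensive
    - Imperfect is beautiful too -- I think this is important
- As you get to know the system over time, review regularly whether adjustments are needed; delete whatever can be deleted
    - 91APP case: when you see an alert that does not need immediate handling, paste it into Excel, and every week everyone votes on which ones to delete
    - John's case: thousands of machines, more breaking every day than getting fixed; a five-person team could not keep up, and eventually everyone went numb
        - Most machines were bought second- or third-hand, by bidding on hardware released by companies that retire it after five years
        - If the CPU is not at 100% while a compute job is running, something is wrong
        - The eventual solution was to reboot machines periodically over the network, ~~a reboot cures everything~~
    - You don't have to alert the moment something is detected; retry three times first, then alert
- TTLB (time to last byte) case sharing
    - CDN
    - Microservice API integration
        - An API that has been released is hard to take back; parameters can only be added, never removed
- AWS Athena (a service similar to BigQuery): https://aws.amazon.com/tw/athena/

## SLAs in practice

- You have to communicate with the business unit; this is hard to do, and a perfect score is impossible
- An SLA is the standard used to evaluate the consequence clauses; the most common ones are
    - Availability 99.99% / year
    - Durability 99.999999999% / year
- An SLA is composed of multiple SLOs

## Summary

- An SLA is the standard used to evaluate consequences
- The common ones are availability and durability

## Q & A

- What do S3's eleven 9s represent?
    - SLA
    - SLO
    - SLI

## Further discussion Q & A:

- What kind of service requires time to last byte (TTLB)?
    - e.g. search services, CDNs
- What kind of service requires time to first byte (TTFB)?
    - Website requests are held to TTFB
    - The sooner the request gets a response, the better
- How do you define fast and slow with the business side?
    - Web front-end case: use the standards published by Google/Facebook as the indicator
    - Fast and slow are relative; naming something to compare against makes adjustments easier
    - IP camera network speed: we ended up displaying fps directly so the customer could see whether their speed counted as fast or slow, and the complaints about slowness stopped
- Log processing
    - Process and structure the logs, archive them on S3, then query and aggregate them through Redshift
- How to monitor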
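The availability and error-budget numbers quoted in the Ch3 notes can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not anything taken from the book: the function names are made up for illustration, a year is assumed to be 365 days, and the 0.0002% failure figure in the last call is a hypothetical stand-in for the "x%" in the notes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a 365-day year


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Longest downtime in the period that still meets the availability target."""
    return period_minutes * (100.0 - availability_pct) / 100.0


def allowed_errors(availability_pct, total_requests):
    """Most failed requests that still meet a request-based availability target."""
    return total_requests * (100.0 - availability_pct) / 100.0


def budget_consumed_pct(slo_pct, failure_pct):
    """Share of the error budget (100 - SLO) used up by a given failure percentage."""
    return failure_pct / (100.0 - slo_pct) * 100.0


if __name__ == "__main__":
    for target in (99.5, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} min/year ({minutes / 1440:.3f} days)")

    # 2.5 M requests per day at 99.99%
    print(f"errors/day: {allowed_errors(99.99, 2_500_000):.0f}")

    # Hypothetical incident: 0.0002% of requests failed against a 99.999% SLO
    print(f"budget used: {budget_consumed_pct(99.999, 0.0002):.1f}%")
```

Keeping the period as a parameter makes it easy to switch from a yearly to a quarterly window, which is the window the error-budget example in the notes uses.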
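The point in the Aggregation notes about preferring percentiles to averages is easier to see with numbers. Below is a small sketch under stated assumptions: the latency samples are invented, `percentile` is a hand-written nearest-rank helper rather than a library call, and `meets_slo` is a hypothetical name for checking an objective of the form "99% of requests complete in < 10 ms".

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(samples, threshold_ms, target_pct):
    """True if at least target_pct percent of samples are strictly below threshold_ms."""
    fast = sum(1 for s in samples if s < threshold_ms)
    return fast * 100 >= target_pct * len(samples)


# Invented data: 95 fast requests and 5 very slow ones (the long tail)
latencies_ms = [20] * 95 + [2000] * 5

if __name__ == "__main__":
    print("mean:", mean(latencies_ms))
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
    print("p99:", percentile(latencies_ms, 99))
    print("99% under 500 ms?", meets_slo(latencies_ms, 500, 99))
```

With this data the mean describes no request that actually happened, while the median and the 99th percentile show that most requests are fast and the tail is very slow.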
