8-1. Large-scale cluster management at Google with Borg

# 8-1. Large-scale cluster management at Google with Borg ## Introduction * Borg 是 google 內部用來維護 cluster 的 cluster manager，k8s 就是 borg 的開源 implement，他的功能包括開始、重啟、排程、監控各種 run 在 cluster 上的application。 * 使用 borg 有三大好處: * 隱藏資源調度與錯誤處理這兩件事情，讓 user 可以專注在開發上。 * 本身擁有高可用性跟可靠性，因此架構在他之上的應用程式也會有這兩個特性。 * 可以讓 user 很有效率的 run 各種不同 application 在 cluster 上。 ## The user perspective * borg 的 user 主要是 google 內部的軟體開發者及系統維護者。 ### The workload * borg將 job 分成兩類，production 與 non-production: * production: 基本上服務不能中斷，且須保證 latency 足夠低，例如 gmail、google docs。 * non-production: batch processing 的工作，相對時間沒那麼敏感的工作，可以等到 cluster 較空閒時再處理，像是產生每日報表等等。 ### Clusters and cells * 在同一個 cell 裡的所有 machine 都會在同一個 datacenter 中，且互相之間由高速的網路所連接。 * cell 是什麼可以參考下圖，我自己理解成整個 cluster。 ![image](https://hackmd.io/_uploads/S1FU_heHR.png) * cell 裡的 machine size 為 10k。 ### Jobs and tasks * borg 中的 job 包含: * name * owner * the number of tasks it has * 且 job 可以包含 constraints，也就是像 k8s affinity 那樣可以指定該 job 要 run 在哪些特定 node 上。 * 所有的 job 都不會 run 在 vm 上，因為要避免虛擬化所造成的效能損失。 * 在實際執行 task 時，除了使用預設的config，也可以指定該 task 所需要 resource，例如(CPU cores, RAM, disk space, disk access rate, TCP ports,2 etc.) * job 支援 declarative 的方式去定義，我想應該是類似 k8s 中的那些 yaml 檔。 * job schedule 的狀態機 ![image](https://hackmd.io/_uploads/SyZe2ngrA.png) * job 可以在執行時動態的變更其 config，新的 config 會在一定的時間之後被apply 到 job。 * 有些操作不行，例如變更執行的 binary 檔。 ### Allocs * *alloc* 指的是在某一台機器上預留給 borg 使用的資源，這些資源保持分配狀態，不論它們是否被使用。 * 其功能是 * 可以為未來的任務預留資源 * 將來自不同工作但高度相關的任務集中到同一台機器上。 * 例如: > a web server instance and an associated > log saver task that copies the server’s URL logs from > the local disk to a distributed file system * *alloc set* 是指一群 *alloc* 形成的集合，也就是多台機器上被預留的資源，可以提交多個 job 給 *alloc set* 運行。 ### Priority, quota, and admission control * priority: * 所有 job 會有一個 priority，borg 會依照 priority 去決定哪個 job 需要被先做。 * 優先度從高到低被分為 monitoring, production, batch, and best effort (also known as testing or free)。 * borg 允許 preempt，也就是高優先度的工作可以 kill 掉正在執行的低優先度工作，並且將資源搶過來。 * 為了避免循環搶占(A搶B，B再搶C)，borg 規定 monitoring 跟 production 這兩個 layer 不允許 preempt。 * quota: * quota 是指該 job 所需的最少資源(CPU, RAM, disk, etc.)數量，不滿足 quota 的 job 會被 reject，我覺得類似 k8s pod spec 中的 request。 * 可以設定各種資源在特定 priority 且在某一個時間區段最多可以使用多少(e.g., “20 TiB of RAM at prod priority from now until the end of July in cell xx”)。 * 通常 production 的 quota 會被指定成實際可用的資源總量，因此提交 production 工作的用戶可以期望其工作能夠順利運行。 * admission control: * borg 可以賦予某些 user 特殊權限 * 允許管理員刪除或修改單元中的任何工作。 * 允許用戶訪問受限的內核功能或 Borg 行為，例如在其工作上禁用資源估算。 ### Naming and monitoring * Naming * 為了要找到每個 task 實際上 run 在哪一台機器上，borg 維護一個 BNS(Borg name service) 的服務。 * 每個 task 會額外包含以下資訊: * cell name * job name * task number * borg 會將每個 task 使用的機器的 hostname 與 port 寫入 chubby 中的某個文件，利用 chubby 達到高可用及一致性。 > The BNS name also forms the basis of the task’s DNS name, so the fiftieth task in job **jfoo** owned by user **ubar** in cell **cc** would be reachable via 50.jfoo.ubar.cc.borg.google.com. * monitoring * borg 在幾乎每個 task 裡都 run 了一個 http server，利用這個 server 去監控該 task 的狀態。 * borg 會定時的去 ping 各個 task 的 health-check URL，以確定 task 的狀態，fail 的 task 將會被重啟。 * 其他數據則由專門的監控軟體去監控。 * 監控軟體好像是使用一個叫 **Sigma** 的軟體，該軟體除了擁有一般監控軟體有的基本功能之外，還可以對於 pending 的 task 給出為何 pending，可以幫助使用者 debug。 * borg 會將所有 event 及其相關訊息存在一個只讀的 Infrastore 中，通過 Dremel 提供交互式 SQL 類接口，讓 user 可以獲得其中的訊息。 ## Borg architecture * 一個 borg cell 包含以下的組件: * Borgmaster(負責排程、監控、與其他機器溝通) * Borglet(run 在 worker 上的程式，負責 handle 從 master 來的 rpc) * 整體架構與 k8s 差不多。 ### Borgmaster * 每個 Borgmaster 包含兩個 process: * main process * > The main Borgmaster process handles client RPCs that either mutate state (e.g., create job) or provide read-only access to data (e.g., lookup job). It also manages state machines for all of the objects in the system (machines, tasks, allocs, etc.), communicates with the Borglets, and offers a web UI as a backup to Sigma. * scheduler * Borgmaster 在每個 cell 中會有 5 個 replica，replica 間使用 paxos 達到資料的 consistency。 * Borgmaster 對於自己的資料會做 checkpoint。 * google 額外開發了一個叫 Fauxmaster 的程式，它可以讀取 Borgmaster 產生的 checkpoint file，並模擬 Borgmaster 的行為。 ### Scheduling * 在 borg 中，排程的粒度是到 task 的層級。 * task 在被 submit 後會先被存放在 borgmaster 的 persisten store 裡。 * schduler 會使用 round-rabin 的演算法從高優先度到低優先度的去排程每個 task。 * 在排程時會做這兩個步驟: * *feasibility checking*: 找出一組符合這個任務 constraint 的機器，讓 scheduler 不會將 task schedule 到資源不足的機器。 * > In feasibility checking, the scheduler finds a set of machines that meet the task’s constraints and also have enough “available” resources – which includes resources assigned to lower-priority tasks that can be evicted. * *scoring*: 對找出來的每台機器進行評分，選擇最高分的機器做該 task。 * > In scoring, the scheduler determines the “goodness” of each feasible machine. The score takes into account user-specified preferences, but is mostly driven by built-in criteria such as minimizing the number and priority of preempted tasks, picking machines that already have a copy of the task’s packages, spreading tasks across power and failure domains, and packing quality including putting a mix of high and low priority tasks onto a single machine to allow the high-priority ones to expand in a load spike. * 如果最終選擇的機器資源不足的話(可能是因為多個 task 平行的被處理，所以 *feasibility checking* 會通過)，borg 會將 run 在該機器上較低優先度的 task kill 掉，被 kill 掉的 task 會回到 pending。 * scheduler 的 data locality 只有做盡可能的將 task schedule 在包含該 task 所需的 package 上的機器。 ### Borglet * borglet 的功能: > The Borglet is a local Borg agent that is present on every machine in a cell. It starts and stops tasks; restarts them if they fail; manages local resources by manipulating OS kernel settings; rolls over debug logs; and reports the state of the machine to the Borgmaster and other monitoring systems. * borgmaster 每隔幾秒鐘會從 borglet 那拉取其狀態，borglet 每次都會回傳其全部的狀態，但為了 scalability， borgmaster 那會維護一個叫 *link shard* 的東西，borglet 回傳的結果經過 *link shard* 後會被轉換成和上一個狀態的差異之後才被 borgmaster 所儲存。 * borglet 如果與 borgmaster 失去聯繫，borgmaster 會將 run 在該台 machine 上的 task 都 reschedule。 ### Scalability * 使用三個技術保證 borg 的 scalability: * **Score caching**: cache 某個 task 在各個 machine 的 *feasibility checking* 與 *scoring* 的結果，在機器狀態改變或 task requirement 改變時，invalidate cache。 * **Equivalence classes**: 對於同質性的 task 給予一個 label，針對同一個 label 的 task 只做一次的 *feasibility checking* 與 *scoring*。 * **Relaxed randomization**: 不需要對所有 cell 中的 machine 做 *feasibility checking* 與 *scoring*，只需要 random sample 出一組足夠多的 machine，保證做完 *feasibility checking* 後的 machine 足夠多台就行。 ## Availability * 下圖說明了在一個 cell 中，task failure 是很常發生的。 ![image](https://hackmd.io/_uploads/BkuvlWmBC.png) * borg 採用了以下措施去保證 availability: > * automatically reschedules evicted tasks, on a new machine if necessary; > * reduces correlated failures by spreading tasks of a job across failure domains such as machines, racks, and power domains; > * limits the allowed rate of task disruptions and the number of tasks from a job that can be simultaneously down during maintenance activities such as OS or machine upgrades; > * uses declarative desired-state representations and idempotent mutating operations, so that a failed client can harmlessly resubmit any forgotten requests; > * rate-limits finding new places for tasks from machines that become unreachable, because it cannot distinguish between large-scale machine failure and a network partition; >* avoids repeating task::machine pairings that cause task or machine crashes; and >* recovers critical intermediate data written to local disk by repeatedly re-running a logsaver task (x2.4), even if the alloc it was attached to is terminated or moved to another machine. Users can set how long the system keeps trying; a few days is common. * 主要核心的思想就是讓 run 到一半的任務可以從某個 checkpoint 接續著 run，而不是像 MapReduce 那樣一但 task fail，就需要 reschedule 並重 run 整個任務。 * borg 利用以上的措施達到了 99.99% 的 availability。 ## Utilization ### Evaluation methodology * 選擇衡量 utilization 的指標是 *cell compaction*，意思是在完成同樣的一個 workload 底下，我最多可以把這個 cell 縮到多小。 > ***cell compaction***: given a workload, we found out how small a cell it could be fitted into by removing machines until the workload no longer fitted * 實驗使用 Borgmaster 那部分提到的 Fauxmaster 去進行模擬，其使用的資料與 production 完全一致，並且移除機器的方式是隨機的且不刪除 workload 中任何的 job。 * 為何使用 *cell compaction* ? > We believe cell compaction provides a fair, consistent way to compare scheduling policies, and it translates directly into a cost/benefit result. ### Cell sharing * 將 prod 與 non-prod 的 task 同時 run 在一個 cell 是有好處的，下圖證明了這點。 * ![image](https://hackmd.io/_uploads/HkX7yG7HR.png) * 原因: 通常 prod 的 task 會 hold 大量的 resource 為了應對 workload spike，borg 可以將這些暫時用不到的資源拿來跑 non-prod 的 task，等到真的需要用時再利用 preempt 的機制去將資源搶回來。這樣就可以將資源利用率最大化。 * 如果改成 prod 在某個 cell 執行，non-prod 在另個 cell 執行，這樣需要再額外配置 20~30% 的機器。 * google 針對不做 cell sharing 做了另一個實驗，實驗的內容是當 user 使用的 memory 大於 10TB 或 100TB 這兩個閾值時，就將其切分到一個新的 cell。 * ![image](https://hackmd.io/_uploads/rJt4MfmrC.png) * 根據實驗結果，這樣做需要 2~16 倍的多餘 cell 與 20~150% 的多餘機器。 * 但 cell sharing 會導致的問題將不同的 user 的 job 或不同類型的 task 在同一台機器 run 可能會導致 cpu 速度變慢，於是做了實驗證明與得到的相比，cpu 本身的降速是可以容忍的，且 cell sharing 不只是在 cpu 上有優勢，在 disk 與 memory 同樣也有。 ### Large cells * 下圖顯示將一個大的 cell 分成好幾個小的 cell會顯著的影響效能，因此傾向將資源分在一個 large cell 而不是好幾個小的 cell。 ![image](https://hackmd.io/_uploads/SJP-4zmr0.png) ### Fine-grained resource requests * 在 borg 中 cpu 是以毫核(A core is a processor hyperthread, normalized for performance across machine types)為單位分配，而記憶體與硬碟則是 byte 為單位。在下圖中記憶體與 cpu 沒有明顯的偏好點(適合綁在一起 allocate)存在。 * ![image](https://hackmd.io/_uploads/Hyp1DMmB0.png) * 有嘗試將 cpu 以 0.5 core 與 memory 為 1GB 為單位去做分配，結果在中位數的表現增加了30~50% 的 overhead。 * ![image](https://hackmd.io/_uploads/SktJtGmBA.png) ### Resource reclamation * 因為通常 task 在 run 的時候不會用到其所有要求的 resource，所以 borg 針對 task 有一個預測的機制，borgmaster 每隔幾秒會根據歷史數據預測 task 會消耗多少 resource，將其稱之為 task 的 reservation，borg 會使用 reservation 而不是 task 的 request 去 run 前面提到的 *feasibility checking*，以達到節省資源的目的，以上整個流程就叫做 Resource reclamation。 * 因為 prod 的 task 較為重要，所以其 reservation 永遠等於 limit，也就是不做 Resource reclamation。而 non-prod 的 task 則是在 task 開始的前 300 秒，reservation 會相等於其的 request。 * 如果 non-prod 使用超過 reservation 的資源，會直接被 kill 掉，就算其使用的資源沒有超過 limit 也是一樣。 * 以下圖來看，沒有 Resource reclamation 的話，需要額外配置 20% 的機器。 * ![image](https://hackmd.io/_uploads/rkka0fmr0.png) * 下面是額外的實驗結果: * ![image](https://hackmd.io/_uploads/HkUmJ7QrC.png) * ![image](https://hackmd.io/_uploads/H1IV1mmSR.png) ## Isolation * 在 borg 中，有 50% 以上的機器同時 run 9 個以上的 task，有 10% 的機器開了 4500 個thread run 了 25 個 task 以上，所以如何讓這些 task 不互相干擾變成一個很重要的議題。接下來會從 security 與 performance 的角度去探討 isolation。 ### Security isolation * 讓 user 可以使用 borgssh(類似ssh) 連入執行該 user task 的機器，borgssh 會利用 linux 原生的 *chroot* 與 *cgroup* 去做權限管控。 * 如果有 task 使用到非 google 開發的軟體時，borg 會使用一個 KVM process(VM) 去 run task，以達到確保 security 的作用。 ### Performance isolation * 早期的 borglet 對於資源隔離使用相對粗暴的解決方式，定期的去檢驗 task 所使用的 cpu、memory、disk 是否有超過限制。 * 後來改使用 linux kernel 自帶的 *cgroup*(我覺得可以想成是 docker 的 container)做為資源隔離的選擇，但一些低階的資源汙染還是存在，例如 memory bandwidth 與 cpu L3 cache 的汙染。 * 且在此為每個 task 新增一個額外的叫 *appclass* 的屬性，*appclass* 分為以下兩種: * *latency-sensitive* * 需要及時返回結果的 task，在 borglet 中會優先執行這一類的 task，為此甚至會讓當下在執行的 *batch* 任務暫停。 * *batch* * 不那麼要求回傳時間的 task。 :::info 這裡說的 *appclass* 與 schedule 那的 priority 是不太一樣的東西，前者是該 task 已經被 schedule 在某台機器上了，然後在該台機器上該 task 與其他在該機器上的 task 的 priority。而後者是該 task 在所有 task 的 priority。在粒度上面不太一樣。 ::: * 資源可以被分成兩種: * *compressible*: > * (e.g., CPU cycles, disk I/O bandwidth) that are rate-based and can be reclaimed from a task by decreasing its quality of service without killing it. * *non-compressible*: > (e.g., memory, disk space) which generally cannot be reclaimed without killing the task. * 如果一台機器用完 *compressible resource*，borglet 會限制其不可再用更多，但如果一個機器用完 *non-compressible resource*，borglet 會將 task 由低到高 kill 掉，直到剩餘的量滿足需求。 * *latency-sensitive* 的 task 可以獨享一整個 cpu core，而 *batch* 的 task 則是不能選擇自己 run 在哪些 core 上。 ## Related work 參考[這篇](https://blog.csdn.net/fly910905/article/details/120840292)。 ## Lessons and future work ### Lessons learned: the bad * **Jobs are restrictive as the only grouping mechanism for tasks**: * 在 borg 中缺乏管理多個 job 的手段，且通常一個完整的 service 會包含多個 job，這導致了在 borg 中管理 service 會相對困難。 * 因此 kubernetes 改成在 pod 上使用 label 來解決這一問題，user 建立的 service 透過 pod 的 label 去找到屬於該 service 的 pod。 * **One IP address per machine complicates things**: * 同一台機器使用一個 ip 聽起來是一件很直覺的事情，但它在 borg 上帶來了許多問題: * borg 必須將 port 視為一種資源並管理 port。 * task 必須宣告其需要使用多少 port。 * borg 必須解決 port isolation。 * naming system 與 RPC 必須知道 port。 * 在 kubernetes 裡，採用了不同的設計，所有 pod 與 service 都有自己的 ip，這就避免了上述提到的這些問題。 * **Optimizing for power users at the expense of casual ones**: * borg 為了一些比較特定的用戶，將 API 的 spec 設定的非常複雜(其中有些 spec 高達230 個參數)，導致一般的使用者難以使用，且也導致 API 難以維護及修改。 * 解決方式是在 borg 上再加一層自動化的工具，幫助一般使用者使用合適的設定來使用 borg。 ### Lessons learned: the good * **Allocs are useful**: * allocs 讚 :+1: * kubernetes 中相似的功能是 pod。 * **Cluster management is more than task management**: * service 的維護比一般 task 的維護更加重要，而被維護在 borg 上的 application 同時也享受到了 naming 與 load balancing。。 * kubernetes 中相似的功能是 service。 * **Introspection is vital**: * 展示部分 debug 的訊息給 user，讓 user 可以自己找到問題並解決。並且提供多種分析工具給使用者讓其可以快速發現異常事件。 * 在 kubernetes 中也同樣保留了這個特性，使用者有各種工具可以去看到整個 cluster 的資訊。 > cAdvisor for resource monitoring, and log aggregation based on Elasticsearch/Kibana and Fluentd. * **The master is the kernel of a distributed system**: * master node 需要被精心設計，一開始 borgmaster 裡的 scheduler 與 sigma(UI)是被設計在一起的，但後來為了 scalability 就將其分開成兩個不同的 process。最後在開發過程中陸陸續續加入各種不同的功能，最終才變成論文中的 borgmaster。 * kubernetes 則更進一步，它的 master 只是一個 api server，則所有其他關於cluster manage 的 component 則被拆解成一個個的 micro service 與 api server 互動，例子有 replication controller、node controller 等等。 ### Conclusion > Virtually all of Google’s cluster workloads have switched to use Borg over the past decade. We continue to evolve it, and have applied the lessons we learned from it to Kubernetes. ## Reference * [論文本文](https://dl.acm.org/doi/pdf/10.1145/2741948.2741964) * [論文筆記](https://wsfdl.com/architecture/2022/05/02/borg-2015.html)，較注重 k8s 與 borg 之間的比較。 * [論文翻譯](https://blog.csdn.net/fly910905/article/details/120840292)