Deep Dive - Automating Safe, Hands-off Deployments

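## Deep Dive - Automating Safe, Hands-off Deployments

Original article: [link](https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/?nc1=h_ls)

Author: [Clare Liguori](https://aws.amazon.com/builders-library/authors/clare-liguori/), Principal Software Engineer at AWS

---

## Goals

- Learn how world-class deployments are done
- Spark thinking and ideas
- Identify where our own practice falls short

> This is a guided reading of the original article. The aim is to spark ideas, not to cover every detail.

---

## Why this article?

- It is broad enough to help think through the whole software lifecycle: the concepts and practices of development, testing, and deployment
- The responsibilities it describes span development, QA, and operations. Every role has to take part to complete the process, which is the DevOps spirit in practice
- It can be compared against the DevOps Three Ways: flow, feedback, and continual learning

---

## Background

Clare has been at AWS for 7 years, so the article opens with the two questions she asked when she interviewed at AWS 7 years ago

- How often do you deploy to production?
- How much time would I have to spend managing and watching deployments as a developer at Amazon?

----

Her experience at her previous company

- Only two major releases a year
- For minor releases or bug fixes, someone had to keep watching the logs for problems in production
- When something went wrong, they had to manually roll back to the previous version as quickly as possible

> Probably an experience many of us share

----

The AWS interviewer's answer at the time

- Deploy to production "multiple times" a day
- Developers usually do not need to watch the deployment or the logs, because the pipelines take care of that

---

## That magical? How is it actually done?

Reflection: in my past experience, this was simply impossible

---

## What is a "safe" deployment

**"Safe" here means a deployment that is trustworthy and carries minimal risk**

- Fewer manual-operation mistakes (automation)
- Sufficient testing and checks (the various tests in the Alpha, Beta, and Gamma stages)
- Deployment strategies (one-box, feature flags, canary deployment)
- Monitoring logs and raising alarms (high-severity alarm)

---

## The four pipeline stages

Source -> Build -> Test -> Prod (Deploy)

- For Source and Build, see the figures in the original article
- Source covers almost every type of development work
- Unit tests are completed in the Build stage; the Test stage runs additional tests

---

## Microservice pipeline

- application code pipeline
- infrastructure pipeline
- OS patching pipeline
- configuration/feature flags pipeline
- operator tools pipeline

These pipelines all target the same microservice. They are independent of one another for speed, but because they serve the same service they share the same safety checks

---

## Pipeline sources

- Each type of source has its own pipeline
- The configuration source pipeline is the notable one; the article mentions changes "like API rate limit increases and feature flags"
- If a feature flag causes a problem in production, it is rolled back too (writing feature flags is both a strategy and a challenge)

---

## Code review

- The last manual check. It requires approval, and everything after it is fully automated
- The pipeline rolls back any deployment that has not been code reviewed (on GitHub or BitBucket, this is the PR approval)
- A good code review practice uses a checklist (see the original article)

---

## Build and unit tests

- Typically, unit tests mock (simulate) all their API calls to dependencies, such as other AWS services
- Interactions with “live” non-mocked dependencies are tested later in the pipeline in integration tests
- Exercise edge cases like unexpected errors returned from API calls and ensure graceful error handling in the code

---

### Test deployments in pre-production environments

- Alpha, Beta
  - validate that the latest code functions as expected by running `functional API tests` and `end-to-end integration tests`

----

- Gamma
  - production-like (similar to staging)
  - including the same deployment configuration, the same monitoring and alarms, and the same continuous canary testing as production
  - deployed in multiple AWS Regions to catch any potential impact from regional differences

---

### Integration tests (pre-production)

- These tests exercise the full stack end-to-end by calling real APIs running on real infrastructure in each pre-production stage
- Reflection: in my past experience, environments were kept as isolated as possible for fear of breaking production. That worry exists precisely because production is too fragile: for example, APIs that cannot take repeated high-volume calls, or DB schema or cluster architecture problems that cannot absorb test data

----

- While unit tests run against mocked dependencies, integration tests run against a pre-production system that calls real dependencies, validating the assumptions of the mocks

----

- Integration tests run both positive and negative test cases
  - invalid input to an API and checking that an “invalid input” error is returned as expected
- Some pipelines also run a short load test in a pre-production stage to ensure that the latest changes don’t cause any latency or throughput regressions at real load levels.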
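
The positive and negative cases above can be sketched roughly as follows. This is only an illustration under stated assumptions: `EchoHandler`, `call_api`, `run_integration_checks`, and the `name` field are hypothetical names, and a local HTTP server stands in for a service deployed to a pre-production stage. The point is that the test sends real HTTP requests instead of calling a mock.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a service running in a pre-production stage."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            name = payload["name"]
            if not isinstance(name, str) or not name:
                raise ValueError("name must be a non-empty string")
            status, body = 200, {"greeting": f"hello {name}"}
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, a missing field, or a bad value is rejected.
            status, body = 400, {"error": "invalid input"}
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # keep test output quiet


def call_api(base_url, payload):
    """Send a real HTTP POST and return (status_code, decoded_json_body)."""
    request = urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as err:
        return err.code, json.loads(err.read())


def run_integration_checks():
    """Run one positive and one negative case against the running service."""
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base_url = f"http://127.0.0.1:{server.server_port}/"
    try:
        positive = call_api(base_url, {"name": "gamma"})  # valid input
        negative = call_api(base_url, {"name": ""})       # invalid input
    finally:
        server.shutdown()
        server.server_close()
    return positive, negative
```

In a real pipeline the base URL would point at the alpha, beta, or gamma endpoint rather than a server started by the test itself.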
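----

Because unit tests mock many of the packages or services they depend on, another way to think about writing them is: "this is a test of an assumption about how the dependency will respond." The integration tests can then verify those assumptions for real

---

### Backward compatibility and one-box testing (pre-production)

- Backward compatibility
  - we need to detect whether the latest code writes data in a format that the current code can’t parse
  - For a solution, see [Two-phase deployment technique (level 400)](https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments/?nc1=h_ls)

----

- One Box
  - Definition: a single virtual machine, single container, or a small percentage of Lambda function invocations
  - This one-box deployment leaves the rest of the gamma environment deployed with the current code for some period of time, such as 30 minutes or one hour. (bake time)

----

- One Box, continued
  - Traffic doesn’t have to be specially driven to the one box.
  - the one box receives ten percent of the gamma traffic (see the figure in the original article)

----

- One Box, continued
  - The one-box deployment monitors canary test success rates and service metrics to detect any impact
  - For example, a team’s microservice A in gamma calls the same team’s microservice B in gamma, but it calls the production endpoint for Amazon S3. (This cross-tests the pre-production and production environments)

---

### Production deployments (the main event)

- AWS currently has 25 Regions and 81 AZs
- The deployment goal is to avoid causing problems (negative impact) in multiple Regions and multiple AZs "at the same time"
- Gamma exists to rehearse the production deployment with the same method, so production also uses the one-box approach
- Side note: early on, AWS also deployed to pre-production differently from production in order to speed up testing, but that left pre-production unable to catch problems

---

## Staggered deployments

- Deployments have to balance safety and speed (rolling an update out to users worldwide can take weeks)
  - Side note: this trade-off is reminiscent of the well-known CAP theorem (for distributed databases)
- "Waves" are used to strike a good balance between deployment risk and speed
- Waves are arranged and run per Region
- As the figure shows, wave 1 one box advances to wave 1 Prod while a deployment coming from gamma advances into wave 1 one box, so deployment does not wait for one large batched change

----

- wave 1 one box and wave 1 Prod are the first two steps, and they are also the key stage for building confidence
- wave 1 one box deploys to a "single AZ" and first takes roughly 10% of traffic to watch for problems; after the bake time, wave 1 Prod deploys to that entire AZ

----

- With this cautious approach, deployment then proceeds across Regions. In the figure, wave 3 deploys to 3 Regions "in parallel", wave 4 deploys to 12 Regions in parallel, and the remaining Regions are deployed in wave 5
- Applied to our own team, the number of deployments in wave 1 through wave 5 can be decided by the importance and blast radius of the service being deployed

---

## One-box and rolling deployments

- The one-box typically serves 10% of traffic, so a small number of requests can be observed first
- After the one-box, rolling deployments follow
- Each rolling update replaces at most 33% at a time, so that enough boxes stay in service and the live load does not get too heavy
- On that point, if interested see [this article](https://9incloud.com/devops/cicd/ecs-rolling-update-zero-downtime); in other words, `minimumHealthyPercent` needs to be set to 66

---

## Metrics monitoring and auto-rollback

- This mechanism is the key to not needing a human to watch after a deployment
- Each microservice in each Region typically has a high-severity alarm that triggers on thresholds (see the checklist in the original article)
- high-severity aggregate rollback alarm (see the checklist in the original article)

----

- monitor the high-severity alarms for the team’s other microservices to determine when to roll back
  - This implies that microservices are related to one another to some degree, so the metrics alarms of other teams' microservices need to be monitored
  - Taking it a step further, the monitoring or metrics alarm mechanisms of different microservices are shared or standardized across teams

----

- one-box rollback alarm (see the checklist in the original article)
- For P90 and P99 latency, see [this article](https://docs.vmware.com/en/VMware-Tanzu-Service-Mesh/services/slos-with-tsm/GUID-20F1B2A2-7789-44DE-B193-352F8C4BAE23.html)

---

## Bake time

Some problems only surface a while after deployment. Examples, including some from past experience:

- Example from the article: nothing goes wrong under low traffic and low load; the problem appears only later, under high traffic and high load
- Disk I/O gets slower and slower. From past experience: outbound requests could not connect, logs were written furiously, and disk I/O went very high
- Some issue causes heavy writes to logs or the DB, exceeding the DB connection maximum

---

## Supplement

- How many tests are enough? See [Going faster with continuous delivery](https://aws.amazon.com/tw/builders-library/going-faster-with-continuous-delivery/?nc1=h_ls)

The answer to "how much testing is enough?" is a matter of judgment. It requires the team to understand the environment it operates in. To deal with this, we lean on another leadership principle, Ownership. This principle is about thinking long term and not sacrificing long-term value for short-term results. Software teams at Amazon hold a high bar for testing and put a lot of effort into it, because being responsible for a product also means being responsible for the consequences of any defect in it

----

If a problem affects customers, the members of the small, single-threaded software team handle it and fix it in real time. The tension between executing faster and responding to production issues motivates teams to test adequately. However, overinvesting in testing may mean we fail because others move faster. We are always looking to improve the software release process without becoming a blocker to the business.

---

## Supplement - SLA agreements

- Larger SaaS providers generally offer an SLA, e.g. [Travis CI](https://knapsackpro.com/ci_comparisons/travis-ci/vs/netlify-build) and [gitpod](https://www.gitpod-staging.com/enterprise/) (enterprise edition)
- In business-to-business partnerships, one side may require the other to provide an SLA
- Worth thinking about: how far does the whole deployment process need to go before it can back an SLA with a given number of 9s?

For the current state of AWS SLAs, see [here](https://www.ernestchiang.com/en/notes/aws/product-list-sla/?fbclid=IwAR2C_c37kSYKefL9bkfG1kncCeEfUaQtAhVtRlhP45XnrM2Pr1enK0YPuHw)

---

## Summary

- DevOps is a mindset, and this article shows AWS taking that mindset to the extreme
- Although the article is about "deployment", developers, operators, QA, on-call engineers, and others all collaborate, using tooling to automate the entire process and reduce manual monitoring to a minimum
- The whole process also builds strong team confidence, freeing developers from manual deployment and manual monitoring so they can focus on building business models with more core value

---

# Quote to share

“Good intentions never work, you need good mechanisms to make anything happen” - Jeff Bezos.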
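
---

## Supplement - a rolling batch sketch

As a supplement to the "One-box and rolling deployments" slide, here is a minimal sketch of the arithmetic behind "replace at most 33% at a time". The function names `plan_rolling_batches` and `min_healthy_fraction` are hypothetical, and this is not how any particular deployment service computes its batches. It only shows why a 33% cap keeps roughly two thirds of the fleet serving traffic.

```python
import math


def plan_rolling_batches(total_hosts, max_batch_fraction=0.33):
    """Split a fleet into batches, each at most max_batch_fraction of the fleet.

    A batch always contains at least one host so that small fleets still progress.
    """
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    batch_size = max(1, math.floor(total_hosts * max_batch_fraction))
    batches = []
    remaining = total_hosts
    while remaining > 0:
        size = min(batch_size, remaining)
        batches.append(size)
        remaining -= size
    return batches


def min_healthy_fraction(total_hosts, batches):
    """Smallest fraction of hosts still serving while the largest batch updates."""
    return (total_hosts - max(batches)) / total_hosts


if __name__ == "__main__":
    fleet = 100
    batches = plan_rolling_batches(fleet)
    print(batches)
    print(min_healthy_fraction(fleet, batches))
```

With a fleet of 100 hosts the largest batch is 33 hosts, so 67 hosts keep serving during any single batch, which is in line with the 66 mentioned for `minimumHealthyPercent`.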