Cloud migration: Why? How? What happened?

tags: `DevOpsDays Taipei 2018` `9/11` `13:30~14:10` `Track A`

Welcome to the DevOpsDays 2018 shared notes

Shared notes index: https://hackmd.io/c/DevOpsDays2018

Last year: moved from AWS to GCP
- 2017/04 discussed the move
- 2017/10 started moving
- 2018/05 production fully moved

about 17media

App: weekly release
Server: daily release, 30 commits/day
Code Review: Phabricator, CircleCI (auto lint + unit test + e2e test)
Master Branch: CircleCI tests & generates docker image
Deploy: Slack -> Jenkins (generate task definition & update service) -> ECS

chatops?

Architecture while on AWS

Golang

Peak traffic: 10k QPS

Why migrate to another cloud

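Better Docker support (GKE)
- At the time GCP had k8s and AWS did not yet
Cost saving
- GCP is cheaper (about 20% savings)
Analytics
- GCP's ML services are better
Geographic
- The cloud was in US West, while most customers are in Asia (Japan, Taiwan, Hong Kong, Malaysia)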

Having servers in Taiwan really makes a big difference
They really are moving to Taiwan XD

Latency fluctuates at night.

How did you prepare `to reduce risks?`

Risk of the migration process
Risk of the new environment

Risk of the new cloud

Logic issue
- Payment gateway
  Forgetting to register the new cloud's IPs (in the whitelist) could cause problems
- Unit Test
- End to end test
- At least 1 test for each service
Performance issue
- Stress test database
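- Synthetic traffic - extend from e2e test
  - Test pre-planned scenarios
    - One person goes live and many people watch
    - Many people go live at the same time
    - Lots of gifts sent right after going live
  - Essentially a large batch of e2e tests
- Real traffic - GoReplay
  - Copy live production traffic to the new cloud
  - SMS verification messages were sent twice. It later turned out that mirroring the same traffic to the new cloud triggered a second message, so watch out for side effects when doing this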
Data issue
- Use open-sourced dump/restore/sync library
  - Traffic was too heavy, so sync could not keep up
- Seek for help from consultant
  - The only option was to have the vendor modify the sync library to meet…
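
The duplicate SMS noted under "Real traffic - GoReplay" is the kind of side effect a mirrored environment can guard against. A minimal sketch, assuming mirrored requests carry a marker header: the header name `X-Mirrored-Traffic`, the `SmsGateway` stand-in, and `handle_verification` are all hypothetical, and the talk did not say how 17media actually fixed this.

```python
from dataclasses import dataclass, field

# Hypothetical marker header; assumes the replay setup tags every mirrored request.
MIRROR_HEADER = "X-Mirrored-Traffic"


@dataclass
class SmsGateway:
    """Stand-in for a real SMS provider client; only records what would be sent."""
    sent: list = field(default_factory=list)

    def send(self, phone: str, text: str) -> None:
        self.sent.append((phone, text))


def handle_verification(headers: dict, phone: str, gateway: SmsGateway) -> str:
    """Run the verification flow, but suppress the SMS for mirrored requests."""
    code = "123456"  # placeholder; a real service would generate a random code
    if headers.get(MIRROR_HEADER) == "1":
        return "skipped"  # mirrored copy: exercise the code path, send nothing
    gateway.send(phone, f"Your code is {code}")
    return "sent"
```

The same check would apply to any other externally visible action (payments, push notifications, emails) reached by replayed traffic.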

Reduce Risk of migration process

Complicated migration == High risk
Offline migration => Simple but high risk
Accident during official migration

Plan A: 1-time downtime migration

Step1. Cut-off traffic in AWS
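Step2. Dump & restore database
Step3. Deploy containers / VM
Step4. Turn traffic back on
Takes 4 hours
If it fails, rolling back to the old cloud takes another 4 hours
To shorten this, the databases need to be synced instead
- The two clouds are too far apart, so sync is too slow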

Migrate in stages

Step1. Move to GCP Oregon first
Step2. Then move to GCP Taiwan

Database migration within the same cloud provider gets vendor support

Plan B: Many online migrations

Moving to Oregon first keeps latency low, so the migration can be split into stages

Database Migration

Move MongoDB first
Then move MySQL

Move anything money-related later

VM + Redis = Container migration

VMs/containers access Redis heavily and need low latency, so they have to be moved together

online migration

DB sync between the two clouds
All configuration lives in etcd
For about 10 seconds the application is disconnected from the DB, to let the DB sync complete
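
The cutover described above (pause DB access, let sync finish, repoint through config) can be sketched roughly as follows. This is an illustration under assumptions, not the speaker's implementation: a plain dict stands in for etcd, and the keys `db/paused` and `db/url` and the `get_lag_seconds` callback are hypothetical names.

```python
import time
from typing import Callable, Dict


def cut_over(
    config: Dict[str, str],
    get_lag_seconds: Callable[[], float],
    new_db_url: str,
    timeout_s: float = 10.0,
    poll_s: float = 0.01,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Pause DB access, wait for the target DB to catch up, then repoint the app.

    Returns True if the config now points at the new DB, False if the old
    setting was kept because the target did not catch up before the timeout.
    """
    config["db/paused"] = "true"  # application stops talking to the DB
    deadline = clock() + timeout_s
    try:
        while get_lag_seconds() > 0:
            if clock() >= deadline:
                return False  # give up; db/url is left untouched
            sleep(poll_s)
        config["db/url"] = new_db_url  # switch only once fully synced
        return True
    finally:
        config["db/paused"] = "false"  # always resume, on success or failure
```

Keeping the old URL on timeout means a slow sync costs a retry rather than a split between two databases.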

Reduce Risk of migration process

Accident during official migration
- Runbook
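  - Write out every step to perform, one by one
  - Rehearsed 10 times
  - Rollback has to be rehearsed too
- Dryrun
  - A retrospective report was written after every run (one round trip takes 8 hours)
- Pilot
  - The person who knows the process does not run it. Someone unfamiliar runs it while the expert watches, which is how you find out whether the runbook has problems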
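
One way to make "rehearse the rollback too" concrete is to pair each runbook step with its undo action. A minimal sketch, assuming steps are plain callables; `run_runbook` and the step names are hypothetical and not taken from the talk.

```python
from typing import Callable, List, Tuple

# Each step is (name, do, undo); undo reverses whatever do changed.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_runbook(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on the first failure, undo completed steps in reverse."""
    done: List[Step] = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            for prev_name, _, prev_undo in reversed(done):
                prev_undo()
                log.append(f"rolled back {prev_name}")
            return False
        done.append((name, do, undo))
        log.append(f"done {name}")
    return True
```

The log doubles as raw material for the retrospective report written after each dry run.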

What happened on that day

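Incident 1: migration postponed by one day
- Jenkins -> Terraform -> S3
  - Jenkins could not connect to S3 .. so the infra environment could not be restored
- It recovered by itself the next morning…
Incident 2: MongoDB migration started late
- 5/2 schedule:
  - 4:00 Assemble and prepare
  - 5:00 MongoDB migration
- It turned out DB sync had not caught up yet… (still two hours short of finishing the sync…)
- Fortunately it caught up by 6:00

5/3 schedule
3:00 Assemble and prepare
3:45 MySQL migration
4:20 VM migration
4:30 Redis + container migration

Incident 3: almost moved back that night
- Intermittent 502s… (3~5% of requests returned 502…)
- The cause turned out to be a read limit on the k8s internal DNS .. (kube-dns? or CoreDNS?)
  - Workaround: pin the IPs (presumably via a hosts table)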
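
The notes only guess that the workaround was a hosts table. As an illustration of the same idea at the application level, the sketch below resolves each hostname once and reuses the answer so the internal DNS is no longer queried per request. `PinnedResolver` is a hypothetical helper, not what the speaker described.

```python
import socket
from typing import Callable, Dict


class PinnedResolver:
    """Resolve each hostname once and reuse the answer, like a static hosts table."""

    def __init__(self, lookup: Callable[[str], str] = socket.gethostbyname):
        self._lookup = lookup
        self._pinned: Dict[str, str] = {}

    def pin(self, host: str, ip: str) -> None:
        """Hard-code an address so the resolver never queries DNS for this host."""
        self._pinned[host] = ip

    def resolve(self, host: str) -> str:
        """Return the pinned address, querying DNS only on the first lookup."""
        if host not in self._pinned:
            self._pinned[host] = self._lookup(host)
        return self._pinned[host]
```

The trade-off is staleness: a pinned address keeps being used even after the service behind it moves, so this fits a stopgap better than a permanent fix.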

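That ends the first half. The second half is moving GCP Oregon to GCP Taiwan, and anyone interested in that is welcome to join
Speaker

Does "the second half" above mean joining 17 to take part in it?

Would like to know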

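Side chat: feel free to chat below

As I recall, one of the more active AWS resellers in the domestic market is 昕奇, and their flagship customer is 17.
I wonder what their take would be

Q. Did the speaker say why they migrated?

Seeing the "why migrate" question makes me want to gossip a bit: a while ago Google poached a whole bunch of people from AWS for GCP
I didn't hear it, maybe my ears failed me for a moment >.<
Didn't hear it, maybe it wasn't mentioned?
Region issue (US West -> Taiwan) && GCP is much cheaper than AWS
So it helps on both fronts: $$ and quality

Their QPS is really high (10k/sec) wow..
GoReplay -> replays real traffic (implemented in Go)

VM + Redis => because every request needs to access a lot of cache …
How do you move the above without downtime…? (given that the cloud provider is changing…)

A cloud migration usually calls for 3 or more rehearsals, but they rehearsed 10 times..

There's no such thing as truly zero downtime, only downtime nobody notices XD

Replying to the above: an outage users don't notice isn't an outage..

So settle the goal first: as long as users can't feel it, that's good enough

I have to say, having engineers willing to go along with this is really not easy. Most engineers have very, very little patience for this kind of thing. They say
"it will definitely be fine", and then something breaks =口=

MongoDB sync not keeping up refers to…? The MongoDB clusters in AWS + GCP…?

"Not keeping up" above should mean the MongoDB data on GCP was still behind the MongoDB on AWS (per the speaker, at 05:00 it was still behind by two hours' worth of data)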
