# PostgreSQL: a large number of conflicts

I have recently been investigating a standby PostgreSQL server that produces many errors like this:

```
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
```

The cause is that the standby server gets stuck for some reason while applying WAL. Once WAL has gone unapplied for longer than `max_standby_streaming_delay`, the standby raises an error and terminates the conflicting session, which shows up as a conflict.

# My original understanding

1. Someone deletes/updates a row on the master.
2. No other query on the master uses this row, so it becomes a dead tuple and vacuum removes it.
3. A query on the standby server has selected this row (and the query has not finished yet).
4. Through WAL, the standby server also applies the delete/update to this row.
5. Through WAL, the standby server replays the vacuum.
6. During the vacuum, the row turns out to still be selected by the query from step `3`, so it cannot be vacuumed.
7. If this lasts longer than `max_standby_streaming_delay`, an old snapshot conflict occurs.

# Hypothesis

During vacuum:

- A dead tuple records an xid (transaction ID). As long as the oldest xid in the database is older than the dead tuple's xid, the tuple is not vacuumed; otherwise it can be vacuumed.
- I originally thought the comparison was made per table. Because it is made against the oldest xid in the database, every query in the database affects the vacuum of any single table. The scenario looks like this:
    1. On the master server, someone deletes/updates a row in tableA (say xid = 5).
    2. The oldest xid on the master server is 6. Because 5 is older than 6, no query is judged to need the previous version of this row, so tableA can be vacuumed and the row removed.
    3. On the standby server, a long running query selects from tableB (xid = 100).
    4. Through WAL, the standby server also applies the delete/update to this row (xid = 101).
    5. Through WAL, the standby server replays the vacuum of tableA. At this point the oldest xid on the standby is 100, which is older than the row to be removed (101), so the row cannot be vacuumed.
    6. If this lasts longer than `max_standby_streaming_delay`, an old snapshot conflict occurs.

# Experimental verification

Goal: show that a long running query on table A blocks the vacuum of table B.

```
PS: To simulate a production long running query locally with a transaction opened by begin, I changed the isolation level on my local machine from the default read committed to repeatable read.
```

1. Start a master and a slave locally, then create a test db and two tables.
   ![](https://hackmd.io/_uploads/rJKQX9Q83.png)
2. Insert one row into each of the tables `roles` and `profiles`.
   ![](https://hackmd.io/_uploads/HJCYX5mU2.png)
3. On the slave, open a long running transaction and select the data in `roles`.
   ![](https://hackmd.io/_uploads/H16Rm9m8h.png)
4. On the master, update `profiles` and run vacuum.
   ![](https://hackmd.io/_uploads/HJMXV57Un.png)
5. On the slave, once `max_standby_streaming_delay` has elapsed, the transaction hits a conflict.
   ![](https://hackmd.io/_uploads/SJJu45Q83.png)

This shows that even though the table being vacuumed is `profiles`, the vacuum is still blocked by the long running query on `roles`.

# Mitigation

- Leave `hot_standby_feedback = on` out of consideration for now, to avoid affecting master performance.
- As long as the standby frequently runs long running queries, it will see replication lag and conflict errors, so the first goal is to reduce long running queries.
- If a table on the master is vacuumed very often, try reducing the frequency.
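
Returning to the hypothesis section: the database-wide xid comparison can be sketched in a few lines of Python. This is only a toy model of the rule as stated in this note, not PostgreSQL's actual implementation, and `can_vacuum` and its parameters are hypothetical names chosen for illustration.

```python
def can_vacuum(dead_tuple_xid, active_xids):
    """Toy model of the rule in the hypothesis (not PostgreSQL internals).

    A dead tuple may be removed only if no active xid anywhere in the
    database is older than the xid recorded on the dead tuple.
    """
    if not active_xids:
        # Nothing is running, so nothing can still need the old row version.
        return True
    return min(active_xids) >= dead_tuple_xid


# Master: the row was deleted/updated by xid 5, the oldest active xid is 6.
print(can_vacuum(5, [6]))      # True

# Standby: the change arrives as xid 101 while a query on another table holds xid 100.
print(can_vacuum(101, [100]))  # False
```

The table name never appears in the check, which matches the experiment: a query on `roles` is enough to block the vacuum of `profiles`.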