# Post-Mortem: 20 August Bridge Outage

## What Happened?

__Started:__ Around 3:25pm UTC on 19 August, a user in Discord reported having trouble with the xDai Bridge. They were [trying to bridge Dai from Ethereum](https://etherscan.io/tx/0xecf407d6ad5ffba517c8c3589ea599488e577990d5a408114c0da288c44980f8) to xDai on Gnosis Chain and hadn't received their funds in their target wallet address. It was determined that they weren't the only user still waiting on funds: some transactions went through, while others were stuck.

__Recovered:__ Around 4:40pm UTC on 20 August, users reported that transactions had gone through and that they were receiving the appropriate amount of xDai on the Home Network (Gnosis Chain).

## Technical Explanation

During the outage, only 3/6 validators were online and available to affirm transactions. However, 4/6 confirmations are required for a transaction to be relayed. The following validators were identified as offline or having limited uptime:

| Organization | Validator Address | Suspected Cause of Downtime |
| ------------ | ----------------- | --------------------------- |
| CowSwap | 0x587c0d02b40822f15f05301d87c16f6a08aaddde | Validator ran out of memory |
| Protofire | 0x4d1c96b9a49c4469a0b720a22b74b034eddfe051 | Validator ran out of funds required to submit transactions |
| Giveth | 0xc073C8E5ED9Aa11CF6776C69b3e13b259Ba9F506 | Validator seems to be having [ongoing issues](#giveth-downtime) with keeping 100% uptime |

### Additional Issues Discovered:

* __Many failed `executeAffirmation()` calls:__ It seems that when some validators are slow to sign (or are offline altogether, as was the case during the downtime), the faster ones try to sign again and get an error for duplicated affirmations. The Gnosis DAO and Gnosis Safe validators appear to be fast, which could explain why they frequently get that error. During the downtime, the same validator would try to affirm the same stuck transaction many times.
  Failed transactions can also be expected behavior when all validators try to sign the transaction and the signatures that are no longer needed are reverted.

  <details>
  <summary>Click to view some example transactions with duplicate affirmations</summary>

  * https://blockscout.com/xdai/mainnet/tx/0xc73df8386023621c6a5bcdcb540ad8dac63433b8c87244940259baa46f611c7e
  * https://blockscout.com/xdai/mainnet/tx/0xd0ae540f9acd12837983def4fab22b3b43a4f55485fd4d2a8c2a4f86782bd778
  * https://blockscout.com/xdai/mainnet/tx/0xa52fbf5fdb5e4e475819cddf6e4335419f0d5d11b2c5c497093a1bd2fb3801e0
  * https://blockscout.com/xdai/mainnet/tx/0xcef4185d34ec1d2ca54ea086acef00657c1b229780db4b2998f4c865afeee5b4
  * https://blockscout.com/xdai/mainnet/tx/0xe56c2aa2b57000c86058bad667d78c97ccaa7f86f1c3683165e4eaca9e384602

  </details>

* __Many failed `submitSignature()` calls:__ This seems to be a similar case to the one above: some validators are too slow to sign, and the faster ones try to sign again. Failed transactions can also be expected behavior when all validators try to sign the transaction and the unneeded signatures are reverted. In that case, there will be 4 successful `submitSignature()` calls with duplicate messages (indicating the validators are signing the same transaction), followed by a series of failures (the unneeded signatures).
  See below:

  <details>
  <summary>Click to View Details</summary>

  ![](https://i.imgur.com/EpAmHkX.png)

  All of those transactions have the same `message` param: `0x53c46a9705bbcc76866ede57a795af7f3eaa363700000000000000000000000000000000000000000000006f21770aa26ec80000cad9b92acf799234de13d173675d0c492805e090dd21d0cab75a2d2ad181cd914aa42145aa6ebf72e164c9bbc74fbd3788045016`

  For Reference:

  * https://blockscout.com/xdai/mainnet/tx/0x2194b69ac95d384b25bf4676b186bf1cd93077f5891b9092e9e40469bb4e4379
  * https://blockscout.com/xdai/mainnet/tx/0x293316cf88b18b67fce2232213d4d11d32b72e70285e3857b0a0d8c9b761ff30
  * https://blockscout.com/xdai/mainnet/tx/0xc497a5660c9089c88f0fbd1e4a646319e322a128aff466bc86944dcb13858da8
  * https://blockscout.com/xdai/mainnet/tx/0x7acedbf22cb1ba702b534e3839f3e4dd7c0648b98ed7f97bbeed328cee9dab87
  * https://blockscout.com/xdai/mainnet/tx/0x6237ed549c37c5e0726f23ee9d64e5905af145f79a1a421484f213cd6b48055c
  * https://blockscout.com/xdai/mainnet/tx/0xb8e00dc8a8a9a73e378047c76754d6d6cec3c1aafd165b100ee488d1b8b3278e

  </details>

### Giveth Downtime

Upon further inspection, it seems that the Giveth validator is experiencing limited uptime. The Giveth validator is submitting far fewer affirmations than the other validators, as illustrated by the following visualizations:

![](https://i.imgur.com/QScH2ad.png)

![](https://i.imgur.com/M0uVfTk.png)

* https://dune.com/queries/1197177

## Recommendations

To prevent this issue from happening again, there are several possible actions to take:

### Increase Size of Validator Set

Adding validators could increase redundancy, and therefore the reliability of the bridges. The more validators in the set, the lower the odds that 50% of them will be down at any given time. However, validators must be trusted members of the Gnosis community.

### Better Monitoring

Not knowing of issues until a user messages in the Discord is not an ideal incident management strategy.
Ideally, organizations running validators that are experiencing issues could be notified (via SMTP, SMS, etc.) when the validator's percentage of missed attestations exceeds a certain threshold. Had CowSwap, Protofire, and Giveth been notified that their validators were experiencing issues when they first arose, they could have taken action earlier, before any interruption to bridge service.

### Exponential Backoff

Over the course of normal bridge operations (and with increased frequency during the outage), there are many failed `executeAffirmation()` and `submitSignature()` calls due to a validator repeatedly calling these methods to affirm the same transactions. This wastes gas and could have contributed to the Protofire validator running out of funds. During the outage, online validators were observed trying to affirm the same transaction, sometimes more than 10 times, and all but the first call were failures. This may be caused by validators not having an effective way to determine whether they have already affirmed a transaction. Creating or improving such a check could help reduce wasted funds.

Another way to reduce the volume of failed calls is exponential backoff: if a validator is experiencing call failures (due to network connectivity issues or any number of other reasons), it would wait an exponentially longer amount of time between calls until either calls start succeeding or it flips a "circuit breaker" (notifying operators and stopping calls for a period of time before retrying).
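
To make the backoff recommendation above more concrete, here is a minimal Python sketch of a retry loop with exponential backoff and a circuit breaker. It is an illustration only, not the bridge oracle's actual code: `BackoffPolicy`, `backoff_delay`, `call_with_backoff`, the `send` callable, and every numeric default are hypothetical names and values chosen for this example.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class BackoffPolicy:
    """Hypothetical tuning knobs; the defaults are illustrative, not recommendations."""
    base_delay: float = 1.0    # seconds to wait after the first failure
    factor: float = 2.0        # multiplier applied after each further failure
    max_delay: float = 300.0   # upper bound on a single wait
    max_failures: int = 6      # consecutive failures that trip the breaker
    cooldown: float = 1800.0   # seconds with no calls once the breaker trips


def backoff_delay(policy: BackoffPolicy, failures: int) -> float:
    """Wait time after `failures` consecutive failures (failures >= 1)."""
    return min(policy.max_delay, policy.base_delay * policy.factor ** (failures - 1))


def call_with_backoff(send, policy=BackoffPolicy(), sleep=time.sleep, notify=print):
    """Call `send()` until it succeeds, waiting longer after each failure.

    `send` stands in for whatever submits the validator's transaction
    (for example a wrapper around `executeAffirmation()`); it is assumed
    to raise an exception on failure. `notify` stands in for an operator
    alert such as email or SMS.
    """
    failures = 0
    while True:
        try:
            return send()
        except Exception as exc:  # a real validator would inspect the error type
            failures += 1
            if failures >= policy.max_failures:
                # Circuit breaker: alert operators, pause all calls, then start over.
                notify(f"circuit breaker tripped after {failures} failures: {exc}")
                sleep(policy.cooldown)
                failures = 0
            else:
                sleep(backoff_delay(policy, failures))
```

With the defaults above, the waits between failed calls would be 1, 2, 4, 8 and 16 seconds, followed by an alert and a 30-minute pause on the sixth consecutive failure. In practice, adding random jitter to each delay helps keep several validators from retrying in lockstep. A duplicate-affirmation revert is also probably better handled by checking whether the transaction was already affirmed than by retrying at all.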