Kreivo June 2024 Stalling post-mortem

1. Incident Overview

What is the Kreivo parachain?

Kreivo is a parachain, part of the Kusama network, that provides governance and payments services for decentralized companies (DAOs) with a focus on SMBs and web3 startups.

When did the incident occur (exact date and time)?

Friday, 29th June 2024, 3:55:42 UTC

What was the duration of the incident?

The incident is still ongoing. We first need to pass a governance-based authorization on Kusama to unlock the parachain management module, and then upgrade the parachain with a fix. The expected date for the fix is around 10th to 12th July 2024.

2. Impact

  • What services or functionalities were affected? The entire parachain is not in service at the moment. However, it is still possible to query information from the chain's database.
  • How many users or systems were impacted? 13 DAOs (according to https://preview.virto.app)
  • Were there any financial losses or other significant impacts? Virto DAO employees have lost around $4,632 due to the inability to receive payouts and the subsequent drop in the value of KSM.

3. Detection

How was the incident detected?

At around 5:12 UTC on 29/06/2024, we noticed on block explorers that block production had been stalled for about two hours.

Who detected the incident (internal team, users, automated systems, etc.)?

Pablo Dorado, Software Engineer from Internal Team.

What were the initial signs or symptoms of the problem?

The stalling of block production, and the inability to execute new extrinsics.

4. Response

What immediate actions were taken to address the incident?

We investigated possible causes and researched how to proceed to unlock the parachain.

Which teams or individuals were involved in the response?

Virto's Blockchain Team:

  • Pablo Dorado
  • Daniel Olano.

Were there any challenges or obstacles faced during the response?

Two obstacles stood out. The first was the lack of information on how to restore access to the registrar module. The second was the lack of clear context as to why block production stalled, since this case did not resemble prior experiences of parachains stalling.

5. Root Cause

What was identified as the root cause of the incident?

We had migrated Kreivo and its associated dependencies (nodes, pallets) to Polkadot SDK 1.12, the latest version at the time, using a fork we maintain with changes that support essential functions for Kreivo (like dynamic tracks for governance, or holds for assets). The migration was meant to prepare Kreivo for Async Backing and for running with 6-second blocks. After it, we found that polkadot-sdk#4326 had changed the CheckWeight logic to check the combined PoV size (proof size + extrinsic length). With this change, inherents (required extrinsics that must be executed at the beginning of each block) were incorrectly invalidated for resource exhaustion, because CheckWeight detected an excess in PoV size.
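The effect of a combined check like this can be illustrated with a minimal sketch. It is a simplified model and not Polkadot SDK code: the `PovBudget` type, the `check` function, the `exempt_mandatory` flag and the byte figures used with it are all hypothetical. The exemption for mandatory extrinsics is only an assumption about how a fix for this class of problem could behave.

```rust
/// Dispatch classes, loosely modelled on the normal / mandatory split.
/// Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DispatchClass {
    Normal,
    Mandatory,
}

/// Reason a (modelled) extrinsic is rejected.
#[derive(Debug, PartialEq)]
enum CheckError {
    ExhaustsResources,
}

/// Hypothetical per-block PoV budget tracker (not an SDK type).
struct PovBudget {
    max_pov_bytes: u64,
    used_proof_bytes: u64,
    used_len_bytes: u64,
}

impl PovBudget {
    /// Combined check: proof size plus extrinsic length must fit the PoV limit.
    /// `exempt_mandatory` models an assumed fix that skips the check for
    /// mandatory extrinsics such as inherents.
    fn check(
        &self,
        class: DispatchClass,
        proof_bytes: u64,
        len_bytes: u64,
        exempt_mandatory: bool,
    ) -> Result<(), CheckError> {
        if exempt_mandatory && class == DispatchClass::Mandatory {
            return Ok(());
        }
        let total = self
            .used_proof_bytes
            .saturating_add(self.used_len_bytes)
            .saturating_add(proof_bytes)
            .saturating_add(len_bytes);
        if total > self.max_pov_bytes {
            Err(CheckError::ExhaustsResources)
        } else {
            Ok(())
        }
    }
}
```

In this model, a large inherent whose proof size alone fits the limit can still be rejected once its encoded length is added on top, which is the kind of rejection described above.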

As a consequence, the inherent that sets the validation data gathered from the Relay Chain into the block state could not be executed and included in the block. When the block was being finalized, the logic that validates the existence of that data panicked, so the block could not be finished and authored.
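The failure at finalization can be sketched in the same spirit. The `BlockState` type and its methods below are hypothetical stand-ins, not the real pallet code. They only show why a missing inherent turns into a panic rather than a recoverable error.

```rust
/// Hypothetical, heavily simplified block state (not an SDK type).
#[derive(Default)]
struct BlockState {
    /// Filled in by the validation-data inherent at the start of the block.
    validation_data: Option<Vec<u8>>,
}

impl BlockState {
    /// Stand-in for the inherent that stores the Relay Chain validation data.
    fn apply_validation_data_inherent(&mut self, data: Vec<u8>) {
        self.validation_data = Some(data);
    }

    /// Stand-in for block finalization: panics if the inherent never ran.
    fn finalize(self) -> Vec<u8> {
        self.validation_data
            .expect("validation data inherent must be present in every block")
    }
}
```

If `apply_validation_data_inherent` is skipped because the inherent was rejected earlier, `finalize` panics and no block is produced.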

Were there any contributing factors?

The big jump from Polkadot SDK v1.5 to v1.12 made it really difficult to track down the issue at first, as there were many possible factors behind the stalling of block production. Initially, we believed the issue was caused by a non-monotonic increase of block slots that occurred immediately after the latest upgrade. Then, a theory based on prior experiences of parachains stalling led us to believe the PoV had been exceeded because of an unfinished migration that caused some pallets' logic to fail. Finally, we could conclude that the moment when the stalling occurred coincided with the end of a parachain session.

Another factor was the lack of information about this fix. We worked out the error by hand, after removing CheckWeight as a test measure in some simulations. Later on, we found out that a further fix (polkadot-sdk#4571) describing exactly this issue had been merged in mid-June and then backported to 1.12, but only in the release hosted on crates.io, as the release tags for major versions of Polkadot SDK are kept as is. We could not have discovered this PR easily without the help of an architect from the Polkadot Technical Fellowship. Because the issue was discovered at a testnet stage and was quickly backported, news of it did not spread widely.

6. Resolution

How was the incident resolved?

The resolution of the incident follows two steps:

  1. Applying a backport of the PR polkadot-sdk#4571 that fixed the error introduced by polkadot-sdk#4326. This is already done.
  2. Unlocking the parachain's registrar entry so that the parachain's manager (Daniel Olano) regains access to it. This makes it possible to schedule an upgrade and to apply a code substitute on Kreivo's collators. The code substitute overrides the on-chain version of the runtime, allowing the chain to overcome this issue and start producing blocks again. This step requires approval from the Technical Fellowship, due to how we raised the referendum to make this unlock possible.

What steps were taken to ensure the parachain returned to normal operations?

So far, we are still working on returning the parachain to normal operation.

7. Lessons Learned

What went well during the incident response?

  • We found the cause pretty quickly, and the technical fix was released in under two days (which was good, considering the serious lack of resources to find it).
  • This occurred at a very early stage. With the exception of KSM.Community (an already active DV for Kusama) and Virto (which has all of its corporate funds stored in Kreivo), our users are mostly beta testers with no significant funds held in the parachain.
  • This shouldn't affect our users' funds held in the parachain. Our native token is KSM, which was not affected in any way by this issue (according to market correlations with similar assets in the ecosystem). Other major tokens held include DOT and USDT, but no proprietary tokens from Virto.

What could have been done better?

  • Releasing first in a testnet. We lacked a testnet that could help identify the issue before deploying on production.
  • We made a huge migration in a single step, which made it hard to identify the primary cause of the disruption.

What measures will be implemented to prevent similar incidents in the future?

  • Release on a testnet first.
  • Increase efforts to migrate to stable versions of the Polkadot SDK more often. Hopefully, the stabilization of the release cadence (from 2-3 weeks to 3 months) will help hugely with this step.
  • Push for better communication of known issues in the Polkadot SDK, in the best interest of parachain builders. We hope our experience helps improve this process of communicating known issues.