Kreivo is a parachain, part of the Kusama network, that provides governance and payments services for decentralized companies (DAOs) with a focus on SMBs and web3 startups.
Saturday, 29th June 2024, 3:55:42 UTC
The incident is still ongoing: we need to pass a governance-based authorization on Kusama to unlock the parachain management module, and then upgrade the parachain with a fix. The expected date for the fix is around 10th to 12th July 2024.
At around 29/06/2024 5:12am UTC, we detected through Block Explorers that the block production had been stalled for two hours.
Pablo Dorado, Software Engineer from Internal Team.
The stalling of block production, and the inability to execute new extrinsics.
Figure out possible causes. Research on how to proceed to unlock the parachain.
Virto's Blockchain Team:
The lack of information regarding restoring access to the registrar module.
The lack of clear context as to why the block production stalled, since this case is not similar to prior experiences of parachains stalling.
We had migrated Kreivo and its associated dependencies (nodes, pallets) to the then-latest version of Polkadot SDK, 1.12, on a fork we maintain with changes that support essential functions for Kreivo (like dynamic tracks for governance, or holds for assets). We made this migration to prepare Kreivo for Async Backing and for running with 6-second blocks. After the migration, we found that a change in the CheckWeight logic, introduced by polkadot-sdk#4326, makes it check the combined PoV size (proof size + extrinsic length). This change caused inherents (required extrinsics that must be executed at the beginning of each block) to be incorrectly invalidated for resource exhaustion, because the combined PoV size exceeded the limit that CheckWeight now enforces.
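The effect of that change can be illustrated with a minimal sketch. This is a toy model, not the Polkadot SDK implementation: the types `Class`, `BlockUsage` and `CheckError`, the constant `MAX_POV_SIZE`, and all three functions are hypothetical names introduced here, and the "before" check is a deliberate simplification. The last function shows one way such a check could avoid rejecting inherents; it is an assumption for illustration, not a description of any specific upstream fix.

```rust
/// Dispatch class of an extrinsic; inherents are modelled as `Mandatory`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Class {
    Normal,
    Mandatory,
}

/// Hypothetical running totals for the block being built, in bytes.
pub struct BlockUsage {
    pub proof_size: u64,
    pub extrinsics_len: u64,
}

/// Assumed PoV limit, chosen only for illustration.
pub const MAX_POV_SIZE: u64 = 5 * 1024 * 1024;

#[derive(Debug, PartialEq)]
pub enum CheckError {
    ExhaustsResources,
}

/// Simplified "before" check: only the proof size is compared to the limit.
pub fn check_proof_size_only(usage: &BlockUsage, ext_proof: u64) -> Result<(), CheckError> {
    if usage.proof_size.saturating_add(ext_proof) > MAX_POV_SIZE {
        return Err(CheckError::ExhaustsResources);
    }
    Ok(())
}

/// Simplified "after" check: proof size plus encoded extrinsic length
/// must fit within the limit, for every extrinsic regardless of class.
pub fn check_combined_pov(
    usage: &BlockUsage,
    ext_proof: u64,
    ext_len: u64,
) -> Result<(), CheckError> {
    let total = usage
        .proof_size
        .saturating_add(ext_proof)
        .saturating_add(usage.extrinsics_len)
        .saturating_add(ext_len);
    if total > MAX_POV_SIZE {
        return Err(CheckError::ExhaustsResources);
    }
    Ok(())
}

/// Assumed remedy, for illustration only: mandatory extrinsics skip the
/// combined check, so an inherent can no longer be rejected by it.
pub fn check_combined_pov_exempting_mandatory(
    usage: &BlockUsage,
    ext_proof: u64,
    ext_len: u64,
    class: Class,
) -> Result<(), CheckError> {
    if class == Class::Mandatory {
        return Ok(());
    }
    check_combined_pov(usage, ext_proof, ext_len)
}
```

In this model, a block that has already used `MAX_POV_SIZE - 100` bytes of proof accepts an inherent adding 50 bytes of proof under the proof-size-only check, but rejects it under the combined check once the inherent's 100 bytes of encoded length are counted.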
As a consequence, the inherent that sets the validation data gathered from the Relay Chain into the block state couldn't be executed and included in the block. Therefore, when trying to finalize the block, the logic that validates the existence of that data panicked, preventing the block from being finalized and authored.
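The failure mode described above can be sketched as a small model. All names here (`ValidationData`, `BlockState`, `FinalizeError`, `try_finalize`, `finalize`) are hypothetical and do not mirror the actual pallet code; the sketch only assumes that finalization requires data which an inherent is supposed to have written earlier in the same block.

```rust
/// Hypothetical stand-in for the data a parachain block takes from the Relay Chain.
#[derive(Clone, Debug, PartialEq)]
pub struct ValidationData {
    pub relay_parent_number: u32,
}

/// Hypothetical per-block state; the validation data starts out unset.
#[derive(Default)]
pub struct BlockState {
    validation_data: Option<ValidationData>,
}

#[derive(Debug, PartialEq)]
pub enum FinalizeError {
    MissingValidationData,
}

impl BlockState {
    /// Models the inherent: stores the validation data in the block state.
    pub fn set_validation_data(&mut self, data: ValidationData) {
        self.validation_data = Some(data);
    }

    /// Non-panicking view of finalization: reports whether the data is present.
    pub fn try_finalize(&self) -> Result<u32, FinalizeError> {
        self.validation_data
            .as_ref()
            .map(|data| data.relay_parent_number)
            .ok_or(FinalizeError::MissingValidationData)
    }

    /// Models the behaviour in the incident: finalization treats missing
    /// validation data as unrecoverable and panics, so no block is produced.
    pub fn finalize(&self) -> u32 {
        self.try_finalize()
            .expect("validation data must be set by an inherent in every block")
    }
}
```

If the inherent is rejected before it runs, `set_validation_data` is never called, `try_finalize` returns `MissingValidationData`, and `finalize` panics. In this model, that corresponds to every block-authoring attempt failing in the same way.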
The big jump from Polkadot SDK v1.5 to v1.12 made it really difficult to pin down the issue initially, as there were many possible factors around the stalling of block production. Initially, we believed the issue was caused by a non-monotonic increase of block slots that occurred immediately after the latest upgrade. Then, a theory based on prior experiences of parachains stalling led us to believe the PoV limit had been exceeded because of an unfinished migration that caused some pallets' logic to fail. Finally, we concluded that the moment when the stalling occurred coincided with the end of a parachain session.
Another factor was the lack of information about this fix. We figured out the error by hand, after removing CheckWeight as a test measure in some simulations. Later on, we found out that a further fix (polkadot-sdk#4571) describing exactly this issue was merged mid-June and then backported to 1.12, but only in the release hosted on crates.io, as the release tags for major versions of Polkadot SDK are kept as is. This PR couldn't have been discovered easily without the help of an architect from the Polkadot Technical Fellowship, but since the issue was discovered at a testnet stage and was quickly backported, it never spread widely.
The resolution of the incident follows two steps:
First, unlock the registrar of the parachain to give access to the parachain's manager (Daniel Olano), so that it's possible to mark an upgrade. This unlock requires approval from the Technical Fellowship, due to how we raised the referendum to make it possible. Second, apply a code substitute on Kreivo's collators, in an effort to override the on-chain version of the runtime, allowing it to overcome this issue and start producing blocks again. So far, we are still working to return the parachain to normal operation.