Moonsama executed the schedule_code_upgrade call on the relay chain to upgrade their parachain. The problem with this call is that it sets the GoAhead signal, which causes the parachain to fail because it is not expecting the signal. There is an open issue to solve this by not setting the signal: https://github.com/paritytech/polkadot/issues/7202

I think we should be able to fix this from the parachain by overwriting the runtime code on the parachain side and then issuing a new upgrade.

Prepare a new runtime upgrade

  1. Take the same code that was used to generate the runtime that is currently running on the parachain. Not the runtime that was passed to schedule_code_upgrade.
  2. Patch cumulus to use a fork (again same branch/commit as being used for the on chain runtime). In this fork you need to remove the following code: https://github.com/paritytech/cumulus/blob/21919c89ad99fc4e2e6f12dd6679eb2d41892ac3/pallets/parachain-system/src/lib.rs#L397-L419
  3. Build the runtime that includes this patch (let's call it PATCHED_RUNTIME).

Apply the runtime to your parachain

Substrate loads the runtime from the state. So, to build block X it takes the runtime from block X - 1. However, we support overriding this mechanism through the chain spec, which then requires every node on the network to use this chain spec.

The following needs to be added to the chain spec (check polkadot.json for an example):

"codeSubstitutes": {
    "132724": "HEX_STRING_OF_PATCHED_RUNTIME"
}

132724 is the number of the block from which on the patched runtime should be used. This is the number of the latest block that you built.
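
As a minimal sketch of the step above, the following Python helper hex-encodes a runtime blob and adds it under codeSubstitutes in a chain spec. The function names and file paths are hypothetical, and the 0x prefix on the hex string is an assumption about the expected encoding; compare with the polkadot.json example before relying on it.

```python
import json
from pathlib import Path


def add_code_substitute(chain_spec: dict, block_number: int, wasm: bytes) -> dict:
    """Return a copy of chain_spec with a codeSubstitutes entry for block_number."""
    updated = dict(chain_spec)
    # Keep any substitutes that are already present in the chain spec.
    substitutes = dict(updated.get("codeSubstitutes") or {})
    # Key: block number as a string. Value: hex of the runtime blob.
    # The "0x" prefix is an assumption about the encoding, not confirmed by the post.
    substitutes[str(block_number)] = "0x" + wasm.hex()
    updated["codeSubstitutes"] = substitutes
    return updated


def patch_chain_spec_file(spec_path: str, wasm_path: str, block_number: int, out_path: str) -> None:
    """Read a chain spec and a runtime blob from disk and write the patched spec."""
    spec = json.loads(Path(spec_path).read_text())
    wasm = Path(wasm_path).read_bytes()
    patched = add_code_substitute(spec, block_number, wasm)
    Path(out_path).write_text(json.dumps(patched, indent=2))
```

A call such as patch_chain_spec_file("chain-spec.json", "patched_runtime.wasm", 132724, "chain-spec-patched.json") would produce the file to hand out to node operators; all three file names are placeholders.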

This chain spec then needs to be distributed to all nodes, and the collators should then start producing blocks again. However, the blocks will fail validation on the relay chain. Why? Because we removed some code that is still part of the blob registered at the relay chain, so there will be a mismatch at execution (basically it will fail with the same issue as your collators currently do).

Call schedule_code_upgrade again

You need to call schedule_code_upgrade again, but this time with the PATCHED_RUNTIME. After this has happened, your parachain and the relay chain should use the same code again to build and validate blocks.

You will not be able to call schedule_code_upgrade right away, as there is currently a "cooldown" that prevents parachains from upgrading every block. Looking at the chain state (paras::upgradeCooldowns):

  ...
  [
    3,334
    16,253,944
  ]

This means that after block 16,253,944 you can call the function again. I also checked the other requirements and there should be nothing else that should prevent the call from succeeding (hopefully).
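
To make the cooldown reading concrete, here is a small sketch that checks a list of (para id, block number) entries like the one shown above. The function name is hypothetical, and treating the upgrade as allowed only strictly after the listed block follows the wording of this post rather than the relay chain implementation.

```python
def upgrade_allowed(cooldowns, para_id, relay_block):
    """Return True if para_id has no cooldown entry or the entry has passed.

    cooldowns: iterable of (para_id, expires_at_block) pairs, as read from
    paras::upgradeCooldowns. relay_block: current relay chain block number.
    """
    for entry_para, expires_at in cooldowns:
        if entry_para == para_id:
            # Assumption: the call is possible only after the listed block.
            return relay_block > expires_at
    # No entry for this parachain means no cooldown is active.
    return True


cooldowns = [(3334, 16253944)]
print(upgrade_allowed(cooldowns, 3334, 16253944))
print(upgrade_allowed(cooldowns, 3334, 16253945))
```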

Aftermath

After all the things above are finished, you will need to upgrade your runtime again to the one that you passed the first time to schedule_code_upgrade (the one that brought us into this situation). But this time you use set_code on your parachain ;)

Evaluation

I had overlooked one thing: you cannot schedule another upgrade while one is already active. So, I proposed the following:

  1. Set up a new genesis state that has PendingValidationCode set: https://github.com/paritytech/cumulus/blob/21919c89ad99fc4e2e6f12dd6679eb2d41892ac3/pallets/parachain-system/src/lib.rs#L402-L406
  2. Use the runtime that you passed to schedule_code_upgrade as the genesis runtime.
  3. Put this same runtime into PendingValidationCode as well.
  4. Call set_current_head with the header of your new genesis.

This worked, and they were able to bring the chain back online.