On April 5, 2023, the Moonbeam Network experienced a brief interruption to block production as an unintended consequence of the approval of Referendum 88. The referendum was approved shortly before a runtime upgrade, but its call was scheduled to execute shortly after the upgrade had been applied. This article provides a detailed post-mortem analysis of the incident, outlining the sequence of events that led to the network halt and the subsequent steps taken to resolve the issue and prevent its recurrence.
Referendum 88, which included a system.remark call, was approved through community governance on block 3276000 and scheduled for execution on block 3291300.
Before Referendum 88 could be executed, the runtime upgrade RT2201 was successfully applied at block 3290853. This new runtime included a low-level change in Substrate that altered the call index of system.remark, so that its previous index now corresponded to system.setHeapPages.
Due to this change, the scheduled system.remark call was unintentionally switched to a system.setHeapPages call. The new call had an invalid value, which prevented collators from producing blocks and ultimately led to the network halt.
The last block before the halt, block 3291299, was produced on April 5, 2023, at 14:43:24 UTC. The subsequent block, block 3291300, could not be produced because it included dispatching the scheduled call with the new, incorrectly configured HEAP_PAGES parameter.
Thanks to the prompt investigation by Moonbeam engineering contributors and Parity, a new client was published and made available to all nodes. This enabled the network to resume producing blocks after a downtime of approximately 4 hours.
Runtime 2201 included a low-level change in Substrate that altered the call index of system.remark, so that its previous index now corresponded to system.setHeapPages. Under normal circumstances, this is not a problem, because a call submitted on the new runtime is already encoded with the new call index.
Referendum 88 included a system.remark call and was initiated on RT2100. On that runtime, the call was assigned a call index of 1. When the referendum was approved, the network automatically scheduled the call to be dispatched at block 3291300. By that block, however, the chain was already running RT2201.
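The remapping can be illustrated with a toy model in which a scheduled call is stored as raw indices and only resolved to a function by whichever runtime dispatches it. This is a minimal sketch, not Substrate's actual SCALE encoding: the pallet index, the helper names, and every call index other than "system.remark = 1 on RT2100" are assumptions made for illustration.

```python
# Toy model of index-based call dispatch (not real SCALE encoding).
# From the post-mortem: system.remark had call index 1 on RT2100, and that
# index resolved to system.setHeapPages on RT2201.
# Assumed for illustration: the pallet index and all other call indices.

SYSTEM_PALLET = 0  # hypothetical pallet index

CALL_TABLES = {
    "RT2100": {
        (SYSTEM_PALLET, 1): "system.remark",
        (SYSTEM_PALLET, 2): "system.setHeapPages",
    },
    "RT2201": {
        (SYSTEM_PALLET, 0): "system.remark",
        (SYSTEM_PALLET, 1): "system.setHeapPages",
    },
}


def encode_call(runtime: str, name: str, args: bytes) -> bytes:
    """Encode a call as [pallet index, call index] followed by raw argument bytes."""
    for (pallet, call), call_name in CALL_TABLES[runtime].items():
        if call_name == name:
            return bytes([pallet, call]) + args
    raise KeyError(f"{name} not found in {runtime}")


def decode_call(runtime: str, raw: bytes):
    """Resolve stored bytes to a call name using the given runtime's table."""
    name = CALL_TABLES[runtime].get((raw[0], raw[1]))
    if name is None:
        raise ValueError(f"no call at indices ({raw[0]}, {raw[1]}) in {runtime}")
    return name, raw[2:]


# The referendum's call is encoded while RT2100 is live...
scheduled = encode_call("RT2100", "system.remark", b"hello")

# ...but the same stored bytes are decoded by whichever runtime is live
# when the scheduler dispatches them.
print(decode_call("RT2100", scheduled))  # ('system.remark', b'hello')
print(decode_call("RT2201", scheduled))  # ('system.setHeapPages', b'hello')
```

The stored bytes never change. Only their interpretation does, which is why the remark payload ended up being read as the argument to a different call.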
When collators tried to produce block 3291300, dispatching the remapped system.setHeapPages call changed a normally non-critical on-chain configuration value (HEAP_PAGES) in a way that left collators unable to produce blocks. Consequently, on April 5, 2023, at 14:43:24 UTC, the network stopped producing blocks.
Runtime upgrades go through several test networks, where they are thoroughly tested before reaching Moonbeam mainnet. The issue was not caused by the runtime upgrade itself, but by a call that was scheduled on one runtime and executed on another, with the call indices changing in between.
The Moonbeam team released a new client, version 0.30.3, to address the issue. The updated client ignores the incorrect HEAP_PAGES value stored on-chain, allowing collators to resume block production.
At 18:55:48 UTC, approximately 4 hours, 12 minutes, and 24 seconds after the initial issue, block production resumed with the creation of block 3291300.
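The downtime figure can be checked directly from the two timestamps given in this report. The snippet below is only a small arithmetic check using Python's standard library; the variable names are illustrative.

```python
from datetime import datetime, timezone

# Timestamps taken from this post-mortem (both on April 5, 2023, UTC).
last_block_before_halt = datetime(2023, 4, 5, 14, 43, 24, tzinfo=timezone.utc)
production_resumed = datetime(2023, 4, 5, 18, 55, 48, tzinfo=timezone.utc)

downtime = production_resumed - last_block_before_halt
total_seconds = int(downtime.total_seconds())

# Break the total down into hours, minutes, and seconds.
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)

print(f"{hours} h {minutes} min {seconds} s")  # 4 h 12 min 24 s
```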
As collators updated to the new client (v0.30.3), the network gradually began producing blocks at a regular cadence. The speedy upgrades by community collators to the new client were hugely important in helping the network return to its normal block production.
The Moonbeam Network halt, caused by the approval of Referendum 88 and the subsequent unintended switch from a system.remark call to a system.setHeapPages call, serves as an essential learning experience for the community.
The swift response of the Moonbeam engineering contributors in releasing a new client that addressed the issue demonstrates the project’s commitment to maintaining a secure and reliable network. The team also received invaluable help from Basti, a member of the Parity team. The incident highlights the importance of thoroughly testing both the runtime upgrades themselves and situation-based on-chain governance scenarios.
A fix has since been merged to prevent such changes in call indices in future runtime releases. Going forward, two main points will be addressed during runtime upgrades:
- A checklist of release conditions, reviewed by all technical teams at least a day before any update to the client or runtime
- Improved testing tools that verify upcoming referenda against new clients and runtimes
Going forward, the Moonbeam team and the community will continue to work together to enhance the network’s resilience and ensure its robust performance.
- Referendum 88 was approved, meaning a system.remark extrinsic was in the scheduler, to be executed on block 3291300
- RT2201 was successfully applied in block 3290853
- The new runtime included an underlying low-level change in Substrate that changed the call index of system.remark, so that its previous index now corresponded to system.setHeapPages. Consequently, the scheduled call was automatically switched from a system.remark to a system.setHeapPages
- The new call (system.setHeapPages) had an invalid value, which prevented collators from producing blocks, halting the network
- The last block before the halt (3291299) was produced on April 5, 2023, at 14:43:24 UTC. The subsequent block (3291300) could not be produced because it included the dispatching of the scheduled call with the wrongly configured parameter (HEAP_PAGES)
- A new client was released (client v0.30.3) to fix the issue. The new client ignores the wrongly set value for HEAP_PAGES, which is stored on-chain, so that collators could continue producing blocks
- At 18:55:48 UTC, 4 hours, 12 minutes and 24 seconds after the block production halt, block 3291300 was produced
- Once collators started updating to the new client (v0.30.3), the network gradually resumed producing blocks at the regular cadence
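The testing-tool improvement mentioned earlier (verifying upcoming referenda against new runtimes) could be approached along the following lines. This is a hypothetical sketch, not the tooling the Moonbeam team built: the function name, the table layout, the pallet index, and all call indices other than "system.remark = 1 on RT2100" are assumptions for illustration.

```python
# Hypothetical pre-upgrade check: flag scheduled calls whose stored indices
# would resolve to a different function under the new runtime.
# Tables map (pallet index, call index) -> call name; values are illustrative.

OLD_RUNTIME = {(0, 1): "system.remark", (0, 2): "system.setHeapPages"}
NEW_RUNTIME = {(0, 0): "system.remark", (0, 1): "system.setHeapPages"}


def find_remapped_calls(pending, old_table, new_table):
    """Return (block, old name, new name) for each pending call whose meaning changes.

    `pending` is a list of (dispatch block, (pallet index, call index)) pairs.
    A new name of None means the indices no longer resolve to any call.
    """
    problems = []
    for block, indices in pending:
        before = old_table.get(indices)
        after = new_table.get(indices)
        if before != after:
            problems.append((block, before, after))
    return problems


# One pending call, shaped like the one in this incident.
pending_calls = [(3291300, (0, 1))]

for block, before, after in find_remapped_calls(pending_calls, OLD_RUNTIME, NEW_RUNTIME):
    print(f"block {block}: {before} would be dispatched as {after}")
```

Running a check like this against the scheduler's pending entries before enacting an upgrade would surface a remapped call while there is still time to cancel or reschedule it.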