Executive Summary:

On January 6th, 2024 at block 18947893, an estimated 70% of Besu nodes on Mainnet halted. This included nodes running recent software releases, up to and including 23.10.3. Affected clients stopped processing blocks and updating the Ethereum state until workarounds were communicated to users and a hotfix was published.

What follows is a more detailed outline of the event, how it was mitigated, and the reasoning processes involved.

Preamble:

In late November, during development of new features, the Besu team identified a problem with the Bonsai state storage format, specifically in how Bonsai encoded state changes in “trie logs.” Trie logs are essentially a delta of the Ethereum state from block to block: they record the changed state values prior to block execution (the preimage) and the new values that result from block execution. The defect related to how Besu logged metamorphic contract deployment. In some specific cases, Besu could write an incorrect preimage in the trie log for storage that was self-destructed. This defect, in some form, had been present in the Bonsai code for quite some time. We had been working on a couple of fixes that addressed the root of the problem.
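To make the idea of a trie log concrete, here is a minimal Python sketch of a per-block delta that can be applied in either direction. It is not Besu's actual format or code (Besu is written in Java, and its trie logs cover accounts, code, and storage); the `TrieLog` class, the `roll_forward` and `roll_back` functions, and the flat dictionary standing in for the world state are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the world state: storage slot key -> value.
State = Dict[str, int]


@dataclass
class TrieLog:
    """Illustrative per-block delta: key -> (prior value, updated value).

    None means "slot absent", e.g. before creation or after a self-destruct.
    """
    changes: Dict[str, Tuple[Optional[int], Optional[int]]] = field(default_factory=dict)


def _write(state: State, key: str, value: Optional[int]) -> None:
    # Writing None removes the slot; anything else overwrites it.
    if value is None:
        state.pop(key, None)
    else:
        state[key] = value


def roll_forward(state: State, log: TrieLog) -> None:
    """Move from the parent block's state to this block's state."""
    for key, (_prior, updated) in log.changes.items():
        _write(state, key, updated)


def roll_back(state: State, log: TrieLog) -> None:
    """Move from this block's state back to the parent's, using the preimages."""
    for key, (prior, _updated) in log.changes.items():
        _write(state, key, prior)


state = {"contract.slot0": 7}
log = TrieLog({"contract.slot0": (7, None), "contract.slot1": (None, 42)})
roll_forward(state, log)  # slot0 removed, slot1 created
roll_back(state, log)     # original state restored from the preimages
```

Rolling back depends entirely on the recorded prior values, which is why an incorrect preimage can leave a node unable to reconstruct the state it expects.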

...

Leading up to 23.10.3, we took great care to exercise the fix. We completed a full sync on all testnets and found Besu had no problem with the blocks that had halted it previously. An additional Sepolia node ran a special version of Besu that explicitly rolled the state back and forth on each block to exercise the trie logs and ensure they were created and could be applied correctly. When that testing completed, we were confident we had fixed the issue. On Dec 30 we announced release 23.10.3 on the Hyperledger Discord and strongly recommended that users upgrade to that version. Unfortunately, the GitHub release was left in draft, so some users who monitor GitHub releases did not see the new release. What we didn’t realize was that full sync testing operates directly on the main mutable world state and does not exercise some aspects of block execution that are used in the Engine API. So the bad trie log bug was still lurking even in the latest version.
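The roll-back-and-forth check described above can be sketched roughly as follows. This is a simplified illustration in Python, not the special Besu build that ran on Sepolia: the `apply_delta` and `round_trip_ok` helpers and the dictionary-based state are hypothetical, and the "bad" delta is only an assumed example of what a wrong preimage for self-destructed storage might look like.

```python
from typing import Dict, Optional, Tuple

# key -> (prior value, updated value); None means the slot is absent.
Delta = Dict[str, Tuple[Optional[int], Optional[int]]]


def apply_delta(state: Dict[str, int], delta: Delta, forward: bool) -> Dict[str, int]:
    """Return a copy of state with the delta applied forward or backward."""
    result = dict(state)
    for key, (prior, updated) in delta.items():
        value = updated if forward else prior
        if value is None:
            result.pop(key, None)
        else:
            result[key] = value
    return result


def round_trip_ok(parent: Dict[str, int], child: Dict[str, int], delta: Delta) -> bool:
    """Roll the parent forward and the child back; both must match exactly."""
    return (
        apply_delta(parent, delta, forward=True) == child
        and apply_delta(child, delta, forward=False) == parent
    )


parent = {"slot": 7}
child: Dict[str, int] = {}             # storage was self-destructed in this block
good: Delta = {"slot": (7, None)}      # correct preimage recorded
bad: Delta = {"slot": (None, None)}    # assumed example of an incorrect preimage

print(round_trip_ok(parent, child, good))  # True
print(round_trip_ok(parent, child, bad))   # False
```

Note that the bad delta still rolls forward correctly; only the backward pass exposes it. A check like this catches the problem only on code paths that actually produce and consume the delta being tested.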

The Main Event:

Christmas and New Year’s were quiet for Besu nodes. Nothing in particular stood out as a problem until the morning of Jan 6. A Besu engineer was the first to note that something was amiss when nearly all of our Mainnet nodes halted at block 18947893. A freshly funded account had deployed a contract and sent transactions resulting in a bad trie log that halted most versions of Besu on Mainnet.

...

The fix was good and the auto-heal was good enough, so we decided to release the hotfix as soon as possible, since nothing prevented a new, similar block from halting all of the newly recovered Besu nodes.

After publishing the hotfix, we announced it on Discord and communicated to users through various channels that it was important to upgrade even if they had already recovered their node(s). We continued to run support through Discord for a day or two to help as many users as possible. Things eventually relaxed a bit and gave us time to reflect.

Circumstances:

The timing is suspicious.  The contract deployed on Mainnet did not look like the experiment that was done on Sepolia.  If the Mainnet contract was indeed intentionally malicious, it was well camouflaged to look innocuous, if peculiar.  The contract deployment followed a similar pattern of being funded just before the incident, with no transaction history before or since, and all the contracts involved were self-destructed afterwards.  

If this was an intentional attack, the attacker is certainly clever, but without clear motivation. Perhaps there was some sunk cost after watching our attempts to patch this issue, and they wanted to trigger this bug while they still could, before a release. As a reminder, halting bugs of this severity are eligible for the EF Bug Bounty program. The deployer could have earned thousands of USD as a bounty for this finding. More importantly, their discovery could have been celebrated publicly and earned the goodwill of client teams and the Ethereum ecosystem. The rewards of cooperation are certainly better in this case.

Conclusion:

This experience has highlighted the need for client diversity on Ethereum. Thankfully, because Besu’s share of the network is under the 33% threshold, the halt caused no finality issues for the network, nor did Besu users suffer a significant correlated inactivity leak on Mainnet. Our learnings from this experience underscore the importance of maintaining canaries alongside an archive CL node. Given the complexity of managing state for Bonsai, we understand that it is crucial to implement specific test cases to prevent accidental future regressions. Moving forward, we plan to create more Besu-specific tests for Bonsai and to introduce new Hive Engine API tests for this particular case. In a more light-hearted vein, we intend to celebrate the deprecation of the self-destruct opcode with a bottle of champagne at Dencun!