Context
Its dawned on us that Besu now represents upward of ~$5B on Mainnet in stake !! Great job for everyone https://supermajority.info/
This is a big responsibility, not just on Mainnet, but other chains that use Besu too. We are representing more and more value and reputation for a ton of organizations. The Consensys team is thinking on how we may be more careful in the release process (specifically regressions). As we approach the Cancun fork, is it OK to delay less-tested PRs on Main to avoid releasing any issues (especially in 24.1.2)? The problem is, users may not be able to safely downgrade after the fork without a manual intervention / new release. Is anything going into this next release that may be less tested? or a default that should be hidden behind a feature flag? Additionally - I wanted to crowd source ideation about regressions. We have had a handful of interventions on releases recently. Instead of just entirely slowing down releases, any ideas from maintainers? The Consensys team is scheduling a sprint to look at testing gaps and improvements that can be made in the codebase. If anyone would like to help in this, let me know! Thanks all!
Ideation
- integrate Hive testing into CI/CD pipeline.
- We should measure code coverage and make it a condition for merging a PR
- benchmark metrics - sync time, block import time etc and test PRs for regression
- some of these metrics (sync time, peering) may require long-running tests so may not be practical to run on individual PRs - could be on a nightly cadence
- Ensure test coverage for key features eg bonsai
- tests that run on mainnet config aren't testing upcoming hard fork features - to avoid finding bugs at the last minute, change this to holesky/sepolia (or make it configurable?)
- burn-in on existing (synced) nodes vs syncing from scratch
- backward sync is triggered when you upgrade
- it would be awesome to make the (Consensys internal) burn-in process less manual.
- don't want to make the release process more painful
What will be the outcomes of this epic?
Phase 1 (Analysis)
Aiming to complete this by end of March.
List of recommendations / Prioritized list of gaps to be addressed
- An Epic/workstream is created with actionable item based on the findings from the brainstorm
- Would be nice to have an idea of the relative amount of work involved so we can compare cost to benefit.
Phase 2 (Solutions)
Start solving the problems identified. May need different resources at this point.
What is the problem we’re trying to solve?
Why has this come up now?
More money is at stake
Besu profile is rising
Besu gaining market share
Greater reputational risk
What would be catastrophic?
- Consensus issue
- Sync issue
- Peering issue (eg Besu nodes lose 100% of peers and drop off the network)
- Chain fork
- Node database corruption or significant downtime (quite impactful at scale)
- Missing or empty block production
- Anything with customer funds or keys (includes Web3signer)
Need to verify that our bug prioritization matrix aligns with this.
Bad (but not catastrophic)
Anything here could escalate to catastrophic if the scale was big enough.
- Missed attestations
- Regression in block processing time
- Regression in sync times or performance
- Bad UX (nodes are a pain to manage)
- Substantial drop in performance
- Block export/import broken
Goals
- No P1 bugs released
- No P2 bugs released
- One P2 bug per quarter is tolerable (but our goal should be zero)
- We have P2 bugs identified, in the backlog - how do we make sure we don’t release them
- Ditto bugs that haven’t been prioritized (unknown - maybe they are P2 or even P1)
- Robust processes
- we know that PRs have sufficient automated tests and/or require evidence of manual tests where this makes sense
- Don't want to make the release process more painful
- But could make the release process more robust/comprehensive if doing so doesn’t add hassle/pain to the process?
- Ensure we’re spending effort where it makes sense
Opportunities in our (current) process
- Bug prioritization
- Triage matrix - review
- PRs
- Feature toggles - new code paths are disabled by default?
- Note this requires some extra effort to leverage DI (Dagger)
- Feature toggles - new code paths are disabled by default?
- Merge to main
- Nightly
- Nightly build could publish to a "nightly" or "develop" version so we can point nightly canaries to it instead of relying on remembering to update them. https://github.com/hyperledger/besu/issues/6426
- Confirm that automation around deployment actually deploys a new build each night, should be confirmed by self-reporting the git sha at startup. https://github.com/hyperledger/besu/issues/6637
- Canaries updated with nightly build
- Pre-release
- Enterprise-oriented testing - Incorporating Matt Whitehead's suggestions
- https://wiki.hyperledger.org/display/BESU/Testing
- And ensuring enterprise PRs don't leak bugs into main
- Maybe use PEEPS https://github.com/Consensys/PEEPS as a reference but probably makes sense to start with a new repo - might be easier to start from Besu AT framework than the PEEPS one. The PEEPS framework runs everything in Docker so could be slow and the DSL is quite cumbersome to use as everything is very generic so it works with both Quorum, Besu and EthSigner.
- Bring this on board later? Part of burn-in or a separate ad-hoc process?
- A big part of the release process is regression testing - but it’s very manual
- burn-in on existing (synced) nodes vs syncing from scratch
- it would be awesome to make the (Consensys internal) burn-in process less manual.
- Post-release
- Retrospective to fold new learnings back into our processes and automation.
Gaps
Tickets in the epic
See Testing Gaps Epic for the stack ranked issues. (For those without zenhub licences, the epic is here in Github but it won't show all the detail.)
- Hive testing - can we integrate Hive testing into CI/CD pipeline? Or at least an intermediate step (less effort) that will enable us to easily view new failures
- Smoke test: test besu starts up (with default options?)
- Context: our tests didn’t pick up that the data directory wasn’t being created automatically
- https://github.com/hyperledger/besu/issues/6610
- Simon: Downgrade test - would be good to have a regression test for each release to check we can downgrade to previous version(s).
- Automate burn-in
- Not much in the way of “initial sync” tests at the IT/AT level. Should review what hive provides for this. https://github.com/hyperledger/besu/issues/6652
- Decouple ATs from mainnet.json
- most acceptance tests use mainnet.json, so bugs on adding a new fork (eg cancun) may not get covered until much later in the SDLC.
- to avoid finding bugs at the last minute, change this to holesky/sepolia (or make it configurable?)
- https://github.com/hyperledger/besu/issues/6653
- Fuzz testing https://github.com/hyperledger/besu/issues/6655
- Fuzz testing on critical areas e.g. Bonsai trielogs - something like tx-fuzz
- Code coverage Aggregation https://github.com/hyperledger/besu/issues/6650
- We should measure code coverage
- and consider make it a condition for merging a PR?
- Snap gaps!
- Besu serving besu https://github.com/hyperledger/besu/issues/6656
- Current ATs test (previous default) forest storage format - need to update to bonsai
- https://github.com/hyperledger/besu/issues/6583
- reference tests have been updated already
- Can we fail faster with local and CI builds - improve devx and reduce context switching
- https://github.com/hyperledger/besu/issues/6740
- Flaky tests are a major barrier here. People will run more tests locally when they are more reliable, and don’t require multiple runs for confidence.
- Does GHA have metrics for flaky tests / see it manually
- Integration tests - are very out of date
- Update or remove - https://github.com/hyperledger/besu/issues/6739
Covered elsewhere
- Is there any coverage of different solidity versions?
- No there isn’t.
- This would be a very narrow gap - besu EVM is thoroughly tested by reference tests. Anything falling into this gap would likely be a bug in solidity, and caught elsewhere.
- Something like Pact contract tests for engine API for various CLs?? Again, might be covered by Hive.
- Think this is covered by Hive tests
- Devwatch / operational oversight / dogfooding
- See DWIP separate effort for improving devwatch itself
- Is observability a quality issue?
- QA / devops / holesky/sepolia validators
1 Comment
Gabriel Camargo Fukushima
It would be great to have a test suite to detect performance regression of critical features. I think at the moment some features rely on extracting and comparing these metrics "manually". E.g A new PR needs to go through this suite to ensure it hasn't degraded the block import performance.