Draft process based on discussions on Discord - edits welcome

Burn-in

Start the pre-release process on Friday (n-5) instead of Wednesday, to allow time for the burn-in process.

All PRs for the release need to be merged by Friday.

Vote

While the burn-in is happening, maintainers and community have 3-4 days (2 business days) to find a reason to object.

Start a thread in besu-release channel

If any one party formally objects then a formal vote is required. At least 24 hours is required for a vote to close, and a scheduled release can be delayed for the vote.

Who can object?

Currently - besu maintainers

besu-release channel is public so any community member can comment

How to object?

in besu-release channel or 'the agreed communication channels'


It’s already the case that if burn-in testing fails, the release can be delayed or skipped.

The final stage of the release is making it public: applying tags, etc., and announcing it.


Technical details to be worked out

Re: branching, tagging - on Friday we update the version in gradle.properties to the release version, briefly, then update to snapshot. We would have to do something about the docker latest tag, but we would not introduce additional processes like tagging. Wednesday "releases" would be limited to publishing a GitHub release and official docker images for the version.


I think ideally we would not publish docker version numbers until the official release date. The latest tag seems to be the highest risk for unintentional upgrades. Having a separate action to publish docker images seems error-prone - I suspect we can find a way to automatically gate that.

Because there are keys needed to move docker tags and publish, GHA seems the best place to put it for "on demand" use. Plus it comes with a who/when audit trail.


GHA - we could implement this via a draft release; I believe there is an event type emitted for that. The application of release tags could be isolated to a workflow that just promotes from draft, using those tags.


12 Comments

  1. The vote after the call to stop is missing an important detail that was in the discussion thread - a "Release Committee" decides if the release is a go/no-go after an objection is raised.  Without that feature ConsenSys employees can still force or stop a release over the objections of every other maintainer in Besu, which is the whole problem that brought us to this discussion.

    An alternative feature would be that votes from a single company are insufficient to change the initial outcome: either to stop a planned release or to cut an out of cycle release.  A majority that includes two different employers of maintainers would be sufficient for a change. Without a decentralized majority the schedule is followed and an out-of-cycle release is not cut.
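A minimal sketch of how the "decentralized majority" rule above could be checked. The `Vote` type, the `change_approved` name, and the choice to measure the majority against all maintainers (rather than votes cast) are assumptions for illustration; the proposal does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    maintainer: str
    employer: str
    supports_change: bool  # True = vote to change the default outcome


def change_approved(votes, total_maintainers):
    """Return True only if a decentralized majority backs the change.

    Assumption: "majority" means more than half of all maintainers.
    The default outcome (follow the schedule, no out-of-cycle release)
    stands whenever this returns False.
    """
    in_favor = [v for v in votes if v.supports_change]
    has_majority = len(in_favor) * 2 > total_maintainers
    # Votes from a single employer are never enough on their own.
    employers = {v.employer for v in in_favor}
    return has_majority and len(employers) >= 2
```

With five maintainers, three votes from one employer would not change the outcome, while three votes spread across two employers would.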

    1. I strongly prefer the second option outlined by Danno here. I think it solves the root problem with minimal overhead, and is incremental on top of our existing process.

      The Burn-In process itself is defined pretty loosely, we should define that a bit more, at least till it is specific enough that we know what it will bring to the go-nogo decision.  ConsenSys has a pretty specific idea of what the burn-in period entails, and I'd like to ensure it is consistent with other members:

      1. a "from scratch" canary of besu paired with a CL (we use teku, but have all combinations represented elsewhere) which syncs to the network from an empty starting point, using X_CHECKPOINT sync.
      2. an "upgrade" canary of besu paired with a CL which operates on data from the prior release.
      3. non-cloud based compute, some sub-optimal hardware. We often test on Raspberry Pis and NUCs at home.

      Each of these burn-ins is manually validated for:

      1. Acceptable peering rates. Usually acquiring peers at a rate faster than 1 per minute up to the default max.
      2. Disk usage grows no worse than it used to. Probably some wiggle room on that.
      3. Block import times remain fast enough to keep up with the chain.
      4. Memory usage is close to previous, with GC collection below a certain threshold.
      5. No chain splits
      6. No chain halts
      7. Overall stability. No crashes.

      Probably some other things I'm missing...
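The validation list above is currently checked manually. As a rough sketch of what an automated version might look like: every field name and threshold below is a placeholder assumption, not an agreed value, and the peering check only makes sense while the node is still below its peer cap.

```python
from dataclasses import dataclass


@dataclass
class BurnInMetrics:
    peers_gained_per_minute: float
    disk_growth_ratio: float         # candidate growth / previous release growth
    avg_block_import_seconds: float
    block_interval_seconds: float    # how often the chain produces a block
    memory_ratio: float              # candidate usage / previous release usage
    gc_time_fraction: float          # share of run time spent in GC
    chain_splits: int
    chain_halts: int
    crashes: int


# Placeholder limits: the "wiggle room" and GC threshold are not yet defined.
DISK_WIGGLE = 1.10
MEMORY_WIGGLE = 1.10
MAX_GC_FRACTION = 0.05


def failed_checks(m):
    """Return the names of the burn-in checks that did not pass."""
    checks = {
        "peering": m.peers_gained_per_minute > 1.0,
        "disk": m.disk_growth_ratio <= DISK_WIGGLE,
        "block_import": m.avg_block_import_seconds < m.block_interval_seconds,
        "memory": m.memory_ratio <= MEMORY_WIGGLE
        and m.gc_time_fraction <= MAX_GC_FRACTION,
        "chain_splits": m.chain_splits == 0,
        "chain_halts": m.chain_halts == 0,
        "crashes": m.crashes == 0,
    }
    return [name for name, ok in checks.items() if not ok]
```

An empty result would mean the canary gives no reason to object; any returned name is a candidate reason for a no-go.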

    2. This seems reasonable.  IMO if there isn't a decentralized majority available due to holidays or timezones, that implies the release coordinator could put out an urgent request on discord and wait for a decentralized majority to be present to act.  

      More succinctly, if a decentralized majority isn't feasible, a delay until one could be convened should be tolerated.

    3. Also currently prefer option 2 over RelCom, but still considering both.

      I am not a fan of reducing all 'Company A maintainers' to a single vote because it assumes an aligned agenda in terms of the release, which isn't always going to be the case. 
      All maintainers should be considerate of all besu users, even if your company only supports one user segment.

      A company-specific consensus can be achieved internally, but I think it's an interesting point whether any maintainer can veto a release, even if their company's maintainers have consensus.
      I think it's the difference between "any one party" meaning "any one company" vs "any one maintainer".

      Assuming any maintainer can veto due to a concern, I'd lean towards resolving that with a simpler maintainer majority or delegation to the current release coordinator(s) (once the concern was addressed) over a 'company majority'.
      Appreciate that's kind of what we had and it went awry but I'd rather we build trust and communicate better than replace that with a new process. 

  2. re:branching, tagging.

    If I understood it correctly, changing the version briefly is to set a placeholder for burn-in/CVEs. This would lead to the possibility of having multiple commits (e.g. when fixing a CVE) with the same "final" version on them. I want to propose another workflow, where we branch a frozen release from main (it could have been named release candidate, but I don't know if we want to keep that name for something else) and then use this branch for burn-in/CVE fixes. When the burn-in time has passed, or when the RC decides that a CVE fix version is ready to deliver, the version is changed to final and tagged. The last step for this branch is to be merged with main (this would lead to a conflict in the version number at least, which must be resolved in favor of the latest version). Here's a diagram with an example of how the git graph would look, where each node represents the version for that commit:

             ▲
             │
      22.10.4-SNAPSHOT ◄──────────┐
             ▲                    │
             │                    │
             │                 22.10.3 (tag)
             │                    ▲
             │                    │
             │             22.10.3-FREEZE (CVE)
             │                    ▲
             │                    │
      22.10.4-SNAPSHOT     22.10.3-FREEZE
             ▲                    ▲
             │                    │
      22.10.3-SNAPSHOT ───────────┘
             ▲
             │

    Another invariant from now on would be that the main branch will be considered unstable for all its commits. What do you think?
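The version transitions in the diagram above can be sketched as small string helpers. The function names are hypothetical, and the sketch assumes versions always look like `major.minor.patch` with an optional `-SNAPSHOT` or `-FREEZE` suffix, as in the example.

```python
def _split(version):
    """Split '22.10.3-SNAPSHOT' into ('22.10.3', 'SNAPSHOT')."""
    base, _, suffix = version.partition("-")
    return base, suffix


def freeze(version):
    """Branch point: main's snapshot becomes the frozen release version."""
    base, suffix = _split(version)
    if suffix != "SNAPSHOT":
        raise ValueError(f"can only freeze a snapshot, got {version}")
    return f"{base}-FREEZE"


def finalize(version):
    """End of burn-in: the frozen version becomes the final, tagged version."""
    base, suffix = _split(version)
    if suffix != "FREEZE":
        raise ValueError(f"can only finalize a frozen version, got {version}")
    return base


def next_snapshot(version):
    """Main moves on to the next patch snapshot right after the branch."""
    base, suffix = _split(version)
    if suffix != "SNAPSHOT":
        raise ValueError(f"expected a snapshot, got {version}")
    major, minor, patch = base.split(".")
    return f"{major}.{minor}.{int(patch) + 1}-SNAPSHOT"


def resolve_merge(main_version, branch_version):
    """Merge back to main: keep whichever side has the later version number."""
    def key(v):
        return tuple(int(part) for part in _split(v)[0].split("."))
    return max(main_version, branch_version, key=key)
```

CVE fixes on the branch keep the `-FREEZE` version, so only one commit ever carries the final version number.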

    1. I don't think we need to change our perspective of main always being releasable just because we want additional burn in time to try to disprove that, but I like the branching approach.  It is more overhead, and could require a merge back into main, but it leads to an isolated build/test artifact with minimal disruption to the ability to merge to main. 

      In order to not end up with a lot of stale branches and exceptions necessary to create them on github, I would tweak your proposal slightly to use a common release branch, like the one that currently exists for the 22.10 quarterly series, `release-22.10.x`.

      I believe Danno Ferrin has a strong opinion about tagging specifically, since tags are mutable.  If we cut a release from a branch rather than main, that release sha would not be part of the main trunk.  I am not sure if that is a sticking point or not.


      1. My concern with tagging instead of code checkin versions is that I want a tarball of the repo, like GitHub can return, to build the same as a git-hosted extraction with the same CLI calls.  Building off a stub is fine for me.

  3. I propose we allow the cut-off to be Friday morning Australia time, which is Thursday for everyone else. This will mean we can rotate duties between all timezones without having to cut a release on Australian Friday evening.

    Maybe some time in the range Thursday 10pm UTC - Friday 10pm UTC:
    e.g.
    https://everytimezone.com/s/1019c64a (Aus Friday morning, US afternoon)
    https://everytimezone.com/s/b2a4f70d (2pm PST)

    1. Yes, it might be better to have an understood cut-off time at a reasonable hour in our earliest time zone Friday.  

      1. Makes sense to have a single cut-off time so it's well-known and doesn't change, e.g. Thursday 10pm UTC.
        Then we can probably flex when the actual cut is made to suit the release coordinators' timezones...probably not a big deal to include commits that have made it to main between the official cut-off time and actual cut.

    2. Will this be a hard "Must be on Main" cutoff or more of a "Release manager must know about it and it needs to be basically done, sans checkin pipeline and no-change reviews"?

      1. I think we can be pretty flexible with what goes in, I think it's more of an expectation/target for committers to aim for, i.e. if you haven't merged to main by Thursday 10pm UTC then there's no guarantee your commit will go in, but it may do if the release manager starts the cut after you get it in.

        IOW we always cut from latest main, the cut time is flexible, but the earliest it will be is Thu 10pm UTC.
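A small sketch of the cut-off rule as discussed in this thread, assuming the suggested Thursday 10pm UTC time is adopted. The function names and status strings are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def next_cutoff(now):
    """Return the next Thursday 22:00 UTC at or after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    days_ahead = (3 - now.weekday()) % 7  # Monday=0, so Thursday=3
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=22, minute=0, second=0, microsecond=0
    )
    if candidate < now:
        # Already past this week's cut-off, so use next week's.
        candidate += timedelta(days=7)
    return candidate


def merge_status(merged_at, cutoff, cut_at):
    """Classify a commit by when it reached main.

    `cutoff` is the published cut-off; `cut_at` is when the release
    manager actually starts the cut, which is never before `cutoff`.
    """
    if cut_at < cutoff:
        raise ValueError("the cut cannot start before the cut-off")
    if merged_at <= cutoff:
        return "guaranteed"
    if merged_at <= cut_at:
        return "included, not guaranteed"
    return "next release"
```

The middle status covers commits that land between the official cut-off and the actual cut: they go in because the cut is taken from latest main, but committers should not rely on it.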