On pull request workflows for the GNU toolchain

Joseph Myers via Gcc Thu, 19 Sep 2024 08:55:56 -0700

1. Introduction

This message expands on my remarks at the Cauldron (especially the
patch review and maintenance BoF, and the Sourceware infrastructure
BoF) regarding desired features for a system providing pull request
functionality for use by the GNU toolchain.  By pull request
functionality I mean patch submission via creating branches that are
then proposed using some kind of web interface or API, with a central
database then automatically tracking the status of each pull request
and the review comments on it.  Such a system might be used by one or
more components of the toolchain rather than all of it: there is no
need for each component to make the same decision about moving to such
software and workflow, and indeed we have no mechanism to make such
decisions for the toolchain as a whole.


This does not advocate a particular choice of software for such
functionality (though much of the discussion seemed to suggest Forgejo
as the most likely starting point), or a particular choice of where to
host it.  Hosting would of course need to meet appropriate security
requirements, and to achieve a passing grade on the GNU Ethical
Repository Criteria, and the software would need to be entirely free
software.  Where relevant features are not already supported, it's
important that the software is receptive to the addition of such
features (including cases where part of the functionality is provided
by software specific to the GNU toolchain or parts thereof - such as
for the custom checks currently implemented with git hooks - and the
underlying software provides appropriate interfaces to allow
integration of such external pieces).  The list of features here may
be a good basis for reviewing what particular forge software supports
and whether other features can be added, directly or through use of
appropriate APIs.

Forge software may provide other pieces such as bug tracking or wikis
that we currently handle separately from git hosting.  In such cases,
we should be able to disable those pieces and keep using the existing
bug tracking and wiki software (while having the option to decide
independently to migrate those if desired).

I consider the overall benefits of such a move to be twofold.  First,
it provides more structured data about all changes proposed for
inclusion and their status (needing review, needing changes from the
author, under discussion, needing merge from mainline, etc.), which
helps everyone involved in patch submission and review to track such
information and to find patches needing review.  Second, it provides
a workflow that is more familiar to many people and that avoids many
of the problems with email (which affect experienced contributors
working around corporate email systems, not just new contributors).
It would not of course by itself turn people with no interest in or
understanding of systems software development into contributors (for
example, people without knowledge of directories and hierarchical file
storage, or people who only understand software development as web
development).  Nor would it prevent the accumulation of large backlogs
of unreviewed patches, as is evident from many large and active
projects using PR workflows with large numbers of open PRs.

As Richard noted in his BoF, email sucks.  As I noted in reply, so do
the web and web browsers when trying to deal with large amounts of
patch review state (when one wishes to apply one's own view, not the
forge's, of what is resolved and what needs attention).  As I also
noted, in the Sourceware infrastructure BoF, tools such as patchwork
and b4 are the right answer to the wrong question: trying to get
structured data about patch submissions when working from the axiom
that emails on a mailing list should be the primary source of truth
for everything needing review, rather than starting from more
structured data and generating emails as one form of output.

Moving to a pull request system is not expected to change policies
regarding who can approve a change for inclusion, or the technical
limits on who can cause a change to enter mainline (e.g. all people
with commit access would be expected to be able to use a button in a
web interface to cause a PR to be merged, though policy might limit
when they should do so).  We can of course choose to change policies,
either as part of adopting a PR system or later.


2. Key features

(a) Some forges have a design that emphasises the tree you get after a
proposed contribution, but not the sequence of commits to get there.
For the toolchain, we care about the clean, logical sequence of
commits on the mainline branch.  (We also use linear history on
mainline, but that's a secondary matter - though certainly we'd want
any forge used to support such linear history so that property doesn't
need to change as part of adopting pull request workflow.)  Having a
clean sequence of commits has some implications for forge support:

* Support for reviewing the proposed commit message, not just the
  diffs, is important (and it should be clear what commit message
  would result from merging any pull request).

* Patch series and dependencies between patches are important.  In
  such cases, fixes for issues from review should go into the
  appropriate logical commit in the PR (with rebasing as necessary),
  and it should be possible at all times to see what the sequence of
  commits on mainline would look like.  (E.g. people use some
  workarounds on GitHub to manage PR dependencies, but those result in
  second and subsequent PRs in a series showing the full set of diffs
  from a PR and those it depends on, rather than just the logical
  diffs for one PR.)

  I consider patch series and dependencies to be separate but related
  things: a patch series may not have strictly linear dependencies
  (and it's often useful to merge the more straightforward patches
  from a series while still discussing and revising others), while a
  patch may depend on other patches that are not logically part of
  the same series.  They are, however, closely related, and a
  sufficient solution for dependencies might also be adequate for many
  cases of series.

  Note that series can sometimes have hundreds of patches; any
  solution for patch series and dependencies needs to scale that far.

  There is of course the common case of a single-patch submission,
  where the patch is ready for inclusion after some number of fixes.
  In those cases, it's probably convenient if it's not necessary to
  rebase - provided it's clear that a particular PR would be
  squash-merged, and also what the commit message would be for the
  final fixed commit.

* Given the need for rebasing when working with patch series, it's
  important to have good support for rebasing.  In particular, all
  revisions of the changes for a PR that was rebased need to remain
  permanently available (e.g. through appropriate documented refs to
  fetch to get all revisions of all PRs).
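  As a minimal sketch of what "documented refs" could mean in
  practice: assume, purely for illustration, that the forge publishes
  every pushed revision of PR N as refs/pull/N/vK.  That layout is
  hypothetical, not one that any particular forge is known to
  provide, and the function names are mine.  A tool could then group
  the advertised refs by PR to see which revisions exist:

```python
import re
from collections import defaultdict

# Hypothetical ref layout: refs/pull/<PR number>/v<revision number>.
# This is an assumption for illustration, not a real forge's scheme.
PR_REVISION_REF = re.compile(r"^refs/pull/(\d+)/v(\d+)$")


def group_pr_revisions(ref_names):
    """Map each PR number to the sorted list of its revision numbers."""
    revisions = defaultdict(list)
    for name in ref_names:
        match = PR_REVISION_REF.match(name)
        if match:  # ignore branches, tags and anything else
            revisions[int(match.group(1))].append(int(match.group(2)))
    return {pr: sorted(revs) for pr, revs in revisions.items()}


def fetch_refspec():
    """Refspec that would mirror every revision of every PR locally."""
    return "+refs/pull/*:refs/remotes/origin/pull/*"
```

  With such a layout, a single "git fetch origin" using that refspec
  would keep every revision of every PR available locally, including
  revisions superseded by a rebase.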

(b) Various people have built efficient workflows for going through
all patch submissions and comments (or all in a particular area), even
when only reviewing a small proportion, and have concerns about
efficiency of a web interface when working with many patches and
comments.  It's important to have good API support to allow people to
build tools supporting their own workflow like this without needing to
use the browser interface (and following their own logic, not the
forge's, for what changes are of interest).  Good API support might,
for example, include a straightforward way to get all changes to PR
and comment data and metadata since some particular point, as well as
ways to perform actions such as reviewing, commenting on or approving
a PR.  Such API support might be similar to what's needed to ensure
people can readily get and maintain a local replica of all the key
data and its history for all PRs.

Replication like that is also important for reliably ensuring key data
remains available even if the forge software subsequently ceases to be
maintained.  Consider, for example, the apparent disappearance of the
data from www-gnats.gnu.org (we occasionally come across references to
old bug reports from that system in glibc development, but don't have
any way to resolve those references).

Another use of such an API would be to allow maintaining local copies
of all PR comments etc. in a plain text form that can be searched with
grep.
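
The following is a rough sketch of such incremental replication, not
a description of any real forge API.  The Event fields, the
fetch_changes callable (standing in for an API call that returns all
changes after a given sequence number) and the pr-N.txt file naming
are all assumptions made for illustration:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Event:
    """One change to a PR; all fields are hypothetical."""
    seq: int      # monotonically increasing change number
    pr: int       # pull request the change belongs to
    author: str
    kind: str     # e.g. "comment", "approve", "push"
    body: str


def sync(fetch_changes, replica_dir, cursor):
    """Append all events after `cursor` to per-PR text files.

    `fetch_changes(since)` stands in for a forge API call returning
    events with seq > since, in order.  Returns the new cursor.
    """
    replica = Path(replica_dir)
    replica.mkdir(parents=True, exist_ok=True)
    for event in fetch_changes(cursor):
        path = replica / f"pr-{event.pr}.txt"
        with open(path, "a", encoding="utf-8") as out:
            out.write(f"[{event.seq}] {event.kind} by {event.author}\n")
            out.write(event.body.rstrip("\n") + "\n\n")
        cursor = max(cursor, event.seq)
    return cursor
```

Storing the returned cursor between runs means each run only asks for
what changed since the last one, and the resulting text files can be
searched with grep without going near a browser.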

(c) Given that a transition would be from using mailing lists, it's
important to have good plain text outward email notifications for all
new merge requests, comments thereon and other significant actions
(changing metadata, approvals / merges, etc.) - through whatever
combination of built-in support in a forge and local implementation
using the API.  Such notifications would go to the existing mailing
lists (note that the choice of mailing lists for a patch can depend on
which parts of the tree it changes - for example, the choice between
binutils and gdb-patches lists, or some GCC patches going to the
fortran or libstdc++ lists) - as well as to individuals concerned with
/ subscribed to a given PR.  Some very routine changes, such as
reports of clean CI results, might be omitted by default from
notifications sent to the mailing lists.

"Good" here includes quoting the diff hunks being commented on.
It's OK to have an HTML multipart as well, but the quality of the
plain text part matters.  Diffs themselves should be included in
new-PR emails (and those for changes to a PR) unless very large.
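
The path-dependent choice of mailing lists mentioned above could be
implemented locally on top of the API along the following lines.
This is only a sketch: the prefix table and the rule that every
change also goes to a default list are illustrative assumptions, and
the real mapping would be for each project to decide.

```python
# Illustrative rules only: path prefix -> extra list for such changes.
# The real mapping would be decided by each project.
EXTRA_LISTS = {
    "gcc/fortran/": "fortran",
    "libgfortran/": "fortran",
    "libstdc++-v3/": "libstdc++",
}
DEFAULT_LIST = "gcc-patches"


def lists_for_change(changed_paths):
    """Return the sorted list names a PR touching these paths notifies."""
    lists = {DEFAULT_LIST}
    for path in changed_paths:
        for prefix, name in EXTRA_LISTS.items():
            if path.startswith(prefix):
                lists.add(name)
    return sorted(lists)
```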

I do not however suggest trying to take incoming email (unstructured)
and turn it into more structured comments on particular parts of a PR
in the database.

Similarly, commit emails should continue to go to the existing mailing
lists for those.

(d) Rather than the forge owning the mainline branch in the sense of
every commit having to go through an approved PR, at least initially
it should be possible for people to push directly to mainline as well,
so the transition doesn't have to be instantaneous (and in particular,
if a change was posted before the transition and approved after it, it
shouldn't be necessary to turn it into a PR).  Longer term, I think it
would be a good idea to move to everything going through the PR system
(with people able to self-approve and merge their own PRs immediately
where they can commit without review at present), so there is better
structured uniform tracking of all changes and associated CI
information etc. - however, direct commits to branches other than
mainline (and maybe release branches) should continue to be OK long
term (although it should also be possible to make a PR to such a
branch if desired).

Beyond putting everything through PRs, eventually I'd hope to have
merges to mainline normally all go through a CI system that makes sure
there are no regressions for at least one configuration before merging
changes.
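
As a sketch of what "no regressions for at least one configuration"
might mean mechanically: compare the test results for the proposed
merge against a baseline for the same configuration.  The status
strings, the function names and the policy for missing or new tests
are assumptions here, not a description of any existing CI setup.

```python
def find_regressions(baseline, candidate):
    """Return sorted names of tests that passed before but not now.

    Both arguments map test name -> status string.  A test missing
    from `candidate` counts as a regression; failing tests with no
    baseline entry do not (that policy is an assumption).
    """
    return sorted(
        name
        for name, status in baseline.items()
        if status == "PASS" and candidate.get(name) != "PASS"
    )


def may_merge(baseline, candidate):
    """Gate a merge on there being no regressions in this configuration."""
    return not find_regressions(baseline, candidate)
```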

(e) All existing pre-commit checks from hooks should be kept in some
form, to maintain existing invariants on both tree and commit contents
(some hook checks make sure that commits don't have commit messages
that would cause other automated processes to fall over later, for
example).
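
To illustrate the kind of commit message check meant here, the
following sketch rejects messages that simple downstream tools could
plausibly trip over.  The specific rules are illustrative stand-ins
that I have made up for this example; they are not the checks the
existing hooks perform.

```python
def check_commit_message(message):
    """Return a list of problems; an empty list means the message passes.

    The rules are illustrative stand-ins, not the existing hooks' checks.
    """
    problems = []
    lines = message.split("\n")
    if not lines[0].strip():
        problems.append("empty summary line")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line must be blank")
    for number, line in enumerate(lines, start=1):
        # Tabs are allowed; other control characters are not.
        if any(ord(ch) < 32 and ch != "\t" for ch in line):
            problems.append(f"control character on line {number}")
    return problems
```

The same function could be run both by a server-side hook and by the
forge when a PR is opened or updated, so that authors see problems
before anyone tries to merge.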

(f) Existing cron jobs that read from or commit to repositories should
use normal remote git access, not filesystem access to the repository,
including using APIs to create and self-merge PRs when pushing to the
repository is involved.

-- 
Joseph S. Myers
[email protected]
