On Sat, 30 Mar 2024 at 14:16:21 +0100, Guillem Jover wrote: > in my mind this incident reinforces my view that precisely storing > more upstream stuff in git is the opposite of what we'd want, and > makes reviewing even harder, given that in our context we are on a > permanent fork against upstream, and if you include merge commits and > similar, there's lots of places to hide stuff. In contrast storing > only the packaging bits (debian/ dir alone) like pretty much every > other downstream is doing with their packaging bits, makes for an > obviously more manageable thing to review and not get drown into, > more so if we have to consider that next time perhaps the long-game > gets played within Debian.
I'd like to push back against this, because I'm not convinced by this reasoning, and I'd like to provide another point of view to consider. I find that having the upstream source code in git (in the same form that we use for the .orig.tar.*, so including Autotools noise, etc. if present, but excluding any files that we exclude by repacking) is an extremely useful tool, because it lets me trace the history of all of the files that we are treating as source - whether hand-written or autogenerated - if I want to do that. If we are concerned about defending against actively malicious upstreams like the recent xz releases, then that's already a difficult task and one where it's probably unrealistic to expect a high success rate, but I think we are certainly not going to be able to achieve it if we reject tools like git that could make it easier. Am I correct to say that you are assuming here that we have a way to verify the upstream source code out-of-band (therefore catching the xz backdoor is out-of-scope here), and what you are aiming to detect here is malicious changes that exist inside the Debian delta, more precisely the dpkg-source 1.0 .diff.gz or 3.0 (quilt) .debian.tar.*? If that's your threat model, then I don't think any of the modes that dgit can cope with are actually noticeably more difficult than a debian/-only git repo. As my example of a project that applies patches, I'm going to use bubblewrap, which is a small project and has a long-standing patch that changes an error message in bubblewrap.c to point to Debian-specific documentation; this makes it convenient to tell at a glance whether bubblewrap.c is the upstream version or the Debian version. There are basically three dgit-compatible workflows, with some minor adjustments around handling of .gitignore files: - "patches applied" (git-debrebase, etc.): This is the workflow that proponents of dgit sometimes recommend, and dgit uses it as its canonicalized internal representation of the package. The git tree is the same as `dpkg-source -x`, with upstream source code included, debian/ also included, and any Debian delta to the upstream source pre-applied to those source files. In the case of bubblewrap, if we used this workflow, after you clone the project, bubblewrap.c would already have the Debian-specific error message. (dgit --split-view=never or dgit --quilt=dpm) - "patches unapplied" (gbp pq, quilt, etc.): This is the workflow that many of the big teams use (at least Perl, Python, GNOME and systemd), and is the one that bubblewrap really uses. The git tree is the same as `dpkg-source -x --skip-patches`, with upstream source code included, and debian/ also included. Any Debian delta to the upstream source is represented in debian/patches but is *not* pre-applied to the source files: for example, in the case of bubblewrap, after you clone https://salsa.debian.org/debian/bubblewrap.git and view bubblewrap.c, it still has the upstream error message, not the Debian-specific one. (dgit --quilt=gbp or dgit --quilt=unapplied; I use the latter) - debian/ only: This is what you're advocating above. The git tree contians only debian/. If there is Debian delta to the upstream source, it is in debian/patches/ as usual. (dgit --quilt=baredebian* family) In the "patches applied" workflow, the Debian delta is something like `git diff upstream/VERSION..debian/latest`, where upstream/VERSION must match the .orig.tar.* and debian/latest is the packaging you are reviewing. Not every tree is a valid one, because if you are using 3.0 (quilt), then there is redundancy between the upstream source code and what's in debian/patches: it is an error if the result of reverting all the patches does not match the upstream source in the .orig.tar.*, modulo possibly some accommodation for changes to **/.gitignore being accepted and ignored. To detect malicious Debian changes in 3.0 (quilt) format, you would want to either check for that error, or review both the direct diff and the patches. Checking for that error is something that can be (and is) automated: I don't use this workflow myself, but as far as I'm aware, dgit will check that invariant, and it will fail to build your source package if the invariant doesn't hold. dpkg-source in 3.0 (quilt) format will also make your source package fail to build if the desired invariant isn't true, except where intentionally ignored (ignoring deleted files, ignoring chmod +x, etc.). In the "patches unapplied" workflow, the Debian delta is exactly the contents of debian/, including debian/patches. There is redundancy between the upstream source code and the Debianized branch: it is an error for `git diff upstream/VERSION..debian/latest` to show any changes outside debian/, except for possibly **/.gitignore. To detect malicious Debian changes, you would want to check for that error. Again, checking for that error is something that can be (and is) automated: I use this workflow myself (e.g. in bubblewrap), so I know from experience that dgit *does* check for that error, and will fail to build the source package if the invariant does not hold. Again, dpkg-source in 3.0 (quilt) format will also make your source package fail to build if that error exists, except in the cases that it intentionally ignores. In the "debian/ only" workflow, the Debian delta is exactly the contents of debian/. There is no redundancy, so every tree is in some sense a valid one (although of course sometimes patches will fail to apply, or whatever). In summary, my claim is that having the upstream source code in git is a good thing because it makes detecting malicious upstreams less difficult than in a debian/-only workflow, while not making detecting malicious packagers noticeably harder: the only extra thing that is necessary is to carry out a check that can be done automatically, to assert that the tree is internally-consistent for whichever workflow is in use. smcv