Russell Stuart <russell-deb...@stuart.id.au> writes:

> Has it been done? Given this point has been raised several times before
> if it hasn't been done by now I think it's reasonable to assume it's
> difficult, and thinking that it's so is not excessively pessimistic.

Oh, it's news to me that anyone has raised this before. I was assuming no one had bothered to try yet because it wasn't relevant. Intuitively it feels like a much easier problem than reproducible binaries given the nature of a Git repository. The hardest part is probably the same as with tar: how to keep the output reproducible over time as Git changes. I haven't tried it, though.

> I personally wonder how the mirrors are expected to handle .git
> repositories. That would increase the number of files they have to
> handle by a couple of orders of magnitude. What are the plans for that?
> Maybe you think that can handle it? Maybe you plan to abandon the
> mirror network in favour of something else like the CDN? Maybe you plan
> to remove the source from the mirrors?

I was implicitly assuming the actual source format would be some archive of the Git repository rather than the raw Git repository. I agree that distributing raw Git repositories and thus tons of separate files per source package doesn't sound like a good idea. Although that does mean that one can't just point a Git client at the archive, which would have been neat. More on that below.

Agreed that it's worth saying that explicitly, and it might be worth some thought on what the best archive format would be, since tar has proven troublesome for reproducibility.

I gave this some more thought over dinner and realized that my previous message wasn't very constructive. Let me try to make up for that by describing what my goals are. In writing this up, I realized that these goals may not need to be met by the archive. It feels awkward and less than ideal to me to have multiple distribution points for source packages in different formats, but it could be less awkward than the alternatives, I suppose.

My goals (some of which are already met by dgit) are:

1. Every package in Debian has a canonical representation of its source history in Git, with a branch structure that reflects the divergence between different archive suites. This history has at least one commit per upload, although ideally has the package maintainer's full revision history and upstream's full revision history.

2. This representation is readily available in some straightforward way (git clone would be ideal, some equivalently simple tool would be fine).

3. Every uploaded package clearly and unambiguously maps to a signed tag in the Git repository in the appropriate place in the revision history.

4. It's possible to upload a new version of a package to Debian (if one has the relevant permissions) by adding a signed tag and pushing to some Git remote. If that upload is successful (which at least involves permission and sanity checks and ideally involves a test suite), that new upload appears in the canonical Git repository. This should not require rewriting the branch or tag relative to the maintainer's local repository; in other words, it should match the Git tree that the maintainer tagged.

All of these together allow us to interact with the archive the way that is now common to interact with other large Git projects, following any of the standard Git workflows and using Git as the native tool for expressing changes and tagging releases. (At this point, I think it's safe to say that Git has sufficiently won the VCS wars that any future wildly popular VCS will have some mechanism to bidirectionally interact with Git repositories.)

I believe dgit already does 1-3. tag2upload would achieve 4.

In looking this over, none of this precludes the source format 4.0 that Bastian proposed, provided that there was some way to export that source format easily and simply from point 4. Maybe it doesn't matter what's published in the source repository if everyone who wants this workflow uses some other service to interact with the Git repositories instead.

If this were available, I personally would stop using Debian source packages entirely and forget that they even exist, and would only use the above workflow. Source packages then become an internal implementation detail of the archive that no one needs to care about unless they want to, or unless they're maintaining the dgit import service. It feels inelegant to me to have multiple publication mechanisms and multiple canonical formats and the ongoing cost of conversion from one to the other, but maybe that's already a sunk cost and it's worth paying it to avoid having tedious arguments?

That said, Bastian's point about what we should do if we find that the Git repository contains something that isn't distributable is valid and needs to be dealt with regardless. I think one of our points of disagreement is that I don't see how this is a concern specific to the archive; we already have this problem because Salsa is an official project service, so we need to solve this problem for arbitrary Git repositories already.

I realize there are technical reasons why, given the current software implementations, rewriting a Salsa repository is far easier than redacting source packages, and since there's more "stuff" in a Git repository, there are more opportunities for things to go poorly. However, I think it's excessively optimistic to believe that no one will ever accidentally add undistributable work to a maintainer upload of a package in a change that didn't need to go through NEW, at which point we will have this problem with source packages anyway.

> Finally, there are more consumers of the source format than the Debian
> packagers. For example, I regularly download Debian source packages
> just to figure why the hell something isn't working as I expect. When
> I do that, there are two things that are important to me:

> 1. The download is as small as possible, and doesn't require a
>    specialised tool. (Github and gitlab go to the trouble of
>    providing just such as thing, which I think is evidence it's
>    needed.) The current format is pretty good in this area. At
>    a pinch you can get away without using deb-source to unpack it.

I agree this is desirable but disagree that the current format is very good at all. Unpacking the current format in all of its generality requires either rather arcane steps or a specialized tool. I think it's a matter of opinion whether the current 3.0 (quilt) format with all of its complexity is better or worse on this point than a (possibly shallow) Git repository in a tarball. I personally think it's worse, but I can see arguments either way.

You're on somewhat stronger ground with 3.0 (native), which I think meets point 1 quite well, and 1.0, which isn't great but which is somewhat better than 3.0 (quilt) on this specific metric.

> 2. The point that has been raised here - reproducible builds of the
>    source package. By that I mean a reproducible build should be
>    pure function that is given the upstream source package and some
>    data in the form of patches or whatever, and ends up with the
>    source and build instructions. Being a pure function it always
>    produces the same outputs give the same inputs.

I don't agree with this definition of reproducibility. You're defining reproducibility from inputs that I consider build artifacts, which to me is rather weird. The canonical source representation of all of my Debian packages is a packaging Git repository plus, for non-native packages, one or more upstream release artifacts. I define reproducibility as generating the same Debian source package from a signed Git tag of my packaging repository plus, for non-native packages, whatever release artifacts upstream considers canonical (which may be a signed tarball or may be a Git tag or may be something else entirely). All of this business with patches and whatnot is an implementation detail.
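
To make that definition a bit more concrete, here is a minimal Python sketch of the property I have in mind. It is not how dgit, tag2upload, or dpkg-source work. The function names `build_archive` and `digest` are made up for illustration, and the `files` mapping is only a stand-in for the tree at a signed tag plus any upstream release artifacts. The sketch packs that content into a tar stream with normalized metadata (sorted names, fixed mtime, fixed mode and ownership) so that the output depends on the content alone.

```python
import hashlib
import io
import tarfile


def build_archive(files: dict[str, bytes], mtime: int = 0) -> bytes:
    """Pack a path -> content mapping into an uncompressed tar stream.

    Hypothetical helper: `files` stands in for the tree at a signed tag
    plus upstream release artifacts. All metadata that normally varies
    between machines and runs is pinned to fixed values.
    """
    buf = io.BytesIO()
    # USTAR avoids optional pax extended headers for these short names.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):  # fixed member order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime  # fixed timestamp
            info.mode = 0o644  # fixed permissions
            info.uid = info.gid = 0  # fixed numeric ownership
            info.uname = info.gname = ""  # no user or group names
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def digest(archive: bytes) -> str:
    """Return the SHA-256 hex digest of an archive byte string."""
    return hashlib.sha256(archive).hexdigest()


if __name__ == "__main__":
    tree = {"debian/changelog": b"example\n", "src/main.c": b"int main(void) { return 0; }\n"}
    first = digest(build_archive(tree))
    second = digest(build_archive(dict(reversed(list(tree.items())))))
    print(first == second)
```

The point of the sketch is only that the inputs are the canonical source (the tagged tree and upstream's artifacts), not patches or other intermediate build artifacts. Keeping such output stable as tar or Git implementations change over time is the hard part mentioned earlier, and the sketch does not address it.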
-- Russ Allbery (r...@debian.org) <https://www.eyrie.org/~eagle/>