Archiving all package sources

2022-11-13 Thread Jelle van der Waa

Hey all,

For packaging we currently rely on external parties to keep the source 
code hosted, which can be a problem, and some packagers want to search 
through the code of all our packages. [1]


Currently we already archive sources using `sourceballs` on 
repos.archlinux.org for GPL-licensed packages. This is limited to a 
subset of all packages and done after the fact (a timer, part of 
dbscripts, runs every 8 hours). sourceballs calls `makepkg --nocolor 
--allsource --ignorearch --skippgpcheck`. This can be a problem, as it 
runs after the package has been committed and may hit network issues 
specific to the server (i.e. the source cannot be downloaded from where 
the server is hosted).


To make this more robust, when committing a package using communitypkg 
or equivalent we would also rsync the sources to a location on 
repos.archlinux.org (Gemini). This means the sources are consistent, and 
it opens up the ability to implement a fallback, or to change devtools 
to look at our sources archive when building a package. That would 
benefit reproducible builds and automated rebuilds as well.


Searching through our source code would be a nice next step; most 
solutions such as sourcegraph/hound require a Git repository. [3] [4]
So maybe we can hack up a repository which just git-adds all directories 
and keeps a single git commit? That should probably not be too wasteful. 
But the first proposal is to archive all our code in a way that can be 
consumed by a search solution.
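The single-commit idea could look roughly like the sketch below. This is only an illustration: the `snapshot_sources` helper, both path arguments, and the commit message are hypothetical, and a real implementation would want `rsync -a --delete` instead of `cp` so removed files also disappear from the work tree.

```shell
#!/bin/sh
# Sketch: maintain a git repository that always contains exactly one
# commit with the current state of all extracted package sources, so a
# search indexer (e.g. hound) can consume it without history growing.
snapshot_sources() {
    src=$1    # directory tree with the extracted sources (hypothetical)
    repo=$2   # the single-commit search repository (hypothetical)

    mkdir -p "$repo"
    git -C "$repo" rev-parse --is-inside-work-tree >/dev/null 2>&1 \
        || git -C "$repo" init -q

    # Mirror the sources into the work tree and stage everything.
    # (A real version would use rsync --delete to drop removed files.)
    cp -R "$src"/. "$repo"/
    git -C "$repo" add -A

    # Amend the existing commit (or create the first one) so the
    # repository never accumulates history.
    if git -C "$repo" rev-parse -q --verify HEAD >/dev/null; then
        git -C "$repo" -c user.name=snapshot \
            -c user.email=snapshot@localhost commit -q --amend \
            -m "sources snapshot"
    else
        git -C "$repo" -c user.name=snapshot \
            -c user.email=snapshot@localhost commit -q \
            -m "sources snapshot"
    fi
}
```

Re-running it after sources change rewrites the single commit in place, so the repository stays roughly the size of one checkout.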


Questions:

* How do we deal with archiving patches, PKGBUILDs etc. for GPL 
compliance (just save them next to the code)?
* How do we determine when sources can be removed / cleaned up (we 
can't store things forever)? dbscripts hooks?

* Do we have enough disk space for archiving?

$ du -hs /srv/ftp/sources/
111G    /srv/ftp/sources/

[jelle@gemini ~]$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb2        19T  6.3T   12T  35% /

[1] https://gitlab.archlinux.org/archlinux/ideas/-/issues/2
[2] https://sources.archlinux.org/sources/
[3] https://github.com/sourcegraph/sourcegraph
[4] https://github.com/hound-search/hound/





Re: Archiving all package sources

2022-11-13 Thread Evangelos Foutras
On Sun, 13 Nov 2022 at 18:43, Jelle van der Waa wrote:
>
> * Do we have enough disk space for archiving?

You have to take gemini's daily backups into consideration as well.
They already take up a lot of backup space [1] and several hours to
finish every day. I can see gemini falling over if we start processing
and storing a lot more sources on it.

# journalctl -o cat --since -5d --grep Duration -u borg-backup
Duration: 4 hours 55 minutes 12.55 seconds
Duration: 4 hours 21 minutes 40.80 seconds
Duration: 4 hours 34 minutes 10.01 seconds
Duration: 4 hours 41 minutes 27.67 seconds
Duration: 5 hours 24 minutes 44.55 seconds

# journalctl -o cat --since -5d --grep Duration -u borg-backup-offsite
Duration: 4 hours 9 minutes 2.50 seconds
Duration: 4 hours 24 minutes 19.42 seconds
Duration: 4 hours 40 minutes 17.96 seconds
Duration: 5 hours 33 minutes 47.30 seconds
Duration: 8 hours 2 minutes 20.43 seconds

tl;dr: Best to wait until the daily backup duration drops by a lot,
perhaps after more efficient package archiving is implemented (as part
of repod?). The current hardlink-based archive is very disk-I/O heavy
(it consists of over 61 million inodes).

[1] borg repo size grew from 4.69 TiB on May 13th to 5.86 TiB
currently; trimming the archive once a year helps a bit but the upward
trend persists


Re: Archiving all package sources

2022-11-13 Thread David Runge
On 2022-11-13 17:42:27 (+0100), Jelle van der Waa wrote:
> For packaging we now rely on external parties to keep the source code
> hosted which can be a problem and some packagers want to search
> through all our packages code. [1]
> 
> Currently we already archive sources using `sourceballs` on
> repos.archlinux.org for GPL licensed packages, this is limited to a
> subset of all packages and done after the fact (A timer which runs
> every 8 hours and part of dbscripts). sourceballs calls `makepkg
> --nocolor --allsource --ignorearch --skippgpcheck`. This can be a
> problem as it runs after the package has been committed and it other
> network issues which might occur specific to the server. (ie. source
> cannot be downloaded where server is hosted)

I believe it would be good if the build tooling would take care of this
instead and release the source tarballs to the repository management
software (alongside the packages).

> To make this more robust, when committing a package using communitypkg
> or equivalent we also rsync the sources to a location on
> repos.archlinux.org (Gemini). This means the sources are consistent,
> and this opens the ability to implement a fallback or to change
> devtools to look at our sources archive when building a package. That
> would benefit reproducible builds as well and automated rebuilds.
> 
> Searching through our source code would be a next nice to have, most
> solutions such as sourcegraph/hound require a Git repository. [3] [4]
> So maybe we can hack up a repository which just git adds all directories and
> keeps one git commit? That should probably be not too much of a waste. But
> the first proposal is to first archive all our code in a way it can be
> consumed by a search solution.

If I understand this correctly, would you want to add the sources
(upstream and our additions for the build) of each package to one
repository, or each to their own?

The creation of e.g. a git repository to store the (upstream and maybe
our) sources of a package is something I would also see on the side of
the tooling that creates packages and uploads artifacts to $place for
releasing.
As the upstream tarballs contained in the source tarball that makepkg
creates are (hopefully) versioned, if we think of adding their contents
to a git repository we need to come up with a clever solution for how
to deal with the changes over time.
But I'm not 100% sure I understood the idea for the creation of the
repository yet.

> Questions:
> 
> * How do we deal with archiving patches, PKGBUILD's etc. for GPL compliance
> (just save it next to the code?)
> * How do we determine when sources can be removed / cleaned up (we can't
> store things forever). DBscripts hooks?
> * Do we have enough disk space for archiving?

An additional question I would like to add to your set: what do we do
with e.g. binary-only upstreams (we have a few), for which we would not
want to create source repos, or from which we would want to exclude the
binary blobs?


As a sidenote:
For repod I have just implemented the first basic (configurable)
archiving functionality for successfully added packages:
https://gitlab.archlinux.org/archlinux/repod/-/merge_requests/137

This does not yet extend to source tarballs, as they are not created by
repod (source tarballs are currently still a bit of a backburner
topic), and IMHO they should not be created by it in the future either,
but rather by the tooling that builds and pushes the artifacts into it.
FWIW, this initial functionality also does not yet concern itself with
any cleanup scenario for the archived files, but with being (in
structure) compatible with dbscripts.

When looking at (in the future) decoupling the building of source
tarballs from the software maintaining the package and source artifacts
(repod in that case), this still leaves us with a scenario in which we
need to deal with cleanup of the archive directories (e.g. upload to
the Internet Archive for long-term storage).

I see some overlap between repod's goals and the questions you are
bringing forward, and it would be great if we could sync up on that
during the next repod meeting if you have time.

Best,
David

-- 
https://sleepmap.de




Re: Archiving all package sources

2022-11-13 Thread Levente Polyak

On 11/13/22 19:37, David Runge wrote:

> On 2022-11-13 17:42:27 (+0100), Jelle van der Waa wrote:

>> For packaging we now rely on external parties to keep the source code
>> hosted which can be a problem and some packagers want to search
>> through all our packages code. [1]
>>
>> Currently we already archive sources using `sourceballs` on
>> repos.archlinux.org for GPL licensed packages, this is limited to a
>> subset of all packages and done after the fact (A timer which runs
>> every 8 hours and part of dbscripts). sourceballs calls `makepkg
>> --nocolor --allsource --ignorearch --skippgpcheck`. This can be a
>> problem as it runs after the package has been committed and it other
>> network issues which might occur specific to the server. (ie. source
>> cannot be downloaded where server is hosted)


> I believe it would be good if the build tooling would take care of this
> instead and release the source tarballs to the repository management
> software (alongside the packages).



Answer merged together into next section.


>> To make this more robust, when committing a package using communitypkg
>> or equivalent we also rsync the sources to a location on
>> repos.archlinux.org (Gemini). This means the sources are consistent,
>> and this opens the ability to implement a fallback or to change
>> devtools to look at our sources archive when building a package. That
>> would benefit reproducible builds as well and automated rebuilds.
>>
>> Searching through our source code would be a next nice to have, most
>> solutions such as sourcegraph/hound require a Git repository. [3] [4]
>> So maybe we can hack up a repository which just git adds all directories and
>> keeps one git commit? That should probably be not too much of a waste. But
>> the first proposal is to first archive all our code in a way it can be
>> consumed by a search solution.


> If I understand this correctly, you would want to add the sources
> (upstream and our additions for the build) of each package to one
> repository, or each to their own?
>
> The creation of e.g. a git repository to store the (upstream and maybe
> our) sources of a package I would also see on the side of the tooling
> creating packages and uploading artifacts to $place for releasing.
> As the upstream tarballs contained in the source tarball that makepkg
> creates are (hopefully) versioned and if we think of adding their
> contents to a git repository, we need to come up with a clever solution
> on how to deal with the changes over time.




This all sounds nice and easy at first glance, but in the
end it is a huge can of worms, and we need to be aware
of the implications:

If we tied this directly to the package build tooling,
packagers packaging locally would have to upload
gigabytes of sources alongside the build artifacts. This
includes whole git repositories or huge monolithic
tarballs like chromium's (1.6 GB).

If we go this route, we would make it very hard to
package anything with bigger sources locally (where
downloading is much less of an issue than uploading).
This route is more something that may be feasible in a
future where we have migrated to fully remote building,
e.g. with buildbot.

If we want to have this rather short term, I'd recommend
we dig into how much of an issue it really would be to
use decoupled source archiving like we do for GPL
sources. Of course we would have a window during which
we might not be able to grab the sources after 8h, but
I'd argue that would justify raising an alert to the
package maintainer and having retry mechanisms.


It's very good that Jelle is raising this question and the potential
issues with decoupled source archiving. But it feels a bit like we
would obstruct ourselves from moving forward by trying to solve a
(hopefully rather rare) issue of not being able to grab upstream
sources after ~8 hours.

My recommendation would be:
- try getting the decoupled way solved, including the
  storage and backup problems foutrelis pointed out
- implement alerting if we fail to fetch sources;
  it should happen rarely and is something a maintainer
  should look at
- make use of that to feed into a source indexer
  so we can already leverage the advantages
- once we reach a future where we have robots taking
  over 🤖 with more build automation, investigate
  migrating the source archiving into the actual
  build process

Cheers,
Levente




Re: Archiving all package sources

2022-11-13 Thread Jelle van der Waa

On 13/11/2022 19:37, David Runge wrote:

> On 2022-11-13 17:42:27 (+0100), Jelle van der Waa wrote:

>> For packaging we now rely on external parties to keep the source code
>> hosted which can be a problem and some packagers want to search
>> through all our packages code. [1]
>>
>> Currently we already archive sources using `sourceballs` on
>> repos.archlinux.org for GPL licensed packages, this is limited to a
>> subset of all packages and done after the fact (A timer which runs
>> every 8 hours and part of dbscripts). sourceballs calls `makepkg
>> --nocolor --allsource --ignorearch --skippgpcheck`. This can be a
>> problem as it runs after the package has been committed and it other
>> network issues which might occur specific to the server. (ie. source
>> cannot be downloaded where server is hosted)


> I believe it would be good if the build tooling would take care of this
> instead and release the source tarballs to the repository management
> software (alongside the packages).


No strong opinion here, but when would it upload then? commitpkg seems 
the most logical entry point, as it has access to the source code and 
PKGBUILD.

>> To make this more robust, when committing a package using communitypkg
>> or equivalent we also rsync the sources to a location on
>> repos.archlinux.org (Gemini). This means the sources are consistent,
>> and this opens the ability to implement a fallback or to change
>> devtools to look at our sources archive when building a package. That
>> would benefit reproducible builds as well and automated rebuilds.
>>
>> Searching through our source code would be a next nice to have, most
>> solutions such as sourcegraph/hound require a Git repository. [3] [4]
>> So maybe we can hack up a repository which just git adds all directories and
>> keeps one git commit? That should probably be not too much of a waste. But
>> the first proposal is to first archive all our code in a way it can be
>> consumed by a search solution.


> If I understand this correctly, you would want to add the sources
> (upstream and our additions for the build) of each package to one
> repository, or each to their own?


No no, I never mentioned a git repository, sorry for the confusion! I 
want to make storing the upstream sources as easy as possible. Ideally 
for vim it would be:


https://sources.archlinux.org/sources/$pkgbase/whatever.tar.gz

This way we can locally provide the tarball which makepkg 
expects (the shasum should match!).
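That checksum constraint could be checked with a small helper before a mirrored file is handed to makepkg. A minimal sketch, where `verify_mirrored_source` is a hypothetical name and the expected hash would come from the PKGBUILD's `sha256sums=()` array:

```shell
#!/bin/sh
# Sketch: verify that a locally mirrored source file still matches the
# sha256 checksum recorded in the PKGBUILD, so makepkg will accept it
# exactly as if it had come from upstream.
verify_mirrored_source() {
    file=$1        # path to the mirrored tarball
    expected=$2    # sha256 taken from the PKGBUILD's sha256sums=()

    actual=$(sha256sum "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual, want $expected)" >&2
        return 1
    fi
}
```

A mismatch would be exactly the case where the archive copy must not be served as a fallback.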


However, we also accept git sources, like the linux PKGBUILD, which 
kinda messes up my plan, so I'm not sure how we handle that. Another 
issue is that we can have a package in staging which moves down to 
[extra]. So we should probably work out the removal process by keeping 
at least 3 versions of the source code. Or come up with something smarter.
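The "keep at least 3 versions" cleanup could be sketched like this, assuming a hypothetical per-$pkgbase directory layout and using file modification time as a stand-in for proper version ordering:

```shell
#!/bin/sh
# Sketch: keep only the N most recently modified source tarballs in a
# per-$pkgbase directory and delete the rest. A real implementation
# would sort by pkgver/pkgrel rather than mtime.
prune_old_sources() {
    dir=$1     # e.g. /srv/ftp/sources/$pkgbase (hypothetical layout)
    keep=$2    # number of versions to retain

    # List files newest-first, skip the first $keep, remove the rest.
    ls -1t "$dir" | tail -n +"$((keep + 1))" | while IFS= read -r f; do
        rm -- "$dir/$f"
    done
}
```

Driven from a dbscripts hook, this would naturally cover the staging-to-[extra] case, since the last few versions always survive.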



> The creation of e.g. a git repository to store the (upstream and maybe
> our) sources of a package I would also see on the side of the tooling
> creating packages and uploading artifacts to $place for releasing.
> As the upstream tarballs contained in the source tarball that makepkg
> creates are (hopefully) versioned and if we think of adding their
> contents to a git repository, we need to come up with a clever solution
> on how to deal with the changes over time.


They are versioned, example:

-rw-r--r-- 1 sourceballs sourceballs 49M May 22 13:42 zynaddsubfx-3.0.6-3.src.tar.gz




> But I'm not 100% sure I understood the idea for the creation of the
> repository yet.


Storing tarballs in git is also not great; that note was only about 
searching through the packages' sources, which can be solved later. We 
first need a consistent way to access our source code.



>> Questions:
>>
>> * How do we deal with archiving patches, PKGBUILD's etc. for GPL compliance
>> (just save it next to the code?)
>> * How do we determine when sources can be removed / cleaned up (we can't
>> store things forever). DBscripts hooks?
>> * Do we have enough disk space for archiving?


> An additional question I would like to add to your set of questions is:
> What do we do with e.g. binary only upstreams (we have a few) for which
> we would not want to create source repos or exclude the binary blobs?


So, I never said I wanted the archive in a Git repo; it might be 
required for search, but that's the next step. For now binary sources 
are treated just like normal source tarballs.



> As a sidenote:
> For repod I have just implemented the first basic (configurable)
> archiving functionality for successfully added packages:
> https://gitlab.archlinux.org/archlinux/repod/-/merge_requests/137


Cool!


> This does not yet extend towards source tarballs, as they are not
> created by repod (also source tarballs are currently still a bit of a
> backburner topic), and IMHO also should not be created by it in the
> future either, but rather by the tooling that built and pushes the
> artifacts into it.


Agreed.


> FWIW, this initial functionality also does not yet concern itself with
> any cleanup scenario of the archived files, but with being (in
> structure) compatible with dbscripts.


For archived p