(disclaimer: I have never used aptitude)

On Wed, Apr 17, 2024 at 03:05:47PM +0200, Christoph Anton Mitterer wrote:
> May very well be an issue in APT or rather dpkg, still, since I always see it
> from aptitude, I report it here. Please re-assign accordingly.
>
> I'm seeing this since quite some releases and also every now and then in
> unstable (though probably far less there, as I only run my workstation on
> unstable, but all servers on stable).
>
> When I concurrently upgrade my servers (~60) via aptitude, out of that number
> arround 10 see the already running install/upgrade process suddenly
> interrupted with message like:

libapt has a tendency to produce a confusing order of messages, as it doesn't
print its own errors, so they tend to be printed in bulk late in the process.
Let me reorder and explain the parts I can identify:

> Unpacking util-linux-extra (2.38.1-5+deb12u1) over (2.38.1-5+b1) ...
> Setting up util-linux-extra (2.38.1-5+deb12u1) ...
> dpkg: error: dpkg frontend lock was locked by another process with pid 1064194
> Note: removing the lock file is always wrong, can damage the locked area
> and the entire system. See <https://wiki.debian.org/Teams/Dpkg/FAQ#db-lock>.
> E: Sub-process /usr/bin/dpkg returned an error code (2)

I am assuming here that unpacking and setting up of util-linux-extra worked
fine and that this dpkg run ended. The dpkg run after that, which would
probably have installed other things, failed due to a lock being held by
something else…

> dpkg: error: dpkg frontend lock was locked by another process with pid 1064194
> Note: removing the lock file is always wrong, can damage the locked area
> and the entire system. See <https://wiki.debian.org/Teams/Dpkg/FAQ#db-lock>.
> E: Sub-process dpkg --set-selections returned an error code (2)
> E: Couldn't revert dpkg selection for approved remove/purge after an error
> was encountered!

This is libapt trying to clean up after the first dpkg error, which fails
given that (re)setting dpkg selections needs the lock, too.

> Scanning processes...
> Scanning processor microcode...
> Scanning linux images...
> Running kernel seems to be up-to-date.
> The processor microcode seems to be up-to-date.
> No services need to be restarted.
> No containers need to be restarted.
> No user sessions are running outdated binaries.
> No VM guests are running outdated hypervisor (qemu) binaries on this host.

This output (which I trimmed slightly) is from needrestart. It uses an apt
hook (DPkg::Post-Invoke) that is run after libapt is done with all its dpkg
calls (regardless of whether the action was a success or not).
The frontend lock is still held while those hooks run, but they can interface
with dpkg if they want to. libdvd-pkg, for example, installs a package it has
just built in the same hook without special care (I think it is the only
package in the archive doing this). The environment of the called scripts is
prepared accordingly. (That said, I think needrestart is read-only.)

(Now aptitude takes over from libapt again and prints the errors libapt
encountered/produced.)

> Processing triggers for man-db (2.11.2-2) ...
> Processing triggers for libc-bin (2.36-9+deb12u4) ...
> Press Return to continue, 'q' followed by Return to quit.

I think aptitude runs 'dpkg --configure -a' automatically if libapt ended in
an error. Interestingly, this just runs triggers. libapt calls dpkg with
--no-triggers every time except the last, to avoid running them needlessly,
which supports my theory that it wanted to make other dpkg calls, but that
(--unpack) call failed.

> Unfortunately it doesn't tell the name of pid 1064194 and the offending
> process is typically always already gone by then.

(Maybe report that as a feature request for dpkg to show some info about the
pid instead of just the number, but that might be hard to implement.)

> Could be check_apt from Icinga or could be
> /usr/share/prometheus-node-exporter-collectors/apt_info.py
> from prometheus-node-exporter-collectors .

I don't know it, but a casual look suggests this is read-only and as such
wouldn't need any locks? I would at least hope so based on the name "info"…

> But in any case, shouldn't apitude/apt/dpkg just permantenly hold the lock
> once the process has started until it finishes?

That is how it is supposed to be, but I think aptitude was never changed to
make full use of the frontend lock.
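
To illustrate what holding the frontend lock for the whole run buys you, here
is a minimal Python sketch. It is not aptitude or libapt code. It assumes two
lock files named like dpkg's (lock-frontend and lock) in a scratch directory,
and it uses flock() instead of the fcntl() locks dpkg really takes, purely so
that contention is visible within a single process. FileLock,
other_frontend_can_start and run_frontend are names made up for this
illustration.

```python
import fcntl
import os
import tempfile


class FileLock:
    """Non-blocking advisory lock on a file (stand-in for a dpkg lock file)."""

    def __init__(self, path):
        self.path = path
        self.fd = None

    def acquire(self):
        """Try to take the lock; return True on success, False if it is held."""
        fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o640)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)
            return False
        self.fd = fd
        return True

    def release(self):
        """Drop the lock if we hold it; safe to call more than once."""
        if self.fd is not None:
            fcntl.flock(self.fd, fcntl.LOCK_UN)
            os.close(self.fd)
            self.fd = None


def other_frontend_can_start(lock_dir):
    """Simulate a second frontend: frontend lock first, then the dpkg lock."""
    frontend = FileLock(os.path.join(lock_dir, "lock-frontend"))
    inner = FileLock(os.path.join(lock_dir, "lock"))
    if not frontend.acquire():
        return False
    got_inner = inner.acquire()
    inner.release()
    frontend.release()
    return got_inner


def run_frontend(lock_dir, keep_frontend_lock):
    """Return True if another frontend could slip in while 'dpkg' would run."""
    frontend = FileLock(os.path.join(lock_dir, "lock-frontend"))
    inner = FileLock(os.path.join(lock_dir, "lock"))
    if not (frontend.acquire() and inner.acquire()):
        raise RuntimeError("could not take the initial locks")
    # Hand over to dpkg, which wants to take the dpkg lock itself.
    inner.release()
    if not keep_frontend_lock:
        # Old pattern: release everything and hope nobody grabs it.
        frontend.release()
    intruded = other_frontend_can_start(lock_dir)
    # The 'dpkg' call is over; clean up whatever we still hold.
    inner.release()
    frontend.release()
    return intruded


if __name__ == "__main__":
    scratch = tempfile.mkdtemp()
    print("old pattern, intruder got in:", run_frontend(scratch, False))
    print("frontend lock kept, intruder got in:", run_frontend(scratch, True))
```

With the old pattern the second frontend gets both locks in the gap; with the
frontend lock kept, it is turned away before it can touch the dpkg lock.
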

Probably unrelated to this issue, but a quick grep on aptitude shows me:
| $ git grep -A 2 -- '->ReleaseLock' src/generic/apt/aptcache.cc
| :1006: apt_cache_file->ReleaseLock();
| -1007- bool dpkg_selections_saved = dpkg_selections.save_selections();
| -1008- if (! apt_cache_file->GainLock())

which is the old pattern of releasing the lock and calling dpkg in the hope
that nothing grabs it in the meantime. That was the practice before dpkg
gained the frontend lock. (These are aptitude's own methods that wrap
_system->Lock() from libapt, which acquires both the frontend and the dpkg
lock, and also releases both if told to.)

The solution here should be to hold onto the frontend lock for the entire run
and do the lock & unlock dance only for the dpkg lock, for compatibility.
_system->LockInner() is part of that, and grep has no hits for it in aptitude.

So, my suspicion is that aptitude doesn't use the frontend lock and is hence
prone to other front ends grabbing the dpkg (and frontend) lock the moment it
releases the dpkg lock for dpkg. Hence the two failures; the run of
needrestart then takes long enough for the other front end to finish, so that
the last dpkg call aptitude makes succeeds again.

Someone who knows aptitude better, or at least has more than a passing
interest in aptitude, should check the code to prove the suspicions made here
(or disprove them, of course).


Best regards

David Kalnischkies