(adding the collector author and python3-apt author in cc) (context for you fine folks: the Prometheus monitoring system (package prometheus-node-exporter-collectors) has a systemd timer that runs every 5 minutes and seems to sometimes hang for hours on end. The script was written by Kyle Fazzari (in cc) using the python-apt library written by jak (also in cc). I'm trying to figure out if it's using the library correctly, or if there's a bug in said library.)
On 2023-10-11 14:31:51, Antoine Beaupré wrote: > Surely there should be a timeout to this script or something? Why does > it hang in the first place? I'll investigate a little bit further. So that is what the Python script does: def _main(): cache = apt.cache.Cache() # First of all, attempt to update the index. If we don't have permission # to do so (or it fails for some reason), it's not the end of the world, # we'll operate on the old index. with contextlib.suppress(apt.cache.LockFailedException, apt.cache.FetchFailedException): cache.update() cache.open() cache.upgrade(True) _write_pending_upgrades(cache) _write_held_upgrades(cache) _write_autoremove_pending(cache) _write_reboot_required() That's the version in bookworm, the latest upstream is similar: https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/blob/34dd42ee2cf5bf1ffcfdea5a3599130f146b88fc/apt_info.py#L93 The _write_*() calls are all things that parse the cache and spew out metrics. I believe they are irrelevant to this bug. If I look at a stuck server, the process list looks like this: root 101969 0.0 0.0 4388 3140 ? Ss 10:00 0:00 /bin/bash -c /usr/share/prometheus-node-exporter-collectors/apt_info.py | sponge /var/lib/prometheus root 101971 0.0 0.9 88124 78232 ? S 10:00 0:04 \_ python3 /usr/share/prometheus-node-exporter-collectors/apt_info.py _apt 101988 0.0 0.1 21216 11700 ? S 10:00 0:00 | \_ /usr/lib/apt/methods/https _apt 101989 0.0 0.1 21164 10696 ? S 10:00 0:00 | \_ /usr/lib/apt/methods/http _apt 101990 0.0 0.1 24336 14196 ? S 10:00 0:00 | \_ /usr/lib/apt/methods/https _apt 101992 0.0 0.0 12424 6436 ? S 10:00 0:00 | \_ /usr/lib/apt/methods/gpgv root 101972 0.0 0.0 2464 908 ? S 10:00 0:00 \_ sponge /var/lib/prometheus/node-exporter/apt.prom That seems to be the APT `gpgv`, `http` and `https` methods stuck there... This particular one has been stuck for 4 hours at this point. The network processes have a socket open with the remote HTTP server in CLOSE_WAIT state. Obviously, we don't have this problem with other `apt` incantations. So something must be wrong with the way we're calling apt here... It looks like it's doing this: bool res = ListUpdate(progress, *source, pulseInterval); here: https://salsa.debian.org/apt-team/python-apt/-/blob/bdcb2550cb6f623f4556bf6581a040642f29dd28/python/cache.cc#L158 while apt itself just calls it as: if (!ListUpdate(*acquire, *m_cache->GetSourceList())) { in: https://sources.debian.org/src/libqapt/3.0.5-2/src/worker/aptworker.cpp/?hl=311#L311 ... but that looks like a different interface, perhaps, than the previous one. Here's a backtrace with debugging symbols attached for libapt: (gdb) bt #0 0x00007f4a6601d744 in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f4a65a93508 in pkgAcquire::Run (this=this@entry=0x7ffdf60fa4c0, PulseInterval=PulseInterval@entry=500000) at ./apt-pkg/acquire.cc:761 #2 0x00007f4a65b97935 in AcquireUpdate (Fetcher=..., PulseInterval=PulseInterval@entry=0, RunUpdateScripts=RunUpdateScripts@entry=true, ListCleanup=ListCleanup@entry=true) at ./apt-pkg/update.cc:71 #3 0x00007f4a65b97e26 in ListUpdate (Stat=..., List=..., PulseInterval=0) at ./apt-pkg/update.cc:37 #4 0x00007f4a65c18fe1 in ?? () from /usr/lib/python3/dist-packages/apt_pkg.cpython-311-x86_64-linux-gnu.so #5 0x00000000005230d0 in ?? () #6 0x000000000053ac2c in PyObject_Vectorcall () #7 0x000000000052b940 in _PyEval_EvalFrameDefault () #8 0x000000000052360b in PyEval_EvalCode () #9 0x0000000000647497 in ?? () #10 0x0000000000644d4f in ?? () #11 0x0000000000651010 in ?? () #12 0x0000000000650d5b in _PyRun_SimpleFileObject () #13 0x0000000000650b84 in _PyRun_AnyFileObject () #14 0x000000000064f90f in Py_RunMain () #15 0x00000000006275c7 in Py_BytesMain () #16 0x00007f4a65f461ca in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #17 0x00007f4a65f46285 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6 #18 0x0000000000627461 in _start () (gdb) up #1 0x00007f4a65a93508 in pkgAcquire::Run (this=this@entry=0x7ffdf60fa4c0, PulseInterval=PulseInterval@entry=500000) at ./apt-pkg/acquire.cc:761 761 in ./apt-pkg/acquire.cc (gdb) p (pkgAcquire const) *0x7ffdf60fa4c0 $5 = {_vptr.pkgAcquire = 0x7f4a65bf0380 <vtable for pkgAcquire+16>, LockFD = 26, d = 0x0, Items = std::vector of length 49, capacity 64 = {0x2465f50, 0x248dd00, 0x243ea90, 0x243f920, 0x2467520, 0x2479040, 0x24817b0, 0x222b3e0, 0x2482180, 0x21bbe00, 0x21c4840, 0x1f97c60, 0x1f33e50, 0x24869d0, 0x2487ac0, 0x249bea0, 0x249caa0, 0x249dd40, 0x24a0ce0, 0x24a0f40, 0x24a55e0, 0x24a62f0, 0x24a6ed0, 0x24a80c0, 0x24a8ca0, 0x24a9ee0, 0x24aabf0, 0x24a3350, 0x24a3910, 0x24a4080, 0x24ab7f0, 0x24ac7a0, 0x24ad480, 0x24e8e30, 0x24e9c00, 0x24ea880, 0x24eb3d0, 0x24ec3a0, 0x24ecfb0, 0x24f0d10, 0x24f4ec0, 0x24f5520, 0x24f6130, 0x24f7260, 0x24f8430, 0x24f95b0, 0x24fa710, 0x24fb800, 0x216e9b0}, Queues = 0x248d900, Workers = 0x223c580, Configs = 0x2246950, Log = 0x7ffdf60fa5a0, ToFetch = 2, QueueMode = pkgAcquire::QueueHost, Debug = false, Running = true} (gdb) up #2 0x00007f4a65b97935 in AcquireUpdate (Fetcher=..., PulseInterval=PulseInterval@entry=0, RunUpdateScripts=RunUpdateScripts@entry=true, ListCleanup=ListCleanup@entry=true) at ./apt-pkg/update.cc:71 71 ./apt-pkg/update.cc: No such file or directory. (gdb) p Fetcher $2 = (pkgAcquire &) @0x7ffdf60fa4c0: {_vptr.pkgAcquire = 0x7f4a65bf0380 <vtable for pkgAcquire+16>, LockFD = 26, d = 0x0, Items = std::vector of length 49, capacity 64 = {0x2465f50, 0x248dd00, 0x243ea90, 0x243f920, 0x2467520, 0x2479040, 0x24817b0, 0x222b3e0, 0x2482180, 0x21bbe00, 0x21c4840, 0x1f97c60, 0x1f33e50, 0x24869d0, 0x2487ac0, 0x249bea0, 0x249caa0, 0x249dd40, 0x24a0ce0, 0x24a0f40, 0x24a55e0, 0x24a62f0, 0x24a6ed0, 0x24a80c0, 0x24a8ca0, 0x24a9ee0, 0x24aabf0, 0x24a3350, 0x24a3910, 0x24a4080, 0x24ab7f0, 0x24ac7a0, 0x24ad480, 0x24e8e30, 0x24e9c00, 0x24ea880, 0x24eb3d0, 0x24ec3a0, 0x24ecfb0, 0x24f0d10, 0x24f4ec0, 0x24f5520, 0x24f6130, 0x24f7260, 0x24f8430, 0x24f95b0, 0x24fa710, 0x24fb800, 0x216e9b0}, Queues = 0x248d900, Workers = 0x223c580, Configs = 0x2246950, Log = 0x7ffdf60fa5a0, ToFetch = 2, QueueMode = pkgAcquire::QueueHost, Debug = false, Running = true} Note that killing the process and restarting it fixes the issue completely for us, as documented elsewhere. So I'm a bit stuck as well. I can't help but think that the previous shell script, as horrendous as it was, Just Worked for our purpose... In any case, this is an important regression for us right now.