(adding the collector author and python3-apt author in cc)

(context for you fine folks: the Prometheus node exporter's apt collector
(package prometheus-node-exporter-collectors) ships a systemd timer that
runs a script every 5 minutes, and that script sometimes hangs for hours
on end. The script was written by Kyle Fazzari (in cc) using the
python-apt library written by jak (also in cc). I'm trying to figure out
whether it's using the library correctly, or whether there's a bug in
said library.)

On 2023-10-11 14:31:51, Antoine Beaupré wrote:
> Surely there should be a timeout to this script or something? Why does
> it hang in the first place? I'll investigate a little bit further.

So here is what the Python script does:

def _main():
    cache = apt.cache.Cache()

    # First of all, attempt to update the index. If we don't have permission
    # to do so (or it fails for some reason), it's not the end of the world,
    # we'll operate on the old index.
    with contextlib.suppress(apt.cache.LockFailedException,
                             apt.cache.FetchFailedException):
        cache.update()

    cache.open()
    cache.upgrade(True)
    _write_pending_upgrades(cache)
    _write_held_upgrades(cache)
    _write_autoremove_pending(cache)
    _write_reboot_required()

That's the version in bookworm; the latest upstream is similar:

https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/blob/34dd42ee2cf5bf1ffcfdea5a3599130f146b88fc/apt_info.py#L93

The _write_*() calls are all things that parse the cache and spew out
metrics. I believe they are irrelevant to this bug.
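
For what it's worth, I believe this should be reproducible without any of
the _write_*() helpers, with just the update path isolated. A minimal,
untested sketch, assuming python3-apt is installed and the same privileges
the timer runs with:

    #!/usr/bin/python3
    # hypothetical standalone reproducer: only exercises the cache.update()
    # call that the collector makes, nothing else
    import contextlib

    import apt.cache

    cache = apt.cache.Cache()
    with contextlib.suppress(apt.cache.LockFailedException,
                             apt.cache.FetchFailedException):
        cache.update()  # this is the call that ends up in pkgAcquire::Run()
    print("update finished")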

If I look at a stuck server, the process list looks like this:

root      101969  0.0  0.0   4388  3140 ?        Ss   10:00   0:00 /bin/bash -c /usr/share/prometheus-node-exporter-collectors/apt_info.py | sponge /var/lib/prometheus
root      101971  0.0  0.9  88124 78232 ?        S    10:00   0:04  \_ python3 /usr/share/prometheus-node-exporter-collectors/apt_info.py
_apt      101988  0.0  0.1  21216 11700 ?        S    10:00   0:00  |   \_ /usr/lib/apt/methods/https
_apt      101989  0.0  0.1  21164 10696 ?        S    10:00   0:00  |   \_ /usr/lib/apt/methods/http
_apt      101990  0.0  0.1  24336 14196 ?        S    10:00   0:00  |   \_ /usr/lib/apt/methods/https
_apt      101992  0.0  0.0  12424  6436 ?        S    10:00   0:00  |   \_ /usr/lib/apt/methods/gpgv
root      101972  0.0  0.0   2464   908 ?        S    10:00   0:00  \_ sponge /var/lib/prometheus/node-exporter/apt.prom

Those seem to be the APT `gpgv`, `http` and `https` methods stuck
there... This particular one has been stuck for 4 hours at this
point. The network processes each have a socket open to the remote HTTP
server, sitting in CLOSE_WAIT state.

Obviously, we don't have this problem with other `apt` incantations, so
something must be wrong with the way we're calling apt here...
python-apt's cache.update() looks like it's doing this:

   bool res = ListUpdate(progress, *source, pulseInterval);

here:

https://salsa.debian.org/apt-team/python-apt/-/blob/bdcb2550cb6f623f4556bf6581a040642f29dd28/python/cache.cc#L158

while libqapt, for comparison, calls it as:

    if (!ListUpdate(*acquire, *m_cache->GetSourceList())) {

in:

https://sources.debian.org/src/libqapt/3.0.5-2/src/worker/aptworker.cpp/?hl=311#L311

... but that is perhaps a different interface than the previous one.
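
While comparing those call sites I also wondered whether the script should
at least be tightening apt's own acquire timeouts before calling
update(). I doubt it helps when the method processes themselves are
wedged like above, but for reference, this is the kind of thing I mean; a
sketch, untested, and the values are arbitrary:

    import apt.cache  # importing apt initialises apt_pkg's configuration
    import apt_pkg

    # Acquire::http::Timeout defaults to 120 seconds per apt.conf(5);
    # I believe the https method falls back to the http settings
    apt_pkg.config.set("Acquire::http::Timeout", "30")
    apt_pkg.config.set("Acquire::Retries", "1")

    cache = apt.cache.Cache()
    cache.update()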

Here's a backtrace, taken with debugging symbols installed for libapt:

(gdb) bt
#0  0x00007f4a6601d744 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4a65a93508 in pkgAcquire::Run (this=this@entry=0x7ffdf60fa4c0, PulseInterval=PulseInterval@entry=500000) at ./apt-pkg/acquire.cc:761
#2  0x00007f4a65b97935 in AcquireUpdate (Fetcher=..., PulseInterval=PulseInterval@entry=0, RunUpdateScripts=RunUpdateScripts@entry=true, ListCleanup=ListCleanup@entry=true) at ./apt-pkg/update.cc:71
#3  0x00007f4a65b97e26 in ListUpdate (Stat=..., List=..., PulseInterval=0) at ./apt-pkg/update.cc:37
#4  0x00007f4a65c18fe1 in ?? () from /usr/lib/python3/dist-packages/apt_pkg.cpython-311-x86_64-linux-gnu.so
#5  0x00000000005230d0 in ?? ()
#6  0x000000000053ac2c in PyObject_Vectorcall ()
#7  0x000000000052b940 in _PyEval_EvalFrameDefault ()
#8  0x000000000052360b in PyEval_EvalCode ()
#9  0x0000000000647497 in ?? ()
#10 0x0000000000644d4f in ?? ()
#11 0x0000000000651010 in ?? ()
#12 0x0000000000650d5b in _PyRun_SimpleFileObject ()
#13 0x0000000000650b84 in _PyRun_AnyFileObject ()
#14 0x000000000064f90f in Py_RunMain ()
#15 0x00000000006275c7 in Py_BytesMain ()
#16 0x00007f4a65f461ca in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x00007f4a65f46285 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#18 0x0000000000627461 in _start ()

(gdb) up
#1  0x00007f4a65a93508 in pkgAcquire::Run (this=this@entry=0x7ffdf60fa4c0, PulseInterval=PulseInterval@entry=500000) at ./apt-pkg/acquire.cc:761
761     in ./apt-pkg/acquire.cc
(gdb) p (pkgAcquire const) *0x7ffdf60fa4c0
$5 = {_vptr.pkgAcquire = 0x7f4a65bf0380 <vtable for pkgAcquire+16>, LockFD = 26, d = 0x0,
  Items = std::vector of length 49, capacity 64 = {0x2465f50, 0x248dd00, 0x243ea90, 0x243f920,
    0x2467520, 0x2479040, 0x24817b0, 0x222b3e0, 0x2482180, 0x21bbe00, 0x21c4840, 0x1f97c60,
    0x1f33e50, 0x24869d0, 0x2487ac0, 0x249bea0, 0x249caa0, 0x249dd40, 0x24a0ce0, 0x24a0f40,
    0x24a55e0, 0x24a62f0, 0x24a6ed0, 0x24a80c0, 0x24a8ca0, 0x24a9ee0, 0x24aabf0, 0x24a3350,
    0x24a3910, 0x24a4080, 0x24ab7f0, 0x24ac7a0, 0x24ad480, 0x24e8e30, 0x24e9c00, 0x24ea880,
    0x24eb3d0, 0x24ec3a0, 0x24ecfb0, 0x24f0d10, 0x24f4ec0, 0x24f5520, 0x24f6130, 0x24f7260,
    0x24f8430, 0x24f95b0, 0x24fa710, 0x24fb800, 0x216e9b0},
  Queues = 0x248d900, Workers = 0x223c580, Configs = 0x2246950, Log = 0x7ffdf60fa5a0,
  ToFetch = 2, QueueMode = pkgAcquire::QueueHost, Debug = false, Running = true}
(gdb) up
#2  0x00007f4a65b97935 in AcquireUpdate (Fetcher=..., PulseInterval=PulseInterval@entry=0, RunUpdateScripts=RunUpdateScripts@entry=true, ListCleanup=ListCleanup@entry=true) at ./apt-pkg/update.cc:71
71      ./apt-pkg/update.cc: No such file or directory.
(gdb) p Fetcher
$2 = (pkgAcquire &) @0x7ffdf60fa4c0: {_vptr.pkgAcquire = 0x7f4a65bf0380 <vtable for pkgAcquire+16>, LockFD = 26, d = 0x0,
  Items = std::vector of length 49, capacity 64 = {0x2465f50, 0x248dd00, 0x243ea90, 0x243f920,
    0x2467520, 0x2479040, 0x24817b0, 0x222b3e0, 0x2482180, 0x21bbe00, 0x21c4840, 0x1f97c60,
    0x1f33e50, 0x24869d0, 0x2487ac0, 0x249bea0, 0x249caa0, 0x249dd40, 0x24a0ce0, 0x24a0f40,
    0x24a55e0, 0x24a62f0, 0x24a6ed0, 0x24a80c0, 0x24a8ca0, 0x24a9ee0, 0x24aabf0, 0x24a3350,
    0x24a3910, 0x24a4080, 0x24ab7f0, 0x24ac7a0, 0x24ad480, 0x24e8e30, 0x24e9c00, 0x24ea880,
    0x24eb3d0, 0x24ec3a0, 0x24ecfb0, 0x24f0d10, 0x24f4ec0, 0x24f5520, 0x24f6130, 0x24f7260,
    0x24f8430, 0x24f95b0, 0x24fa710, 0x24fb800, 0x216e9b0},
  Queues = 0x248d900, Workers = 0x223c580, Configs = 0x2246950, Log = 0x7ffdf60fa5a0,
  ToFetch = 2, QueueMode = pkgAcquire::QueueHost, Debug = false, Running = true}
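
Note that Log is non-NULL above and PulseInterval is 500000 (µs), so the
progress object python-apt installed should presumably still be getting
its pulse() callback every half second even though nothing is moving on
the wire. Since python-apt documents that returning False from
AcquireProgress.pulse() cancels the fetch, the collector could in
principle bound the update itself. A sketch, untested, with an arbitrary
deadline; I'm not sure which exception a cancelled update surfaces as,
hence catching both:

    import contextlib
    import time

    import apt.cache
    import apt.progress.base

    class DeadlineProgress(apt.progress.base.AcquireProgress):
        """Hypothetical progress handler that gives up after a deadline."""

        def __init__(self, deadline=240):
            super().__init__()
            self._cutoff = time.monotonic() + deadline

        def pulse(self, owner):
            # returning False asks pkgAcquire::Run() to stop the fetch
            return time.monotonic() < self._cutoff

    cache = apt.cache.Cache()
    with contextlib.suppress(apt.cache.LockFailedException,
                             apt.cache.FetchFailedException,
                             apt.cache.FetchCancelledException):
        # worst case we fall back to the old index, as the script already does
        cache.update(DeadlineProgress())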

Note that killing the process and restarting it fixes the issue
completely for us, as documented elsewhere.

So I'm a bit stuck as well. I can't help but think that the previous
shell script, as horrendous as it was, Just Worked for our purpose...

In any case, this is an important regression for us right now.
