On 2023-10-11 13:17:39, Kyle Fazzari wrote:
> On 10/11/23 12:08, Antoine Beaupré wrote:
[...]

> First of all, my primary use-case for this collector is to alert me of
> updates that need to be installed (e.g. security updates). In that
> context, operating on a stale cache would give me stale information,
> which one could argue is worse than no information. Once is probably
> fine, which is why that script silently proceeds if there is a problem
> updating the cache, but actually adding the ability to disable the cache
> update (as Daniel suggests as a potential solution) seems like it would
> be a footgun.

Yep, that makes sense.

> Second, I must admit I'm also rather flummoxed by this behavior. Earlier
> in the thread, Daniel said:
>
> > I personally run this textfile collector on a Debian bookworm system,
> > as well as apticron - so this is (I think) a similar scenario where
> > two independent processes are periodically updating the apt cache,
> > and I wondered whether that was wise or not
>
> I don't claim to be an apt expert, but I believe apt uses locks to
> prevent the scenario where this would be problematic, no? You can't run
> two `apt update`s at the same time: one fails because it can't get the
> lock. Similarly, you can't run one `apt update` at the same time as this
> Python script:
>
> import apt
> cache = apt.cache.Cache()
> cache.update()
>
> The Python script uses the same lock, and fails the same way. You can't
> run two instances of that script ^ at the same time, either. This is why
> we catch the `apt.cache.LockFailedException` error in the script. If the
> deadlock is really happening on the `cache.update()` line, that feels
> like an apt bug. Obviously jak will know better.

I don't think this is a deadlock, but I admit I haven't looked at the
code. At least the gdb backtraces don't show the hung process spinning
on a lock; it's waiting on a download:

#0 0x00007f4a6601d744 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f4a65a93508 in pkgAcquire::Run (this=this@entry=0x7ffdf60fa4c0,
   PulseInterval=PulseInterval@entry=500000) at ./apt-pkg/acquire.cc:761

That's clearly some thread waiting on the network. What I feel is
happening is that there's some timeout that exists in `apt update` but
doesn't trigger in `apt.cache.update()` (I sketch below what the script
could do about that).

> One off-the-cuff idea is that we could probably use a custom apt config
> to run the checks on our own copy of the cache without interacting with
> the system's cache, but that will greatly complicate this script.

Yeah, really, the script you wrote should Just Work. I find the
`cache.upgrade()` call to be a little strange, personally: I would try
ripping that out completely to see if it fixes the issue, but maybe you
have a better idea of why it's there in the first place?

For now, we're using a timeout mitigation in the systemd timer instead.
I've received two more warnings from the other cron job, but hopefully
this will go away.

But because `apt_info.py` can silently fail to update the cache, we may
want to add an extra metric to track the update timestamp on the mirror
info (sketched below, together with a run-duration counter); I filed
this bug about that:

https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/180

> Anyway, I'll do some experimentation and see if I can develop some
> properly-formed thoughts. Thank you so much for your response!

I think adding instrumentation around how long the script itself takes
to run would also be valuable; that could be a simple time counter added
to the script's output. This would allow tracking this problem in fleets
where there *isn't* such lock contention.
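To make the timeout theory above a bit more concrete: if
`apt.cache.update()` really has no deadline of its own, the script could
enforce one through python-apt's progress callbacks. This is an untested
sketch; the `DeadlineProgress` name and the 300-second budget are mine,
and it leans on the documented behaviour that `pulse()` returning False
cancels the fetch:

    import time

    import apt
    import apt.progress.base


    class DeadlineProgress(apt.progress.base.AcquireProgress):
        """Cancel the acquire run once the deadline has passed."""

        def __init__(self, deadline=300):
            super().__init__()
            self._abort_at = time.monotonic() + deadline

        def pulse(self, owner):
            # Returning False asks pkgAcquire::Run to stop fetching.
            return time.monotonic() < self._abort_at


    cache = apt.cache.Cache()
    try:
        cache.update(fetch_progress=DeadlineProgress(deadline=300))
    except apt.cache.FetchFailedException:
        # A cancelled fetch surfaces as a fetch failure (I believe as
        # FetchCancelledException); carry on with the lists we already
        # have, as the script does on other update failures anyway.
        pass

Something like that would make the systemd-level timeout unnecessary,
but again, I haven't actually tried it.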
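And here's roughly what I have in mind for the two extra metrics (cache
freshness and run duration). The metric names, the use of pkgcache.bin's
mtime, and the prometheus_client plumbing are all assumptions on my
part, not what apt_info.py emits today:

    import os
    import time

    from prometheus_client import CollectorRegistry, Gauge, generate_latest

    registry = CollectorRegistry()
    start = time.monotonic()

    # ... the existing package/update collection would happen here ...

    cache_ts = Gauge('apt_package_cache_timestamp_seconds',
                     'mtime of the apt package cache, to catch silent '
                     'update failures',
                     registry=registry)
    cache_ts.set(os.path.getmtime('/var/cache/apt/pkgcache.bin'))

    duration = Gauge('apt_info_collector_duration_seconds',
                     'wall-clock time spent generating this textfile',
                     registry=registry)
    duration.set(time.monotonic() - start)

    print(generate_latest(registry).decode(), end='')

With something like that, alerting on a stale
apt_package_cache_timestamp_seconds (or a slowly climbing duration)
would catch the silent-failure case even on hosts where nothing else is
fighting over the lock.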
After all, the only reason we found out about this is that we got
repeated emails from cron about apticron or other software failing to
run `apt update`. If that's removed from the equation, the script here
just fails silently, and I think that's also possibly a Bad Thing.

a.

-- 
If ease of use was the ultimate aim for a tool, the bicycle would never
have evolved beyond the tricycle.
                        — Doug Engelbart, 1925-2013