On 2023-10-11 13:17:39, Kyle Fazzari wrote:
> On 10/11/23 12:08, Antoine Beaupré wrote:
[...]

> First of all, my primary use-case for this collector is to alert me of
> updates that need to be installed (e.g. security updates). In that
> context, operating on a stale cache would give me stale information,
> which one could argue is worse than no information. Once is probably
> fine, which is why that script silently proceeds if there is a problem
> updating the cache, but actually adding the ability to disable the cache
> update (as Daniel suggests as a potential solution) seems like it would
> be a footgun.

Yep, that makes sense.

> Second, I must admit I'm also rather flummoxed by this behavior. Earlier
> in the thread, Daniel said:
>
> > I personally run this textfile collector on a Debian bookworm system,
> > as well as apticron - so this is (I think) a similar scenario where
> > two independent processes are periodically updating the apt cache,
> > and I wondered whether that was wise or not
>
> I don't claim to be an apt expert, but I believe apt uses locks to
> prevent the scenario where this would be problematic, no? You can't run
> two `apt update`s at the same time: one fails because it can't get the
> lock. Similarly, you can't run one `apt update` at the same time as this
> Python script:
>
> import apt
> cache = apt.cache.Cache()
> cache.update()
>
> The Python script uses the same lock, and fails the same way. You can't
> run two instances of that script ^ at the same time, either. This is why
> we catch the `apt.cache.LockFailedException` error in the script. If the
> deadlock is really happening on the `cache.update()` line, that feels
> like an apt bug. Obviously jak will know better.

I don't think this is a deadlock, but I admit I haven't looked at the
code. At least the gdb backtraces don't show the hung process spinning
on a lock; it's waiting on a download:

#0 0x00007f4a6601d744 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f4a65a93508 in pkgAcquire::Run (this=this@entry=0x7ffdf60fa4c0,
   PulseInterval=PulseInterval@entry=500000) at ./apt-pkg/acquire.cc:761

That's clearly some thread waiting on the network. What I feel is
happening is that there's some timeout that exists in `apt update` but
doesn't trigger in `apt.cache.update()` (I sketch below what the script
could do about that).

> One off-the-cuff idea is that we could probably use a custom apt config
> to run the checks on our own copy of the cache without interacting with
> the system's cache, but that will greatly complicate this script.

Yeah, really, the script you wrote should Just Work. I find the
`cache.upgrade()` call to be a little strange, personally: I would try
ripping that out completely to see if it fixes the issue, but maybe you
have a better idea of why it's there in the first place?

For now, we're using a timeout mitigation in the systemd timer instead.
I've received two more warnings from the other cron job, but hopefully
this will go away.

But because `apt_info.py` can silently fail to update the cache, we may
want to add an extra metric to track the update timestamp on the mirror
info (sketched below, together with a run-duration counter); I filed
this bug about that:

https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/180

> Anyway, I'll do some experimentation and see if I can develop some
> properly-formed thoughts. Thank you so much for your response!

I think adding instrumentation around how long the script itself takes
to run would also be valuable; that could be a simple time counter added
to the script's output. This would allow tracking this problem in fleets
where there *isn't* such lock contention.
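To make the timeout theory above a bit more concrete: if
`apt.cache.update()` really has no deadline of its own, the script could
enforce one through python-apt's progress callbacks. This is an untested
sketch; the `DeadlineProgress` name and the 300-second budget are mine,
and it leans on the documented behaviour that `pulse()` returning False
cancels the fetch:

    import time

    import apt
    import apt.progress.base


    class DeadlineProgress(apt.progress.base.AcquireProgress):
        """Cancel the acquire run once the deadline has passed."""

        def __init__(self, deadline=300):
            super().__init__()
            self._abort_at = time.monotonic() + deadline

        def pulse(self, owner):
            # Returning False asks pkgAcquire::Run to stop fetching.
            return time.monotonic() < self._abort_at


    cache = apt.cache.Cache()
    try:
        cache.update(fetch_progress=DeadlineProgress(deadline=300))
    except apt.cache.FetchFailedException:
        # A cancelled fetch surfaces as a fetch failure (I believe as
        # FetchCancelledException); carry on with the lists we already
        # have, as the script does on other update failures anyway.
        pass

Something like that would make the systemd-level timeout unnecessary,
but again, I haven't actually tried it.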
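And here's roughly what I have in mind for the two extra metrics (cache
freshness and run duration). The metric names, the use of pkgcache.bin's
mtime, and the prometheus_client plumbing are all assumptions on my
part, not what apt_info.py emits today:

    import os
    import time

    from prometheus_client import CollectorRegistry, Gauge, generate_latest

    registry = CollectorRegistry()
    start = time.monotonic()

    # ... the existing package/update collection would happen here ...

    cache_ts = Gauge('apt_package_cache_timestamp_seconds',
                     'mtime of the apt package cache, to catch silent '
                     'update failures',
                     registry=registry)
    cache_ts.set(os.path.getmtime('/var/cache/apt/pkgcache.bin'))

    duration = Gauge('apt_info_collector_duration_seconds',
                     'wall-clock time spent generating this textfile',
                     registry=registry)
    duration.set(time.monotonic() - start)

    print(generate_latest(registry).decode(), end='')

With something like that, alerting on a stale
apt_package_cache_timestamp_seconds (or a slowly climbing duration)
would catch the silent-failure case even on hosts where nothing else is
fighting over the lock.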
After all, the only reason we found out about this is that we got
repeated emails from cron about apticron or other software failing to
run `apt update`. If that's removed from the equation, the script here
just fails silently, and I think that's also possibly a Bad Thing.

a.

-- 
If ease of use was the ultimate aim for a tool, the bicycle would never
have evolved beyond the tricycle.
                        — Doug Engelbart, 1925-2013