Ian Jackson writes ("dgit: tests: tag2upload-oracled process leak, external lossage"):
...
> I ran ps -Hfuian | grep tag2 and found only this and my emacs:
>
> ian 25301 1 0 Mar22 pts/198 00:03:49 /usr/bin/perl
> /home/ian/things/Dgit/dgit/infra/tag2upload-oracled -D
> --adt-virt=/home/ian/things/Dgit/dgit/tests/autopkgtest-virt-null-for-test
> --retain-tmp --builder=builder@t2u-b --noreply=test-dummy t2u service
> <nore...@example.org> --reply-to=reply...@example.org
> --copies=cop...@example.org --worker-restart-timeout=4
> --ssh=/home/ian/things/Dgit/dgit/tests/ssh --manager=t2u-service-manager-host
> --manager-socket=/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/o2m.s
> test-dummy
> /home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/distro=test-dummy
> /home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dd.gpg,a:/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/t2uokr.gpg,a:/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dm.gpg,m/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dm.txt
> ian 17838 1 0 Mar21 pts/51 00:04:06 /usr/bin/perl
> /home/ian/things/Dgit/dgit/infra/tag2upload-oracled -D
> --adt-virt=/home/ian/things/Dgit/dgit/tests/autopkgtest-virt-null-for-test
> --retain-tmp --builder=builder@t2u-b --noreply=test-dummy t2u service
> <nore...@example.org> --reply-to=reply...@example.org
> --copies=cop...@example.org --worker-restart-timeout=4
> --ssh=/home/ian/things/Dgit/dgit/tests/ssh --manager=t2u-service-manager-host
> --manager-socket=/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/o2m.s
> test-dummy
> /home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/distro=test-dummy
> /home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dd.gpg,a:/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/t2uokr.gpg,a:/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dm.gpg,m/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dm.txt
Observations:

There are two processes with different ttys.  So this leakage happened
more than once.

I compared the command line arguments of pid 25301 with the oracled
seen in ps when running the test suite with current master
(9ca5d4dfc).  They were identical.  So this postdates my rewrite of
the oracled management.

In my local experiment, however, ps also showed a t2u-fake-manager
process.  I think we can therefore conclude that, in the leaked runs,
t2u-fake-manager had exited.  t2u-fake-manager is run synchronously by
the test case script.

The log of a local rerun of this test case doesn't contain the word
`kill`, nor does its subdir of tmp/.  So I think the errant signals
were probably not sent by one of the test scripts, since that would
show up in set -x output.  (Notably, this didn't happen via
end_kill_pids.)

I searched the tag2upload-oracled script for `kill` and found the code
which tries to reflect SIGTERM to its worker children.

We know that oracled's worker tracking is busted:
  https://salsa.debian.org/dgit-team/dgit/-/issues/66

I hypothesise as follows, in one or both of these test runs:

Somehow the oracled process was leaked by the test case - maybe it
didn't honour ^C, or I had a broken version, or something.  But the
fake manager *did* die.

Now the oracled keeps trying to restart workers.  Each worker tries to
open the testbed via autopkgtest-virt-null-for-test.
autopkgtest-virt-null-for-test contains a mkdir -p "$TMPDIR" that
would recreate tests/tmp as I saw.  Each worker would then try to
connect to the manager, but there is no manager, so it would die.

oracled then mistracks its children.  It somehow restarts the failed
worker (after a delay) but it also leaves the old pid in its array of
worker pids.  Later that same pid is reused by another process.
Alternatively, the pid array is corrupted some other way; it would be
sufficient, to trigger this bug, for it to reap children without
deleting their pids from its records.  (There is a sketch of the
invariant I mean at the end of this mail.)

Each restart (probably many restarts, since this had been going on
since the 22nd of March) adds a pid to the worker array that isn't a
child of oracled.

When I said `killall tag2upload-oracled` it sent signals to all these
pids that weren't its children.  This would have caused it to print
error messages somewhere, but probably to a long-deleted logfile.

I think this would explain all the observed behaviours.

I am going to investigate further.  In particular, I am going to

 - investigate oracled's pid tracking (dgit#66)
 - possibly give oracled belt and braces against mistracking
 - maybe have a way for oracled to detect it's been orphaned, and quit
 - try to see if I can repro the leaked process

-- 
Ian Jackson <ijack...@chiark.greenend.org.uk>   These opinions are my own.

Pronouns: they/he.  If I emailed you from @fyvzl.net or @evade.org.uk,
that is a private address which bypasses my fierce spamfilter.
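
Sketch referenced above.  This is illustrative only, not the actual
oracled code: %workers, start_worker and reap_workers are made-up
names.  The invariant it shows is that every reaped pid is deleted
from the worker table at the moment it is reaped, so reflecting
SIGTERM onto that table can never hit a pid that has since been reused
by an unrelated process.

#!/usr/bin/perl -w
# Illustrative sketch only, not the actual oracled code.
# Invariant: every pid in %workers is a live, unreaped child of this
# process, so reflecting a signal onto %workers can never hit a pid
# reused by some unrelated process.
use strict;
use POSIX qw(WNOHANG);

our %workers;    # pid => description of that worker

sub start_worker {
    my ($what) = @_;
    my $pid = fork() // die "fork: $!";
    if (!$pid) {
        # Child: stand-in for a real worker process.
        exec 'sleep', '3' or die "exec: $!";
    }
    $workers{$pid} = $what;
}

sub reap_workers {
    # Delete each reaped pid from %workers *immediately*.  Reaping a
    # child but leaving its pid behind is exactly the mistracking
    # hypothesised above.
    while ((my $got = waitpid(-1, WNOHANG)) > 0) {
        my $what = delete $workers{$got};
        warn "reaped unknown child $got\n" unless defined $what;
    }
}

$SIG{CHLD} = \&reap_workers;

$SIG{TERM} = sub {
    # Reflect SIGTERM only to pids still believed to be our children.
    reap_workers();                         # drop anything already dead
    kill 'TERM', keys %workers if %workers;
    exit 0;
};

start_worker("demo worker $_") for 1..2;
sleep 1 while %workers;    # idle until all workers have been reaped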