Ian Jackson writes ("dgit: tests: tag2upload-oracled process leak, external 
lossage"):
...
> I ran ps -Hfuian | grep tag2 and found only this and my emacs:
> 
>   ian      25301     1  0 Mar22 pts/198  00:03:49 /usr/bin/perl 
> /home/ian/things/Dgit/dgit/infra/tag2upload-oracled -D 
> --adt-virt=/home/ian/things/Dgit/dgit/tests/autopkgtest-virt-null-for-test 
> --retain-tmp --builder=builder@t2u-b --noreply=test-dummy t2u service 
> <nore...@example.org> --reply-to=reply...@example.org 
> --copies=cop...@example.org --worker-restart-timeout=4 
> --ssh=/home/ian/things/Dgit/dgit/tests/ssh --manager=t2u-service-manager-host 
> --manager-socket=/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/o2m.s
>  test-dummy 
> /home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/distro=test-dummy 
> /home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dd.gpg,a:/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/t2uokr.gpg,a:/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dm.gpg,m/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dm.txt
>   ian      17838     1  0 Mar21 pts/51   00:04:06 /usr/bin/perl 
> /home/ian/things/Dgit/dgit/infra/tag2upload-oracled -D 
> --adt-virt=/home/ian/things/Dgit/dgit/tests/autopkgtest-virt-null-for-test 
> --retain-tmp --builder=builder@t2u-b --noreply=test-dummy t2u service 
> <nore...@example.org> --reply-to=reply...@example.org 
> --copies=cop...@example.org --worker-restart-timeout=4 
> --ssh=/home/ian/things/Dgit/dgit/tests/ssh --manager=t2u-service-manager-host 
> --manager-socket=/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/o2m.s
>  test-dummy 
> /home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/distro=test-dummy 
> /home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dd.gpg,a:/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/t2uokr.gpg,a:/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dm.gpg,m/home/ian/things/Dgit/dgit/tests/tmp/tag2upload-oracled/dm.txt

Observations:

There are two processes with different ttys.  So this leakage happened
more than once.

I compared the command line arguments of pid 25301 sith the oracled
seen in ps when running the test suite with current msster
(9ca5d4dfc).  They were identical.  So this postdates my rewrite of
the oracled management.

In my local experiment, however, ps also showed a t2u-fake-manager
process.  I think we can therefore conclude that t2u-fake-manager had
exited.  t2u-fake-manager is run synchronously by the test case
script.

The log of a local rerun of this test case doesn't contain the word
`kill` in its log, nor in its subdir of tmp/.  So I think the errant
signals were probably not sent by one of the test scripts, since that
would show up in set -x output.  (Notably, this didn't happen via
end_kill_pids.)

I searched the tag2upload-oracled script for `kill` and found the code
whlch tries to reflect SIGTERM to its worker children.  We know that
oracled's worker tracking is busted:
  https://salsa.debian.org/dgit-team/dgit/-/issues/66

I hypothesise as follows, in one or both of these test runs:

Somehow the oracled process was leaked by the test case - maybe it
didn't honour ^C, or I had a broken version, or something.  But fake
manager *did* die.

Now the oracled keeps trying to restart workers.

Each worker tries to open the testbed via
autopkgtest-virt-null-for-test.  autopkgtest-virt-null-for-test
contains a mkdir -p "$TMPDIR" that would recreate tests/tmp as I saw.

Each worker would then try to connect to the manager, but there is no
manager, so it would die.

oracled then mistracks its children.  It somehow restarts the failed
worker (after a delay) but it also leaves it in the array with the
pids.  Later that same pid is reused by another process.
Alternatively, the pid array is corrupted some other way; it would be
sufficient, to trigger this bug, for it to reap children without
deleting their pids from its records.

Each restart (probably, many restarts since this was going on since
the 22nd of March) adds a pid to the worker array that isn't a child
of oracled.

When I said `killall tag2upload-oracled` it sent signals to all these
pids that weren't its children.  This would have caused it to print
error messages somewhere but probably to a long-deleted logfile.

I think this would explain all the observed behaviours.

I am going to investigate further.

In particular, I am going to

 - investigate oracled's pid tracking (dgit#66).
 - possibly have oracled have belt and braces against mistracking
 - maybe have a way for oracled to detect it's been orphaned and quit
 - try to see out if I can repro the leaked process

-- 
Ian Jackson <ijack...@chiark.greenend.org.uk>   These opinions are my own.  

Pronouns: they/he.  If I emailed you from @fyvzl.net or @evade.org.uk,
that is a private address which bypasses my fierce spamfilter.

Reply via email to