On Tue, Nov 01, 2016 at 09:24:06PM +0100, gregor herrmann wrote:
> On Sun, 30 Oct 2016 20:03:49 +0200, Niko Tyni wrote:
>
> > Given it fails somewhat regularly on both ci.debian.net and
> > tests.reproducible-builds.org, possibly a faster machine would improve
> > the chances of reproducing it. Just getting the log of 'strace -f
> > -olog prove -l t/01-starter.t' when it locks up would help tremendously,
> > but I ran it for two hours or so like that without a single lockup.
>
> I failed as well on Sunday but today I succeeded.
> Attached is the output of
>
> while :; do strace -f -olog prove -l t/04-starter-dir.t t/05-killolddelay.t
> t/06-autorestart.t || break ; done

Oh awesome, thanks! Note to self: next time ask for time stamps too
(strace -ttt or so). But this will do quite fine :)

The problem (at least the one visible in this trace) seems to be related
to Test::TCP. The test_tcp() call in t/06-autorestart.t finds an empty
port with Net::EmptyPort, then passes it to both the client and the
server code. The server starts up in a child process in
Test::TCP::start(), but gets EADDRINUSE when binding the listener socket
for some reason. The parent process in Test::TCP::start() then hangs in
Net::EmptyPort::wait_port(), waiting for the server to start accepting
connections on the port before calling the client code, but always
getting ECONNREFUSED.

The Server::Starter tests should probably specify a max_wait parameter
to test_tcp(). That should fix at least these hangs, probably in
exchange for test failures.

However, I'm not sure what causes the EADDRINUSE error. Either the
kernel keeps the port reserved even after it got closed (Net::EmptyPort
finds a port by binding to one and then closing it immediately), or some
unrelated process steals the port in between, possibly for a
non-listener socket (hence ECONNREFUSED).

The latter explanation feels somewhat more plausible, particularly as
the hangs seem to happen more on busy hosts. This should be easy-ish to
demonstrate but I'm out of time for tonight.

I'm not totally convinced this is the same hang I was seeing in my
earlier investigations fwiw, but it's at least a step forward :)

-- 
Niko
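
A minimal sketch of the second explanation (a port taken by a
non-listener socket between the probe and the server's bind), written
against the plain socket API rather than the Perl modules involved. It
is an illustration under assumptions, not code from Test::TCP or
Net::EmptyPort: the helper names find_empty_port() and
demonstrate_race() are hypothetical, the "squatter" socket stands in for
the unrelated process, and the errno values noted are the ones expected
on Linux.

```python
import errno
import socket

LOOPBACK = "127.0.0.1"


def find_empty_port():
    """Probe for a free port by binding to port 0 and closing right away.

    This mirrors the bind-then-close approach described in the mail; the
    port is not reserved once the socket is closed.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((LOOPBACK, 0))
        return probe.getsockname()[1]


def demonstrate_race():
    """Return (server_errno, client_errno) seen after the port is taken."""
    port = find_empty_port()

    # Hypothetical unrelated process: binds the port but never listens.
    squatter = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    squatter.bind((LOOPBACK, port))
    try:
        # The "server" now tries to bind its listener to the probed port.
        server_errno = None
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            server.bind((LOOPBACK, port))
            server.listen(5)
        except OSError as exc:
            server_errno = exc.errno
        finally:
            server.close()

        # The "parent" polls the port the way a wait-for-port loop would.
        client_errno = None
        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            client.connect((LOOPBACK, port))
        except OSError as exc:
            client_errno = exc.errno
        finally:
            client.close()
    finally:
        squatter.close()

    return server_errno, client_errno


if __name__ == "__main__":
    server_errno, client_errno = demonstrate_race()
    print("server bind:", errno.errorcode.get(server_errno, server_errno))
    print("client connect:", errno.errorcode.get(client_errno, client_errno))
```

If the sketch matches what happens on the busy hosts, the server side
would see EADDRINUSE while the polling side keeps seeing ECONNREFUSED,
which is the combination visible in the trace. Without an upper bound on
the polling (what max_wait would provide), that state never resolves.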