On Tue, 2025-07-15 at 22:08 +0100, Rebecca N. Palmer wrote:
> 
> > I could get test_nanny to never throw an error in the loop by adding
> > await asyncio.sleep(0.1) into the function before it started trying
> > to use the cluster
> 
> That's plausibly a better idea, but I haven't tried it.

Unfortunately, it doesn't work on the ppc64el porterbox
platti.debian.org.

I still get a fair number of timeout test failures when running
test_nanny in a loop.
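For anyone who wants to reproduce this, a loop along the following lines should do. It is a sketch, not the exact command I ran. It assumes pytest is installed and that it is run from the root of the distributed source tree; the test path and the TEST_NANNY_CMD and run_in_loop names are illustrative assumptions.

```python
import subprocess
import sys

# Assumed invocation: run from the root of the distributed source tree
# with pytest available; adjust the path for other layouts.
TEST_NANNY_CMD = [sys.executable, "-m", "pytest", "distributed/tests/test_nanny.py"]


def run_in_loop(cmd, iterations):
    """Run cmd repeatedly and return how many runs exited non-zero."""
    failures = 0
    for i in range(iterations):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failures += 1
            print(f"run {i}: failed with exit code {result.returncode}")
    return failures
```

Calling run_in_loop(TEST_NANNY_CMD, 50) then reports how many of the 50 runs failed. Each run is a fresh process, so state left over from one run cannot leak into the next.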

I did manage to capture log output from a little earlier, around the
point where a failing run starts to look different from a passing one.

When a run is about to fail, I get something like the following. The
first sign of trouble is the worker stopping with reason
worker-handle-scheduler-connection-broken. It then waits for the nanny
to shut down, and once that wait times out it starts spewing stack
traces and complaining about 0-byte TLS responses.

2025-07-15 22:25:26,920 - distributed.core - INFO - Connection to tls://127.0.0.1:51942 has been closed.
2025-07-15 22:25:26,920 - distributed.scheduler - INFO - Remove worker addr: tls://127.0.0.1:46743 name: 0 (stimulus_id='handle-worker-cleanup-1752618326.9205813')
2025-07-15 22:25:26,921 - distributed.core - INFO - Starting established connection to tls://127.0.0.1:42399
2025-07-15 22:25:26,922 - distributed.core - INFO - Connection to tls://127.0.0.1:42399 has been closed.
2025-07-15 22:25:26,922 - distributed.worker - INFO - Stopping worker at tls://127.0.0.1:46743. Reason: worker-handle-scheduler-connection-broken
2025-07-15 22:25:26,969 - distributed.nanny - INFO - Closing Nanny gracefully at 'tls://127.0.0.1:35207'. Reason: worker-handle-scheduler-connection-broken
2025-07-15 22:25:26,970 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-07-15 22:25:26,971 - distributed.nanny - INFO - Worker closed
2025-07-15 22:25:28,974 - distributed.nanny - ERROR - Worker process died unexpectedly
2025-07-15 22:25:29,076 - distributed.nanny - INFO - Closing Nanny at 'tls://127.0.0.1:35207'. Reason: nanny-close-gracefully
2025-07-15 22:25:29,077 - distributed.nanny - INFO - Nanny at 'tls://127.0.0.1:35207' closed.
