On Tue, 2025-07-15 at 22:08 +0100, Rebecca N. Palmer wrote:
> > I could get test_nanny to never throw an error in the loop by adding
> > await asyncio.sleep(0.1) into the function before it started trying to
> > use the cluster
>
> That's plausibly a better idea, but I haven't tried it.

Unfortunately it doesn't work on the ppc64el porterbox platti.debian.org.
I still get a fair number of timeout test failures when running
test_nanny in a loop.

I did manage to capture a little bit earlier from where the logs start
to look different for a failed case.

When it's about to fail I get something like the following. The first
error is worker-handle-scheduler-connection-broken. Then it waits for
the nanny to shut down, and after that times out it starts to spew
stack traces and complain about 0-byte TLS responses.

2025-07-15 22:25:26,920 - distributed.core - INFO - Connection to tls://127.0.0.1:51942 has been closed.
2025-07-15 22:25:26,920 - distributed.scheduler - INFO - Remove worker addr: tls://127.0.0.1:46743 name: 0 (stimulus_id='handle-worker-cleanup-1752618326.9205813')
2025-07-15 22:25:26,921 - distributed.core - INFO - Starting established connection to tls://127.0.0.1:42399
2025-07-15 22:25:26,922 - distributed.core - INFO - Connection to tls://127.0.0.1:42399 has been closed.
2025-07-15 22:25:26,922 - distributed.worker - INFO - Stopping worker at tls://127.0.0.1:46743. Reason: worker-handle-scheduler-connection-broken
2025-07-15 22:25:26,969 - distributed.nanny - INFO - Closing Nanny gracefully at 'tls://127.0.0.1:35207'. Reason: worker-handle-scheduler-connection-broken
2025-07-15 22:25:26,970 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-07-15 22:25:26,971 - distributed.nanny - INFO - Worker closed
2025-07-15 22:25:28,974 - distributed.nanny - ERROR - Worker process died unexpectedly
2025-07-15 22:25:29,076 - distributed.nanny - INFO - Closing Nanny at 'tls://127.0.0.1:35207'. Reason: nanny-close-gracefully
2025-07-15 22:25:29,077 - distributed.nanny - INFO - Nanny at 'tls://127.0.0.1:35207' closed.
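
For comparing a failing run against a passing one, it may help to find where the log stalls rather than reading it by eye. The following is only a rough sketch, not part of distributed or its test suite: the helper names `parse_line` and `largest_gap` are made up for illustration, and it assumes every entry has the "timestamp - logger - LEVEL - message" shape seen in the excerpt. It uses three entries copied from the log above and reports the longest pause between consecutive entries.

```python
from datetime import datetime

# Three entries copied from the excerpt above (the tail of the failing run).
LOG = """\
2025-07-15 22:25:26,971 - distributed.nanny - INFO - Worker closed
2025-07-15 22:25:28,974 - distributed.nanny - ERROR - Worker process died unexpectedly
2025-07-15 22:25:29,076 - distributed.nanny - INFO - Closing Nanny at 'tls://127.0.0.1:35207'. Reason: nanny-close-gracefully
"""


def parse_line(line):
    """Split one 'timestamp - logger - LEVEL - message' entry (hypothetical helper)."""
    # maxsplit=3 keeps any later " - " inside the message intact.
    stamp, logger, level, message = line.split(" - ", 3)
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    return when, logger, level, message


def largest_gap(text):
    """Return (seconds, message_before, message_after) for the longest pause."""
    entries = [parse_line(line) for line in text.splitlines() if line.strip()]
    best = None
    for prev, cur in zip(entries, entries[1:]):
        seconds = (cur[0] - prev[0]).total_seconds()
        if best is None or seconds > best[0]:
            best = (seconds, prev[3], cur[3])
    return best


gap = largest_gap(LOG)
print(f"{gap[0]:.3f}s between {gap[1]!r} and {gap[2]!r}")
```

On these entries the longest pause is the roughly two seconds between "Worker closed" and "Worker process died unexpectedly"; everything else in the excerpt happens within about 100 ms of its neighbour.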