On Tue, 2025-07-15 at 12:54 +0100, Rebecca N. Palmer wrote:
> dask.distributed has both an explicit autopkgtest that is marked as
> flaky, and a pybuild-autopkgtest that isn't, that run mostly the same
> tests.  (The git log says this is to have both a needs-internet and a
> non-needs-internet test, but the set of tests also differs because
> they have different Depends.  In particular, the
> test_serialize_scipy_sparse failures
> (https://github.com/dask/distributed/commit/94222c0fc49c3ad14353611ecdc2c699b97bf8d4)
> are *not* part of this bug because that is only run by the
> marked-as-flaky test.)  Both of them *already* try 5 times (see
> debian/run-tests).

Yes. I found a fix for the scipy test here:
https://github.com/dask/distributed/pull/8977
and pushed it into the dask.distributed repository. It makes the
float32 errors with scipy go away.

> and/or tests/test_computations.py::test_computations_futures.
> e.g.
> https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60904054/
>
> - Timeout in tests/test_tls_functional.py::test_retire_workers.
> e.g.
> https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60794087/
>

The other issue looks like a race condition when starting a distributed
worker cluster using TLS. Sometimes initialization takes a little too
long and the test connects before the cluster is ready.

For example, running one test with this loop in an sbuild chroot on an
x86_64 laptop:

    for a in $(seq 30) ; do
        runuser -u sbuild -- \
            python3 -m pytest -k test_nanny \
            distributed/tests/test_tls_functional.py \
            --pdb --capture=no ;
    done

I'd get 1-3 failures out of 30 runs. Each failure was either a timeout
error or a TLS protocol error like this:

    2025-07-15 16:42:14,661 - distributed.comm.tcp - WARNING - Listener on 'tls://127.0.0.1:41241': TLS handshake failed with remote 'tls://127.0.0.1:46086': [SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1029)

I could get test_nanny to never throw an error in the loop by adding

    await asyncio.sleep(0.1)

to the function before it starts trying to use the cluster, but I'm not
sure that's necessary given all the pytest and autopkgtest
configuration to rerun flaky tests.

Should I push a new dask.distributed with the scipy fix and see if the
current flaky test handling is sufficient?

Diane
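
P.S. If the fixed sleep ever does turn out to be needed, one alternative
would be to poll until the listener accepts connections instead of
guessing a delay. What follows is only a rough sketch under assumptions:
wait_until_listening and demo are hypothetical names, not part of
distributed, and the check is plain TCP connectivity only. It does not
attempt a TLS handshake, so it may well not cover the
UNEXPECTED_EOF_WHILE_READING case above.

```python
import asyncio
import socket


async def wait_until_listening(host, port, timeout=5.0, interval=0.05):
    """Poll until a TCP connect to (host, port) succeeds, or time out.

    Hypothetical helper, not a distributed API. It only proves the port
    accepts connections; it says nothing about TLS readiness.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        try:
            reader, writer = await asyncio.open_connection(host, port)
        except OSError:
            # Connection refused (or similar): not listening yet.
            if loop.time() >= deadline:
                raise TimeoutError(f"{host}:{port} not ready after {timeout}s")
            await asyncio.sleep(interval)
        else:
            writer.close()
            await writer.wait_closed()
            return


async def demo():
    # Stand-in for a slow-starting cluster: bind a port now, but only
    # start listening on it after a short delay.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    host, port = sock.getsockname()

    async def handle(reader, writer):
        writer.close()

    async def late_start():
        await asyncio.sleep(0.2)
        return await asyncio.start_server(handle, sock=sock)

    starter = asyncio.create_task(late_start())
    await wait_until_listening(host, port)  # retries until the listener is up
    server = await starter
    server.close()
    await server.wait_closed()
    return True


if __name__ == "__main__":
    asyncio.run(demo())
```

Compared with a fixed sleep, this waits only as long as it has to, at
the cost of an extra helper in the test.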