On Tue, 2025-07-15 at 12:54 +0100, Rebecca N. Palmer wrote:
> dask.distributed has both an explicit autopkgtest that is marked as
> flaky, and a pybuild-autopkgtest that isn't, that run mostly the same
> tests.  (The git log says this is to have both a needs-internet and a
> non-needs-internet test, but the set of tests also differs because
> they have different Depends.  In particular, the
> test_serialize_scipy_sparse failures
> (https://github.com/dask/distributed/commit/94222c0fc49c3ad14353611ecdc2c699b97bf8d4)
> are *not* part of this bug because that is only run by the
> marked-as-flaky test.)  Both of them *already* try 5 times (see
> debian/run-tests).

Yes. 

I found a fix for the scipy test here:
https://github.com/dask/distributed/pull/8977
and pushed it into the dask.distributed repository. It makes the
float32 errors with scipy go away.

> and/or tests/test_computations.py::test_computations_futures.
> e.g. 
> https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60904054/
> 
> - Timeout in tests/test_tls_functional.py::test_retire_workers.
> e.g. 
> https://ci.debian.net/packages/d/dask.distributed/testing/ppc64el/60794087/
> 

The other issue looks like a race condition starting a distributed
worker cluster using TLS. Sometimes initialization takes a little too
long and the test connects before the cluster is ready.

For example, I ran one test with this loop in an sbuild chroot on an
x86_64 laptop:

for a in $(seq 30 ) ; do 
  runuser -u sbuild -- \
     python3 -m pytest -k test_nanny \
       distributed/tests/test_tls_functional.py \
       --pdb --capture=no ;
done

I'd get 1-3 failures out of 30 runs, each either a timeout error or a
TLS protocol error like this:

2025-07-15 16:42:14,661 - distributed.comm.tcp - WARNING - Listener on
'tls://127.0.0.1:41241': TLS handshake failed with remote
'tls://127.0.0.1:46086': [SSL: UNEXPECTED_EOF_WHILE_READING] EOF
occurred in violation of protocol (_ssl.c:1029)
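
A rough sketch of how the failure count from a loop like the one above
could be tallied automatically. This is not part of the packaging;
count_failures is a hypothetical helper, and --pdb is left out because
the runs are non-interactive.

```python
import subprocess
import sys


def count_failures(cmd: list[str], runs: int) -> int:
    """Run cmd `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        # Discard output and detach stdin; only the exit status matters here.
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

Called as count_failures([sys.executable, "-m", "pytest", "-k",
"test_nanny", "distributed/tests/test_tls_functional.py"], 30), it
would return the number of failing runs out of 30.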

I could get test_nanny to never fail in the loop by adding
await asyncio.sleep(0.1) to the function before it starts using the
cluster, but I'm not sure that's necessary given all the pytest and
autopkgtest configuration to rerun flaky tests.
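
A minimal, self-contained sketch of the race and the sleep workaround
described above. It does not use distributed at all: FakeCluster and
run_test are hypothetical names that only simulate a cluster which needs
some time before it accepts connections.

```python
import asyncio


class FakeCluster:
    """Hypothetical stand-in for a TLS worker cluster that starts slowly."""

    def __init__(self, startup_delay: float) -> None:
        self.startup_delay = startup_delay
        self.ready = False

    async def start(self) -> None:
        # Simulate slow initialization (assumption: readiness is time-based).
        await asyncio.sleep(self.startup_delay)
        self.ready = True

    def connect(self) -> str:
        # Connecting too early fails, loosely like the failed handshake.
        if not self.ready:
            raise ConnectionError("cluster not ready")
        return "connected"


async def run_test(startup_delay: float, settle: float) -> str:
    cluster = FakeCluster(startup_delay)
    startup = asyncio.create_task(cluster.start())
    try:
        # The workaround: pause briefly before using the cluster.
        await asyncio.sleep(settle)
        return cluster.connect()
    except ConnectionError as exc:
        return f"failed: {exc}"
    finally:
        # Let the startup task finish so nothing is left pending.
        await startup
```

With a settle time of 0 the simulated test connects before startup has
finished and fails; with a settle time longer than the startup delay it
connects. A fixed sleep only hides the race while startup stays shorter
than the sleep, which is why the retry configuration may still matter.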

Should I push a new dask.distributed with the scipy fix and see if the
current flaky test handling is sufficient?

Diane
