I have badly misdiagnosed this problem.
On 2023-11-18 00:52, Cordell Bloor wrote:
> The rocblas-test executable sets a five-second alarm signal before it
> executes some tests. If the alarm goes off before the test completes,
> rocblas-test will abort, under the assumption that there was deadlock
> that prevented the test from completing.
The default alarm timeout for tests is 500 seconds. There is a 5 second
alarm signal used for the fancy multithreaded logging system in
rocblas-test, but that's not the alarm that was triggered. The logs
clearly show the test was running for ~500 seconds.
> On slow hosts, such as lyra.rocm.debian.net, the timeout set for the
> alarm is insufficient to complete the test even when everything is
> functioning normally. This problem can be observed in the test logs for
> amd64+gfx900 [1].
The problem wasn't observed on my MI25 test system a few months ago [2].
However, I was wrong in believing this discrepancy was because Lyra is
slow. When I ran the tests manually on Lyra in a qemu container, I
observed the exact same behaviour, but could see that there was an
amdgpu driver timeout that caused a GPU reset. This occurred at exactly
the same point in the test suite as on the CI.
My question is now whether this is specific to Lyra or if it applies to
all systems with Vega 10 GPUs.
> There is no single value that would be appropriate for the alarm timeout
> on every machine, so the timeout should either be configurable at
> runtime or entirely removed from the rocblas-test utility.
The timeout can be configured by setting the environment variable
ROCBLAS_TEST_TIMEOUT=<seconds> or disabled by setting
ROCBLAS_TEST_TIMEOUT=0.
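As a rough illustration of the above, here is a minimal Python sketch of a wrapper that sets the variable before launching the tests. The helper names (rocblas_test_env, run_rocblas_test) are hypothetical, and it assumes rocblas-test is on PATH and run with no further arguments; only the ROCBLAS_TEST_TIMEOUT variable and its "0 disables" meaning come from the description above.

```python
import os
import subprocess


def rocblas_test_env(timeout_seconds, base_env=None):
    """Return a copy of the environment with ROCBLAS_TEST_TIMEOUT set.

    A value of 0 disables the timeout, as described above.
    """
    if timeout_seconds < 0:
        raise ValueError("timeout must be >= 0")
    # Copy so the caller's environment is never modified in place.
    env = dict(os.environ if base_env is None else base_env)
    env["ROCBLAS_TEST_TIMEOUT"] = str(int(timeout_seconds))
    return env


def run_rocblas_test(timeout_seconds, executable="rocblas-test"):
    """Hypothetical wrapper: run the test binary with the timeout override."""
    result = subprocess.run([executable], env=rocblas_test_env(timeout_seconds))
    return result.returncode
```

For example, run_rocblas_test(0) would run the suite with the alarm disabled, and run_rocblas_test(1800) with an arbitrarily chosen longer limit.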
Sincerely,
Cory Bloor
[1]: https://ci.rocm.debian.net/data/autopkgtest/testing/amd64+gfx900/r/rocblas/913/log.gz
[2]: https://slerp.xyz/rocm/logs/full/2023-08-22-gfx900.log