Bug#1056171: librocblas0-tests: rocblas-test alarm timeout on slow hosts

Cordell Bloor Sun, 19 Nov 2023 00:57:15 -0800

I have badly misdiagnosed this problem.

On 2023-11-18 00:52, Cordell Bloor wrote:

The rocblas-test executable sets a five-second alarm signal before it
executes some tests. If the alarm goes off before the test completes,
rocblas-test will abort, under the assumption that there was a deadlock
that prevented the test from completing.

The default alarm timeout for tests is 500 seconds. There is a 5 second
alarm signal used for the fancy multithreaded logging system in
rocblas-test, but that's not the alarm that was triggered. The logs
clearly show the test was running for ~500 seconds.

On slow hosts, such as lyra.rocm.debian.net, the timeout set for the
alarm is insufficient to complete the test even when everything is
functioning normally. This problem can be observed in the test logs for
amd64+gfx900 [1].

The problem wasn't observed on my MI25 test system a few months ago [2].
However, I was wrong in believing this discrepancy was because Lyra is
slow. When I ran the tests manually on Lyra in a qemu container, I
observed the exact same behaviour, but could see that there was an
amdgpu driver timeout that caused a GPU reset. This occurred at exactly
the same point in the test suite as on the CI.

My question is now whether this is specific to Lyra or if it applies to
all systems with Vega 10 GPUs.

There is no single value that would be appropriate for the alarm timeout
on every machine, so the timeout should either be configurable at
runtime or entirely removed from the rocblas-test utility.

The timeout can be configured by setting the environment variable
ROCBLAS_TEST_TIMEOUT=<seconds> or disabled by setting
ROCBLAS_TEST_TIMEOUT=0.
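
For anyone unfamiliar with this kind of mechanism, here is a rough
sketch in Python of how an alarm timeout driven by that variable could
behave. It is not the rocblas-test code (which I have not reproduced
here); the function names read_timeout and run_with_alarm are made up
for illustration, and the only facts taken from above are the variable
name, the 500 second default, and 0 meaning disabled. It relies on
SIGALRM, so it assumes a Unix-like host.

```python
import os
import signal

# Default taken from the message above; everything else is illustrative.
DEFAULT_TIMEOUT_SECONDS = 500


def read_timeout(env=os.environ):
    """Return the alarm timeout in seconds; 0 means the alarm is disabled."""
    raw = env.get("ROCBLAS_TEST_TIMEOUT")
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    value = int(raw)
    if value < 0:
        raise ValueError("ROCBLAS_TEST_TIMEOUT must be >= 0")
    return value


def run_with_alarm(test, timeout):
    """Run test() and raise TimeoutError if it takes longer than timeout seconds."""
    def on_alarm(signum, frame):
        raise TimeoutError(f"test exceeded {timeout} seconds")

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(timeout)  # alarm(0) schedules nothing, so 0 disables the timeout
    try:
        return test()
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

With something like this, run_with_alarm(some_test, read_timeout())
would abort a hung test after the configured number of seconds, and
would never abort when the variable is set to 0.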

Sincerely,
Cory Bloor

[1]: https://ci.rocm.debian.net/data/autopkgtest/testing/amd64+gfx900/r/rocblas/913/log.gz

[2]: https://slerp.xyz/rocm/logs/full/2023-08-22-gfx900.log
