On 2024-09-28 01:11, Cordell Bloor wrote:
> "Killed" disappeared when I ran it myself in both cases. However, it did
> get further with vm.overcommit_memory=0:

Hm, odd. The OOM killer does kill rocfft-test, from dmesg:

> [ 633.776686] Out of memory: Killed process 4053 (rocfft-test)

and I would assume that this would be logged with "Killed" again.

Regardless:

> [ RUN ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_134217728_single_op_batch_1_istride_1_CI_ostride_1_CI_idist_134217728_odist_134217728_ioffset_0_0_ooffset_0_0
> [ OK ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_134217728_single_op_batch_1_istride_1_CI_ostride_1_CI_idist_134217728_odist_134217728_ioffset_0_0_ooffset_0_0
> (1771 ms)
> [ RUN ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_double_ip_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0
> clients/tests/accuracy_test.h:1214: Skipped
> needed_ramgb: 96, ramgb limit: 61.

This is the red flag right there, the test believes it has 61GiB of
memory available (and it's skipped because it needs 96GiB).

> [ SKIPPED ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_double_ip_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0
> (0 ms)

Now if the system has 64GiB physical and the GPU at some point is using
32GiB of it, then of course the assumption of having 61GiB virtual
available and attempting to use more than ~32GiB of it will eventually
lead to failure.

I'm not yet familiar with how the driver dynamically allocates memory in
6.10, but an obvious way to run into this error is to query memory
parameters at test start (GPU: 32GiB, System: 61GiB) and to assume that
these are static. And grepping for "ramgb", it seems that this is exactly
what is happening [8].

I don't think this is necessarily a bug in rocfft's tests, as this
assumption is correct for discrete GPUs. My first guess is that on hosts
with APUs, we'll need to set --R and --V in the test runner as you
initially suggested, with something like 45% of system memory each.

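To make the 45% idea concrete, here is a rough sketch of how a test runner
could derive the two limits from /proc/meminfo. The helper names are
hypothetical, and I'm assuming that --R and --V take whole-number GiB values
as separate arguments, which I have not verified against the rocfft-test
option parser.

```python
def parse_mem_total_kib(meminfo_text: str) -> int:
    """Return the MemTotal value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line looks like "MemTotal:       65536000 kB"
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def apu_limits_gib(mem_total_kib: int, fraction: float = 0.45) -> tuple[int, int]:
    """Return (ram_gib, vram_gib), each `fraction` of system memory, rounded down."""
    total_gib = mem_total_kib / (1024 * 1024)
    limit = int(total_gib * fraction)
    return limit, limit


def runner_args(mem_total_kib: int, fraction: float = 0.45) -> list[str]:
    """Build the argument list; the flag syntax is an assumption, see above."""
    ram_gib, vram_gib = apu_limits_gib(mem_total_kib, fraction)
    return ["--R", str(ram_gib), "--V", str(vram_gib)]


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        total_kib = parse_mem_total_kib(f.read())
    print(" ".join(["rocfft-test", *runner_args(total_kib)]))
```

On a 64GiB host this gives 28GiB for each limit, so the two together stay
below physical memory with some headroom for everything else.
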
Best,
Christian

[8]: https://sources.debian.org/src/rocfft/6.1.2-1/clients/tests/gtest_main.cpp/#L329