On 2024-09-28 01:11, Cordell Bloor wrote:
> "Killed" disappeared when I ran it myself in both cases. However, it did
> get further with vm.overcommit_memory=0:

Hm, odd. The OOM killer does kill rocfft-test, from dmesg:

> [  633.776686] Out of memory: Killed process 4053 (rocfft-test)

and I would assume that this would be logged with "Killed" again.

Regardless:

> [ RUN      ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_134217728_single_op_batch_1_istride_1_CI_ostride_1_CI_idist_134217728_odist_134217728_ioffset_0_0_ooffset_0_0
> [       OK ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_134217728_single_op_batch_1_istride_1_CI_ostride_1_CI_idist_134217728_odist_134217728_ioffset_0_0_ooffset_0_0
>  (1771 ms)
> [ RUN      ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_double_ip_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0
> clients/tests/accuracy_test.h:1214: Skipped
> needed_ramgb: 96, ramgb limit: 61.

This is the red flag right there: the test believes it has 61GiB of memory
available (and it skips this case because it needs 96GiB).

> [  SKIPPED ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_double_ip_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0
>  (0 ms)

Now if the system has 64GiB physical and the GPU at some point is using 32GiB 
of it, then of course the assumption of having 61GiB virtual available and 
attempting to use more than ~32GiB of it will eventually lead to failure.
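
To spell out that arithmetic with the numbers above, here is a toy check in
Python. It is not how rocfft computes anything; the function name is made up,
and it ignores whatever the kernel and other processes use:

```python
def host_headroom_gib(physical_gib, gpu_in_use_gib):
    """GiB left for the host when the GPU's share comes out of the same RAM."""
    return physical_gib - gpu_in_use_gib

assumed_limit_gib = 61                   # "ramgb limit" reported by the test
headroom_gib = host_headroom_gib(64, 32) # 64GiB physical, 32GiB used by the GPU
overshoot_gib = assumed_limit_gib - headroom_gib

print(f"headroom: {headroom_gib} GiB, assumed: {assumed_limit_gib} GiB, "
      f"overshoot: {overshoot_gib} GiB")
```

So any test that is allowed through because it fits in 61GiB, but needs more
than roughly 32GiB, can end up in front of the OOM killer.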

I'm not yet familiar with how the driver dynamically allocates memory in 6.10,
but an obvious way to run into this error is to query memory parameters at test
start (GPU: 32GiB, System: 61GiB) and to assume that these are static.
Grepping for "ramgb" suggests that this is exactly what is happening [8].

I don't think this is necessarily a bug in rocfft's tests, as this assumption 
is correct for discrete GPUs.

My first guess is that on hosts with APUs, we'll need to set --R and --V in the 
test runner as you initially suggested, with something like 45% of system 
memory each.
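
A rough sketch of how the runner could derive those limits, in Python. The
helper names are hypothetical, reading MemTotal from /proc/meminfo is my
assumption about where the number would come from, and I'm assuming --R and --V
each take a whole number of GiB as a separate argument:

```python
import re

def mem_total_gib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted to GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found in meminfo text")
    return int(match.group(1)) / (1024 * 1024)  # value is in kB

def apu_limit_args(total_gib, fraction=0.45):
    """Build --R/--V arguments, each capped at `fraction` of system memory."""
    limit = max(1, int(total_gib * fraction))   # round down, at least 1GiB
    return ["--R", str(limit), "--V", str(limit)]

def build_command(meminfo_path="/proc/meminfo"):
    """Assemble a rocfft-test command line for an APU host (assumed syntax)."""
    with open(meminfo_path) as f:
        total = mem_total_gib(f.read())
    return ["rocfft-test"] + apu_limit_args(total)
```

On a host reporting 64GiB this would yield a limit of 28GiB each for --R and
--V, which leaves about 10% of memory for everything else.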

Best,
Christian

[8]: https://sources.debian.org/src/rocfft/6.1.2-1/clients/tests/gtest_main.cpp/#L329
