https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280

--- Comment #3 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Hi there, I now have

sys-devel/gcc-16.0.0_p20251026,
sys-devel/gcc-15.2.1_p20251018,
sys-devel/gcc-14.3.1_p20251017

and clang version 21.1.4 installed.

I may do further tests tomorrow. But, last time, the code also ran without any
errors.

I don't know where these libgomp errors are coming from. But before we do
further tests:


Given that teams distribute is something very different from parallel for, do
we agree that


 #pragma omp target teams distribute parallel for collapse(2) shared(A,B,C) device(dev)
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1] = sum;
        }


should yield the same numbers as this


 #pragma omp target teams distribute shared(A,B,C) device(dev)
    for (size_t i = 0; i < rows; ++i)
    {
        #pragma omp parallel for shared(A,B,C)
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1] = sum;
        }
    }

and this

 #pragma omp target parallel for collapse(2) shared(A,B,C) device(dev)
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1] = sum;
        }

Just so that we are on the same page here...
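
To make it concrete, here is a minimal standalone sketch of what I mean: it
computes the same product with all three pragma variants and compares the
results. This is not my actual code; plain row-major double buffers and
explicit map clauses stand in for the T/dpdata/stride members and for whatever
mapping my real code sets up around shared(A,B,C), and the names
mm_v1/mm_v2/mm_v3 are just made up for this example.

#include <cstddef>
#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

// Variant 1: target teams distribute parallel for collapse(2)
static void mm_v1(const double *A, const double *B, double *C,
                  std::size_t rows, std::size_t cols, std::size_t inner, int dev)
{
    #pragma omp target teams distribute parallel for collapse(2) device(dev) \
            map(to: A[0:rows*inner], B[0:inner*cols]) map(from: C[0:rows*cols])
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (std::size_t k = 0; k < inner; ++k)
                sum += A[i*inner+k] * B[k*cols+j];
            C[i*cols+j] = sum;
        }
}

// Variant 2: teams distribute over i, nested parallel for over j
static void mm_v2(const double *A, const double *B, double *C,
                  std::size_t rows, std::size_t cols, std::size_t inner, int dev)
{
    #pragma omp target teams distribute device(dev) \
            map(to: A[0:rows*inner], B[0:inner*cols]) map(from: C[0:rows*cols])
    for (std::size_t i = 0; i < rows; ++i)
    {
        #pragma omp parallel for
        for (std::size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (std::size_t k = 0; k < inner; ++k)
                sum += A[i*inner+k] * B[k*cols+j];
            C[i*cols+j] = sum;
        }
    }
}

// Variant 3: target parallel for collapse(2), no teams
static void mm_v3(const double *A, const double *B, double *C,
                  std::size_t rows, std::size_t cols, std::size_t inner, int dev)
{
    #pragma omp target parallel for collapse(2) device(dev) \
            map(to: A[0:rows*inner], B[0:inner*cols]) map(from: C[0:rows*cols])
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (std::size_t k = 0; k < inner; ++k)
                sum += A[i*inner+k] * B[k*cols+j];
            C[i*cols+j] = sum;
        }
}

int main()
{
    const std::size_t rows = 64, cols = 48, inner = 32;
    const int dev = omp_get_default_device();

    std::vector<double> A(rows*inner), B(inner*cols),
                        C1(rows*cols), C2(rows*cols), C3(rows*cols);
    for (std::size_t n = 0; n < A.size(); ++n) A[n] = 0.001 * (double)n;
    for (std::size_t n = 0; n < B.size(); ++n) B[n] = 0.002 * (double)n;

    mm_v1(A.data(), B.data(), C1.data(), rows, cols, inner, dev);
    mm_v2(A.data(), B.data(), C2.data(), rows, cols, inner, dev);
    mm_v3(A.data(), B.data(), C3.data(), rows, cols, inner, dev);

    // The three variants only differ in how the i/j iteration space is
    // scheduled across teams/threads, so up to floating-point rounding in
    // the k loop they should produce the same numbers.
    double maxdiff = 0;
    for (std::size_t n = 0; n < rows*cols; ++n)
    {
        maxdiff = std::fmax(maxdiff, std::fabs(C1[n] - C2[n]));
        maxdiff = std::fmax(maxdiff, std::fabs(C1[n] - C3[n]));
    }
    std::printf("max difference between the three variants: %g\n", maxdiff);
    return 0;
}

(I would compile this with g++ -O2 -fopenmp plus the usual offload options for
the device in question, and analogously with clang.)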

What I can say is that in host code, I've always used collapse(2) on this
nested loop (roughly as sketched below) and it worked there. On my system,
when I ran it a few weeks ago, the parallel for collapse(2) version also
worked, just not the teams distribute parallel for collapse(2) one...
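
For reference, the host version I mean is essentially the following (same
plain-buffer simplification and made-up names as in the sketch above):

#include <cstddef>

// Host-only variant of the same nested loop: plain parallel for collapse(2),
// no target/teams; A and B are row-major rows*inner and inner*cols buffers.
static void mm_host(const double *A, const double *B, double *C,
                    std::size_t rows, std::size_t cols, std::size_t inner)
{
    #pragma omp parallel for collapse(2)
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (std::size_t k = 0; k < inner; ++k)
                sum += A[i*inner+k] * B[k*cols+j];
            C[i*cols+j] = sum;
        }
}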

But I will test again tomorrow (Portage wants to emerge the compilers again
due to changed USE flags, which takes time...).
