https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280

--- Comment #3 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Hi there, I now have

sys-devel/gcc-16.0.0_p20251026,
sys-devel/gcc-15.2.1_p20251018,
sys-devel/gcc-14.3.1_p20251017

and clang version 21.1.4 installed.

I may do further tests tomorrow. But, last time, the code also ran without any
errors.

I don't know where these libgomp errors are coming from. But before we do
further tests:


Given that teams distribute is something very different from parallel for, do
we agree that


 #pragma omp target teams distribute parallel for collapse(2) shared(A,B,C) device(dev)
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1] = sum;
        }


should yield the same numbers as this


 #pragma omp target teams distribute shared(A,B,C) device(dev)
    for (size_t i = 0; i < rows; ++i)
    {
        #pragma omp parallel for shared(A,B,C)
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1] = sum;
        }
    }

and this

 #pragma omp target parallel for collapse(2) shared(A,B,C) device(dev)
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1] = sum;
        }

Just so that we are on the same page here...
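
To make it concrete, here is a minimal standalone sketch of what I mean: it
computes the same product with all three pragma variants and compares the
results. This is not my actual code; plain row-major double buffers and
explicit map clauses stand in for the T/dpdata/stride members and for whatever
mapping my real code sets up around shared(A,B,C), and the names
mm_v1/mm_v2/mm_v3 are just made up for this example.

#include <cstddef>
#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

// Variant 1: target teams distribute parallel for collapse(2)
static void mm_v1(const double *A, const double *B, double *C,
                  std::size_t rows, std::size_t cols, std::size_t inner, int dev)
{
    #pragma omp target teams distribute parallel for collapse(2) device(dev) \
            map(to: A[0:rows*inner], B[0:inner*cols]) map(from: C[0:rows*cols])
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (std::size_t k = 0; k < inner; ++k)
                sum += A[i*inner+k] * B[k*cols+j];
            C[i*cols+j] = sum;
        }
}

// Variant 2: teams distribute over i, nested parallel for over j
static void mm_v2(const double *A, const double *B, double *C,
                  std::size_t rows, std::size_t cols, std::size_t inner, int dev)
{
    #pragma omp target teams distribute device(dev) \
            map(to: A[0:rows*inner], B[0:inner*cols]) map(from: C[0:rows*cols])
    for (std::size_t i = 0; i < rows; ++i)
    {
        #pragma omp parallel for
        for (std::size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (std::size_t k = 0; k < inner; ++k)
                sum += A[i*inner+k] * B[k*cols+j];
            C[i*cols+j] = sum;
        }
    }
}

// Variant 3: target parallel for collapse(2), no teams
static void mm_v3(const double *A, const double *B, double *C,
                  std::size_t rows, std::size_t cols, std::size_t inner, int dev)
{
    #pragma omp target parallel for collapse(2) device(dev) \
            map(to: A[0:rows*inner], B[0:inner*cols]) map(from: C[0:rows*cols])
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (std::size_t k = 0; k < inner; ++k)
                sum += A[i*inner+k] * B[k*cols+j];
            C[i*cols+j] = sum;
        }
}

int main()
{
    const std::size_t rows = 64, cols = 48, inner = 32;
    const int dev = omp_get_default_device();

    std::vector<double> A(rows*inner), B(inner*cols),
                        C1(rows*cols), C2(rows*cols), C3(rows*cols);
    for (std::size_t n = 0; n < A.size(); ++n) A[n] = 0.001 * (double)n;
    for (std::size_t n = 0; n < B.size(); ++n) B[n] = 0.002 * (double)n;

    mm_v1(A.data(), B.data(), C1.data(), rows, cols, inner, dev);
    mm_v2(A.data(), B.data(), C2.data(), rows, cols, inner, dev);
    mm_v3(A.data(), B.data(), C3.data(), rows, cols, inner, dev);

    // The three variants only differ in how the i/j iteration space is
    // scheduled across teams/threads, so up to floating-point rounding in
    // the k loop they should produce the same numbers.
    double maxdiff = 0;
    for (std::size_t n = 0; n < rows*cols; ++n)
    {
        maxdiff = std::fmax(maxdiff, std::fabs(C1[n] - C2[n]));
        maxdiff = std::fmax(maxdiff, std::fabs(C1[n] - C3[n]));
    }
    std::printf("max difference between the three variants: %g\n", maxdiff);
    return 0;
}

(I would compile this with g++ -O2 -fopenmp plus the usual offload options for
the device in question, and analogously with clang.)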

What I can say is that in host code, I've always used collapse(2) on this
nested loop (roughly as sketched below) and it worked there. On my system,
when I ran it a few weeks ago, the parallel for collapse(2) version also
worked, just not the teams distribute parallel for collapse(2) one...
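
For reference, the host version I mean is essentially the following (same
plain-buffer simplification and made-up names as in the sketch above):

#include <cstddef>

// Host-only variant of the same nested loop: plain parallel for collapse(2),
// no target/teams; A and B are row-major rows*inner and inner*cols buffers.
static void mm_host(const double *A, const double *B, double *C,
                    std::size_t rows, std::size_t cols, std::size_t inner)
{
    #pragma omp parallel for collapse(2)
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (std::size_t k = 0; k < inner; ++k)
                sum += A[i*inner+k] * B[k*cols+j];
            C[i*cols+j] = sum;
        }
}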

But I will test again tomorrow (Portage wants to emerge the compilers again
due to changed USE flags, which takes time...).
