https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280
--- Comment #3 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Hi there, I now have gcc
sys-devel/gcc-16.0.0_p20251026,
sys-devel/gcc-15.2.1_p20251018,
sys-devel/gcc-14.3.1_p20251017
and clang version 21.1.4 installed.
I may do further tests tomorrow. But last time, the code also ran without any
errors.
I don't know where these libgomp errors are coming from. But before we do
further tests:
Given that teams distribute is something very different from parallel for,
do we agree that
#pragma omp target teams distribute parallel for collapse(2) shared(A,B,C) device(dev)
for (size_t i = 0; i < rows; ++i)
    for (size_t j = 0; j < cols; ++j)
    {
        T sum = 0;
        #pragma omp simd reduction(+:sum)
        for (size_t k = 0; k < inner_dim; ++k)
        {
            sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
        }
        C.dpdata[i*Cstr0+j*Cstr1] = sum;
    }
should yield the same numbers as this
#pragma omp target teams distribute shared(A,B,C) device(dev)
for (size_t i = 0; i < rows; ++i)
{
    #pragma omp parallel for shared(A,B,C)
    for (size_t j = 0; j < cols; ++j)
    {
        T sum = 0;
        #pragma omp simd reduction(+:sum)
        for (size_t k = 0; k < inner_dim; ++k)
        {
            sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
        }
        C.dpdata[i*Cstr0+j*Cstr1] = sum;
    }
}
and this
#pragma omp target parallel for collapse(2) shared(A,B,C) device(dev)
for (size_t i = 0; i < rows; ++i)
    for (size_t j = 0; j < cols; ++j)
    {
        T sum = 0;
        #pragma omp simd reduction(+:sum)
        for (size_t k = 0; k < inner_dim; ++k)
        {
            sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
        }
        C.dpdata[i*Cstr0+j*Cstr1] = sum;
    }
Just so we are on the same page here....
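For completeness, here is a self-contained sketch of what I mean, with the three
variants side by side. The names and sizes (a, b, c1..c3, rows, cols, inner_dim,
dev = 0) are made up for the example, and I use explicit map clauses instead of
the mapping my real code does elsewhere, so treat it as an illustration rather
than my actual code:

// Standalone sketch: three offload variants of the same matrix multiply,
// checked against each other. Buffers are plain contiguous row-major arrays
// instead of the dpdata/stride members of my real matrix class.
#include <cstdio>
#include <cmath>
#include <vector>

int main()
{
    const size_t rows = 64, cols = 48, inner_dim = 32;
    const int dev = 0; // assumed offload device id
    std::vector<double> A(rows * inner_dim), B(inner_dim * cols);
    std::vector<double> C1(rows * cols), C2(rows * cols), C3(rows * cols);

    for (size_t i = 0; i < A.size(); ++i) A[i] = 1.0 + i % 7;
    for (size_t i = 0; i < B.size(); ++i) B[i] = 2.0 - (i % 5) * 0.25;

    double *a = A.data(), *b = B.data();
    double *c1 = C1.data(), *c2 = C2.data(), *c3 = C3.data();

    // Variant 1: teams distribute parallel for collapse(2)
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: a[0:rows*inner_dim], b[0:inner_dim*cols]) \
        map(from: c1[0:rows*cols]) device(dev)
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
                sum += a[i*inner_dim + k] * b[k*cols + j];
            c1[i*cols + j] = sum;
        }

    // Variant 2: teams distribute over i, parallel for over j
    #pragma omp target teams distribute \
        map(to: a[0:rows*inner_dim], b[0:inner_dim*cols]) \
        map(from: c2[0:rows*cols]) device(dev)
    for (size_t i = 0; i < rows; ++i)
    {
        #pragma omp parallel for
        for (size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
                sum += a[i*inner_dim + k] * b[k*cols + j];
            c2[i*cols + j] = sum;
        }
    }

    // Variant 3: single parallel region with collapse(2)
    #pragma omp target parallel for collapse(2) \
        map(to: a[0:rows*inner_dim], b[0:inner_dim*cols]) \
        map(from: c3[0:rows*cols]) device(dev)
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            double sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
                sum += a[i*inner_dim + k] * b[k*cols + j];
            c3[i*cols + j] = sum;
        }

    // Compare the three results against each other
    for (size_t i = 0; i < rows * cols; ++i)
        if (std::fabs(C1[i] - C2[i]) > 1e-12 || std::fabs(C1[i] - C3[i]) > 1e-12)
        {
            std::printf("mismatch at %zu: %g %g %g\n", i, C1[i], C2[i], C3[i]);
            return 1;
        }
    std::printf("all three variants agree\n");
    return 0;
}

I would build this with something like g++ -O2 -fopenmp plus whatever offload
options the toolchain needs; if the three variants really are equivalent, the
final check should never trigger.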
What I can say is that on host code, I've always used collapse(2) on this
nested loop (sketched below) and it worked there. On my system, when I ran it a
few weeks ago, the parallel for collapse(2) version also worked, just not the
teams distribute parallel for collapse(2)...
But I will test again tomorrow (Portage wants to emerge the compilers again due
to changed USE flags, which takes time...)