https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121818
Bug ID: 121818
Summary: miscompilation of parallel for reduction on nvptx
target in a cholesky decomposition
Product: gcc
Version: 15.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: schulz.benjamin at googlemail dot com
Target Milestone: ---
Created attachment 62320
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62320&action=edit
arraytest.xnvptx-none.ii
Hi there, I have a program which performs a Cholesky decomposition on the target device;
the function starts at line 1451 of mathfunctions_mpi.h.
With gcc 15.2 and the options -fopenmp -foffload=nvptx-none -g
-fno-stack-protector -Wextra -fno-strict-aliasing -fwrapv
-fno-aggressive-loop-optimizations -save-temps -Wall
the function yields the wrong output for the test matrix
210 -92 68 -33 -34 -4 118 -6
-92 318 -100 130 -153 -64 160 33
68 -100 204 -96 41 -69 -16 -26
-33 130 -96 338 -152 -51 12 22
-34 -153 41 -152 346 11 -30 -25
-4 -64 -69 -51 11 175 -79 5
118 160 -16 12 -30 -79 320 7
-6 33 -26 22 -25 5 7 239
I get:
14.4914 0 0 0 0 0 0 0
-6.3486 16.6642 0 0 0 0 0 0
4.69245 -4.2132 12.8152 0 0 0 0 0
-2.27722 6.9336 -4.37774 16.2965 0 0 0 0
-2.34622 -10.0752 0.74604 -5.16795 8.80012 0 0 0
-0.276026 -3.94573 -6.58037 -3.257 -10.6418 -nan 0 0
8.14277 12.7036 -0.0535879 -3.54515 25.8667 -nan -nan 0
-0.414039 1.82256 -1.27804 0.173372 1.53195 -nan -nan -nan
The function contains a decision whether it executes on the GPU or the CPU; the CPU code
is more or less the same (basically just without the offload code and without the
target and teams distribute constructs). On the CPU, the code yields the correct results.
However, I can change the target code so that it yields correct results
on the GPU as well.
Within the function for the Cholesky decomposition there is a loop at line 1610
of mathfunctions_mpi.h:

    T tmp = 0, temp4 = 0;
    #pragma omp target parallel for simd reduction(+:tmp) device(policy.devicenum)
    for (size_t k = 0; k < c; ++k)
    {
        const T tmp3 = tL(c, k);
        tmp += tmp3 * tmp3;
    }
If I change it to this,

    T tmp = 0, temp4 = 0;
    #pragma omp target parallel for simd device(policy.devicenum)
    for (size_t k = 0; k < c; ++k)
    {
        const T tmp3 = tL(c, k);
        tmp += tmp3 * tmp3;
    }
I get correct results, but that makes no sense: the loop is clearly a reduction, and
without the clause the concurrent updates to tmp would even be a data race.
The CPU code looks like this:

    #pragma omp parallel for simd shared(L) reduction(+:tmp)
    for (size_t k = z; k < c; ++k)
    {
        const T tmp3 = L(c, k);
        tmp -= tmp3 * tmp3;
    }
And on cpu this reduction delivers the correct results.
A reduction clause on a parallel for loop should work on the target the same way as on
the host, so it makes no sense that the output becomes correct if and only if I remove
the reduction clause from the device code.
Therefore I suspect a miscompilation by gcc 15.2 for the nvptx target.
I compiled with
-fopenmp -foffload=nvptx-none -g -fno-stack-protector -Wextra
-fno-strict-aliasing -fwrapv -fno-aggressive-loop-optimizations and got no
warnings.
Interestingly, if I compile with -fsanitize=address,undefined, then it
complains:
"lto1: error: variable '*.Lubsan_data1549' was referenced in outsourced code
but not marked that it is contained in outsourced code."
and similarly for '*.Lubsan_data1550', '*.Lubsan_data1565', '*.Lubsan_data1566',
'*.Lubsan_data1567', and '*.Lubsan_data1568'.
None of the variables in my code are named like this; judging from the message,
these are compiler-generated UBSan data objects, not something I wrote.
Anyway, this probably has nothing to do with the likely miscompiled for loop.
I have attached the .ii file produced with -save-temps, as requested for bug
reports, and I will add the source code of my library with headers and the
CMakeLists.txt as a tar.gz archive.