https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121818
Bug ID: 121818
Summary: miscompilation of parallel for reduction on nvptx
target in a cholesky decomposition
Product: gcc
Version: 15.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: schulz.benjamin at googlemail dot com
Target Milestone: ---
Created attachment 62320
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62320&action=edit
arraytest.xnvptx-none.ii
Hi there, I have a program which performs a Cholesky decomposition on the target device;
the function starts at line 1451 of mathfunctions_mpi.h.
With gcc 15.2 and the options -fopenmp -foffload=nvptx-none -g
-fno-stack-protector -Wextra -fno-strict-aliasing -fwrapv
-fno-aggressive-loop-optimizations -save-temps -Wall
the function yields the wrong output for the test matrix
210 -92 68 -33 -34 -4 118 -6
-92 318 -100 130 -153 -64 160 33
68 -100 204 -96 41 -69 -16 -26
-33 130 -96 338 -152 -51 12 22
-34 -153 41 -152 346 11 -30 -25
-4 -64 -69 -51 11 175 -79 5
118 160 -16 12 -30 -79 320 7
-6 33 -26 22 -25 5 7 239
I get:
14.4914 0 0 0 0 0 0 0
-6.3486 16.6642 0 0 0 0 0 0
4.69245 -4.2132 12.8152 0 0 0 0 0
-2.27722 6.9336 -4.37774 16.2965 0 0 0 0
-2.34622 -10.0752 0.74604 -5.16795 8.80012 0 0 0
-0.276026 -3.94573 -6.58037 -3.257 -10.6418 -nan 0 0
8.14277 12.7036 -0.0535879 -3.54515 25.8667 -nan -nan 0
-0.414039 1.82256 -1.27804 0.173372 1.53195 -nan -nan -nan
The function contains a decision whether it executes on the GPU or the CPU; the CPU code
is more or less the same (basically just without the offload code and without the
target and teams distribute constructs). On the CPU, the code yields the correct results.
However, I can change the target code so that it yields correct results
on the GPU as well.
Within the function for the Cholesky decomposition there is a loop at line 1610
of mathfunctions_mpi.h:

    T tmp = 0, temp4 = 0;
    #pragma omp target parallel for simd reduction(+:tmp) device(policy.devicenum)
    for (size_t k = 0; k < c; ++k)
    {
        const T tmp3 = tL(c, k);
        tmp += tmp3 * tmp3;
    }
If I change it to this,

    T tmp = 0, temp4 = 0;
    #pragma omp target parallel for simd device(policy.devicenum)
    for (size_t k = 0; k < c; ++k)
    {
        const T tmp3 = tL(c, k);
        tmp += tmp3 * tmp3;
    }
I get correct results, but that makes no sense: the loop is clearly a reduction, and
without the clause the concurrent updates to tmp would even be a data race.
The CPU code looks like this:

    #pragma omp parallel for simd shared(L) reduction(+:tmp)
    for (size_t k = z; k < c; ++k)
    {
        const T tmp3 = L(c, k);
        tmp -= tmp3 * tmp3;
    }
And on cpu this reduction delivers the correct results.
A reduction clause on a parallel for loop should work on the target the same way as on
the host, so it makes no sense that the output becomes correct if and only if I remove
the reduction clause from the device code.
Therefore I suspect a miscompilation by gcc 15.2 for the nvptx target.
I compiled with
-fopenmp -foffload=nvptx-none -g -fno-stack-protector -Wextra
-fno-strict-aliasing -fwrapv -fno-aggressive-loop-optimizations and got no
warnings.
Interestingly, if I compile with -fsanitize=address,undefined, then it
complains:
"lto1: error: variable '*.Lubsan_data1549' was referenced in outsourced code
but not marked that it is contained in outsourced code."
and similarly for '*.Lubsan_data1550', '*.Lubsan_data1565', '*.Lubsan_data1566',
'*.Lubsan_data1567', and '*.Lubsan_data1568'.
None of the variables in my code are named like this; judging from the message,
these are compiler-generated UBSan data objects, not something I wrote.
Anyway, this probably has nothing to do with the likely miscompiled for loop.
I have attached the .ii file produced with -save-temps, as requested for bug
reports, and I will add the source code of my library with headers and the
CMakeLists.txt as a tar.gz archive.