https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121818

--- Comment #2 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Hi — no, it turned out that when I remove the reduction, I just get large
numerical errors; the result is still not correct.

When I, however, remove the reduction clauses and use shared for the reduction
variables in the target teams distribute parallel for simd loops, together with
a
#pragma omp atomic before the update statements — which should mimic the effect
of the reduction — i.e. with this code:

            T tmp=0,temp4=0;
            #pragma omp target map(tofrom:tmp)map(to:c)
device(policy.devicenum)
            {
            tmp=tempA(c,c);
            }

            #pragma omp target data map(tofrom:tmp)map(to:c)
device(policy.devicenum)
            #pragma omp target teams distribute parallel for simd
shared(tL,tmp)  device(policy.devicenum)
            for (size_t k = 0; k < c; ++k)
            {
                const T tmp3=tL(c,k);
                #pragma omp atomic
                tmp-= tmp3 * tmp3;
            }
            temp4=sqrt(tmp);
            #pragma omp target map(tofrom:temp4)
map(to:c)device(policy.devicenum)
            {
                tL(c,c)=temp4;
            }


            #pragma omp target  data map(to:temp4, c)device(policy.devicenum)
            #pragma omp target teams distribute parallel for shared(tempA,tL)
device(policy.devicenum)
            for (size_t i = c + 1; i < n; ++i)
            {
                T tmp2 = tempA(i, c);
                #pragma omp simd
                for (size_t k = 0; k < c; ++k)
                {
                    #pragma omp atomic
                    tmp2 -= tL(i, k) * tL(c, k);
                }
                tL(i, c)=tmp2/temp4;
            }
        }


then I get this result:


14.4914 0 0 0 0 0 0 0 
-6.3486 17.8326 0 0 0 0 0 0 
4.69245 -3.93715 14.2829 0 0 0 0 0 
-2.27722 6.47932 -4.18713 18.3848 0 0 0 0 
-2.34622 -9.4151 1.04607 -5.00193 15.0246 0 0 0 
-0.276026 -3.68721 -5.75668 -2.81983 -5.05116 10.9653 0 0 
8.14277 11.8713 -0.523037 -2.64159 13.7386 5.61001 10.2721 0 
-0.414039 1.70315 -1.21485 0.268437 0.689188 0.760506 -3.92146 15.3095 


One sees that this is somewhat more correct than the NaNs, but it is still
wrong; the result on the CPU is this:

14.4914 0 0 0 0 0 0 0 
-6.3486 16.6642 0 0 0 0 0 0 
4.69245 -4.2132 12.8152 0 0 0 0 0 
-2.27722 6.9336 -4.37774 16.2965 0 0 0 0 
-2.34622 -10.0752 0.74604 -5.16795 14.5506 0 0 0 
-0.276026 -3.94573 -6.58037 -3.257 -2.84005 9.86812 0 0 
8.14277 12.7036 -0.0535879 -3.54515 6.79111 -1.94966 5.46098 0 
-0.414039 1.82256 -1.27804 0.173372 -0.395814 0.314913 -1.63587 15.1958 

One sees things like sign flips — these can hardly be rounding errors — and the
remaining values are then off as well...

Again, with the reduction in the loops,

  T tmp=0,temp4=0;
            #pragma omp target map(tofrom:tmp)map(to:c)
device(policy.devicenum)
            {
            tmp=tempA(c,c);
            }

            #pragma omp target data map(tofrom:tmp)map(to:c)
device(policy.devicenum)
            #pragma omp target teams distribute parallel for simd shared(tL)
reduction(-:tmp) device(policy.devicenum)
            for (size_t k = 0; k < c; ++k)
            {
                const T tmp3=tL(c,k);
                tmp-= tmp3 * tmp3;
            }
            temp4=sqrt(tmp);
            #pragma omp target map(tofrom:temp4)
map(to:c)device(policy.devicenum)
            {
                tL(c,c)=temp4;
            }


            #pragma omp target  data map(to:temp4, c)device(policy.devicenum)
            #pragma omp target teams distribute parallel for shared(tempA,tL)
device(policy.devicenum)
            for (size_t i = c + 1; i < n; ++i)
            {
                T tmp2 = tempA(i, c);
                #pragma omp simd reduction(-:tmp2)
                for (size_t k = 0; k < c; ++k)
                {
                    tmp2 -= tL(i, k) * tL(c, k);
                }
                tL(i, c)=tmp2/temp4;
            }
        }



it looks like this:


14.4914 0 0 0 0 0 0 0 
-6.3486 16.6642 0 0 0 0 0 0 
4.69245 -4.2132 12.8152 0 0 0 0 0 
-2.27722 6.9336 -4.37774 16.2965 0 0 0 0 
-2.34622 -10.0752 0.74604 -5.16795 8.80012 0 0 0 
-0.276026 -3.94573 -6.58037 -3.257 -10.6418 -nan 0 0 
8.14277 12.7036 -0.0535879 -3.54515 25.8667 -nan -nan 0 
-0.414039 1.82256 -1.27804 0.173372 1.53195 -nan -nan -nan 

That makes no sense. A reduction should yield the same result as a shared
variable in a teams-of-threads parallel for when the update statement is
guarded by a #pragma omp atomic update.

Is this a cuda/gpu driver problem? Or something miscompiled here?

Reply via email to