https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280

--- Comment #8 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Yes, when I use this:




template <typename T>
void GPU_Math_Functions<T>::matrix_multiply_dot_g(const DataBlock<T>& A, const DataBlock<T>& B, DataBlock<T>& C, int dev, bool update_host)
{
    const size_t rows=A.dpextents[0];
    const size_t cols=B.dpextents[1];
    const size_t inner_dim=A.dpextents[1];

    // These helpers check isdevptr to see whether the data was allocated with malloc.
    // They only offload if that is not the case.
    typename DataBlock_GPU_Memory_Functions<T>::OffloadHelperConst offloadA(A, dev, false);
    typename DataBlock_GPU_Memory_Functions<T>::OffloadHelperConst offloadB(B, dev, false);
    typename DataBlock_GPU_Memory_Functions<T>::OffloadHelper offloadC(C, dev, true, update_host);

    const size_t Astr0=A.dpstrides[0];
    const size_t Astr1=A.dpstrides[1];
    const size_t Bstr0=B.dpstrides[0];
    const size_t Bstr1=B.dpstrides[1];
    const size_t Cstr0=C.dpstrides[0];
    const size_t Cstr1=C.dpstrides[1];

    #pragma omp target teams distribute shared(A,B,C) device(dev)
    for (size_t i = 0; i < rows; ++i)
        #pragma omp parallel for shared(A,B,C)
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = T(0);
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1] = sum;
        }
}


then the multiplications on host and GPU agree:

[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], 
 [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], 
 [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], 
 [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], 
 [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], 
 [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], 
 [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], 
 [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], 
 [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], 
 [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], 
 [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], 
 [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
The header In_Kernel_mathfunctions executes the math functions either on the host or runs them in parallel; function names abbreviated with "w" mean "with parallel for".
By default update_host is set to true. If one has several calculations on the GPU, this may not be desired, and it can be switched to false (a call sketch follows after the output below).
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], 
 [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], 
 [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], 
 [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], 
 [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], 
 [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], 
 [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], 
 [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], 
 [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], 
 [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], 
 [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], 
 [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
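For illustration, a minimal call sketch of update_host=false when chaining GPU work. It assumes the signature shown above and, as a further assumption, that the function can be invoked through the class qualifier as written; the DataBlock<double> objects A, B and C are prepared elsewhere:

// Hypothetical usage sketch, not taken from the library's test code.
// update_host=false keeps C on the device so that a following GPU
// calculation can reuse it without a copy back to the host.
GPU_Math_Functions<double>::matrix_multiply_dot_g(A, B, C, /*dev=*/0, /*update_host=*/false);
// ... further GPU-side work that consumes C ...
// A later call with update_host=true (the default) copies results back.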
CPU_ONLY lets it multiply on the CPU. GPU_ONLY executes on the GPU. AUTO lets the library decide based on whether the data is already on the GPU, the algorithm, and the data size.
Supplying nullptr instead of a pointer to Math_Functions_Policy lets the library use a configurable global default (a sketch of this selection logic follows after the output below).
As above, update_host defaults to true and can be switched to false if several calculations run on the GPU.
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], 
 [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], 
 [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], 
 [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], 
 [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], 
 [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], 
 [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], 
 [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], 
 [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], 
 [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], 
 [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], 
 [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
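To make the policy behaviour concrete, here is a small self-contained sketch of the described selection logic; it is not the library's actual API, and every name except CPU_ONLY, GPU_ONLY, AUTO, Math_Functions_Policy and update_host is a placeholder:

#include <cstddef>

// Illustrative stand-in for Math_Functions_Policy; it mirrors the described
// behaviour, not the real header.
enum class ExecTarget { CPU_ONLY, GPU_ONLY, AUTO };

struct PolicySketch {
    ExecTarget mode  = ExecTarget::AUTO;
    bool update_host = true;   // default: copy results back to the host
};

// Configurable global default, used when nullptr is supplied instead of a policy.
static PolicySketch global_default_policy{};

// AUTO decides based on whether the data already resides on the GPU and on the
// problem size; the real library also weighs the algorithm, and the size
// threshold below is only a placeholder.
inline bool run_on_gpu(const PolicySketch* p, bool data_already_on_gpu, std::size_t n)
{
    const PolicySketch& pol = p ? *p : global_default_policy;
    switch (pol.mode) {
        case ExecTarget::CPU_ONLY: return false;
        case ExecTarget::GPU_ONLY: return true;
        case ExecTarget::AUTO:     return data_already_on_gpu || n > (1u << 20);
    }
    return false;
}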


But teams distribute parallel for supports collapse, and also shared, so one should be able to use it. For matrices that are very rectangular, not using collapse(2) is a severe performance downgrade here.
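For reference, a sketch of what the collapsed form of the same loop nest would look like; it only rearranges the pragmas of the function above (rows, cols, inner_dim, the strides and dpdata as defined there) and, per the OpenMP specification, it should be equally valid:

// Collapsed variant of the loop nest above: collapse(2) spreads the combined
// i/j iteration space over teams and threads, which matters for very
// rectangular matrices.
#pragma omp target teams distribute parallel for collapse(2) shared(A,B,C) device(dev)
for (size_t i = 0; i < rows; ++i)
    for (size_t j = 0; j < cols; ++j)
    {
        T sum = T(0);
        #pragma omp simd reduction(+:sum)
        for (size_t k = 0; k < inner_dim; ++k)
            sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
        C.dpdata[i*Cstr0+j*Cstr1] = sum;
    }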



But also: the entire code, even with the advanced algorithms, does not deviate from the standard at all, I think. For the LU decomposition with the advanced algorithm, there should not be errors like this:

libgomp: cuCtxSynchronize error: an illegal memory access was encountered

libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal
memory access was encountered

libgomp: cuMemFree_v2 error: an illegal memory access was encountered

libgomp: device finalization failed



Clang has no support for simd on OpenMP target devices, but otherwise it compiles my code (apart from warnings that there is no OpenMP simd on the target).

I think having something not implemented is better than producing wrong numbers without errors, or libgomp errors like these.

Wherever this comes from: in June, I could swear my code worked with GCC.
