[Bug libgomp/122280] target teams distribute parallel for collapse(2) yields different results in a matmul than separate loops (one with omp target teams distribute the second with omp parallel for) on nvptx target. Clang compiles the code correctly

schulz.benjamin at googlemail dot com via Gcc-bugs Thu, 30 Oct 2025 20:23:22 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280


--- Comment #6 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
With clang, I get the output shown below. Clang is (sadly) not able to do simd
on the device, but at least it produces this output.

Obviously, the collapse(2) clause can, and should, be usable without problems
on the GPU loops of a matrix multiplication...
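
For reference, the two loop structures being compared look roughly like the
sketch below. This is not the library's code: the function names
matmul_collapse and matmul_split, the flat row-major std::vector<double>
layout and the map clauses are assumptions made purely for illustration. On
the host (or when the pragmas are ignored) both functions should return the
same matrix; the report is that GCC on nvptx gives different results for the
collapse(2) form.

```cpp
#include <cstddef>
#include <vector>

// C = A * B for row-major matrices: A is n x k, B is k x m, C is n x m.

// Variant 1 (hypothetical name): one combined construct, collapse(2) over i and j.
std::vector<double> matmul_collapse(const std::vector<double>& A,
                                    const std::vector<double>& B,
                                    std::size_t n, std::size_t k, std::size_t m)
{
    std::vector<double> C(n * m, 0.0);
    const double* a = A.data();
    const double* b = B.data();
    double* c = C.data();

    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: a[0:n*k], b[0:k*m]) map(from: c[0:n*m])
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;
            for (std::size_t l = 0; l < k; ++l)
                sum += a[i * k + l] * b[l * m + j];
            c[i * m + j] = sum;
        }
    return C;
}

// Variant 2 (hypothetical name): teams distribute on the outer loop,
// a separate parallel for on the inner loop.
std::vector<double> matmul_split(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 std::size_t n, std::size_t k, std::size_t m)
{
    std::vector<double> C(n * m, 0.0);
    const double* a = A.data();
    const double* b = B.data();
    double* c = C.data();

    #pragma omp target teams distribute \
        map(to: a[0:n*k], b[0:k*m]) map(from: c[0:n*m])
    for (std::size_t i = 0; i < n; ++i) {
        #pragma omp parallel for
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;
            for (std::size_t l = 0; l < k; ++l)
                sum += a[i * k + l] * b[l * m + j];
            c[i * m + j] = sum;
        }
    }
    return C;
}
```

Calling either function with A = [[1, 2, 3], [4, 5, 6]] and the transpose of
B = [[6, 5, 4], [3, 2, 1]] should give [[28, 10], [73, 28]], which matches the
small example in the output below.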




This demonstrates basic mathematical abilities of the library on gpu, cpu and
with the message passing interface
We can also use a more simplified interface for writing expressions. Although
evaluations of more than one operator are not yet supported.
define A
[[1, 2, 3], 
 [4, 5, 6]]
define B
[[6, 5, 4], 
 [3, 2, 1]]
addition of A and B
[[7, 7, 7], 
 [7, 7, 7]]
multiplication of A and transpose of B
[[28, 10], 
 [73, 28]]
Subtraction of A. one can also assign the type later, as in this example, but
E=A-B would also work here
But here we set a policy to do this on gpu
[[-5, -3, -1], 
 [1, 3, 5]]
two vectors
[1, 2, 3]
[6, 5, 4]
a scalar product between two vectors
28
28We define two matrices
the same code base can have the strides and extents on heap(vector) or on the
stack(array). 
The library works as well with col major data but in this example, we define
row-major data
Ordinary matrix multiplication, forced on gpu with a policy object
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 
 [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1], 
 [2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11], 
 [11, 9, 7, 5, 3, 1, 12, 10, 8, 6, 4, 2], 
 [3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10], 
 [10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3], 
 [4, 8, 12, 3, 7, 11, 2, 6, 10, 1, 5, 9], 
 [9, 5, 1, 7, 3, 11, 8, 4, 12, 6, 2, 10], 
 [5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 12], 
 [12, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5], 
 [6, 1, 8, 3, 10, 5, 12, 7, 2, 9, 4, 11], 
 [11, 2, 9, 4, 12, 7, 3, 10, 5, 1, 8, 6]]
[[12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1], 
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 
 [3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10], 
 [10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3], 
 [5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 12], 
 [12, 9, 6, 3, 10, 7, 4, 1, 8, 5, 2, 11], 
 [2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11], 
 [11, 8, 5, 2, 9, 6, 3, 12, 7, 4, 1, 10], 
 [3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10], 
 [10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3], 
 [4, 8, 12, 3, 7, 11, 2, 6, 10, 1, 5, 9], 
 [9, 5, 1, 7, 3, 11, 8, 4, 12, 6, 2, 10]]
the header In_Kernel_mathfunctions executes math functions either on the host
or can run them in parallel. Abbreviations v just with simd, s without parallel
loops
per default update_host is set to true. If one has several calculations on gpu,
this may not be desired and can be switched to false
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], 
 [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], 
 [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], 
 [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], 
 [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], 
 [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], 
 [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], 
 [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], 
 [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], 
 [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], 
 [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], 
 [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
the header In_Kernel_mathfunctions executes math functions either on the host
or can run them in parallel. Abbreviations w mean with parallel for
per default update_host is set to true. If one has several calculations on gpu,
this may not be desired and can be switched to false
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], 
 [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], 
 [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], 
 [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], 
 [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], 
 [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], 
 [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], 
 [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], 
 [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], 
 [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], 
 [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], 
 [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
CPU_ONLY lets it multiply on CPU. GPU_ONLY executes on gpu. AUTO lets the
library decide based on whether the data is already on gpu, the algorithm, and
the data size.
supplying nullptr instead of a pointer to Math_Functions_Policy lets the
library use a global default that can be configured.
per default update_host is set to true. If one has several calculations on gpu,
this may not be desired and can be switched to false
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], 
 [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], 
 [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], 
 [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], 
 [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], 
 [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], 
 [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], 
 [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], 
 [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], 
 [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], 
 [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], 
 [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
We can also use the Strassen algorithm or its Winograd variant for the
multiplication.
It may offload on gpu. With the Message Passing Interface enabled, it can do so
in parallel. 
otherwise it offloads sequentially. The algorithm can also work entirely on
device with devicepointers to the data
in auto mode, the following default thresholds are set in mathfunctions.h and
can be changed for convenience
max_problem_size_for_gpu;This is the size of the gpu memory, data larger than
this is not offloaded
 default_cubic_treshold = 256;The default number of elements at which matrices
are auto offloaded in multiplication
 default_square_treshold = 1000;The default number of elements at which
matrices are auto offloaded for addition
 default_linear_treshold = 1000000;The default number of elements at which
vectors are auto offloaded for addition

we now set it on gpu and set the size when to stop recursion to 2, per default,
this is at 64
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], 
 [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], 
 [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], 
 [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], 
 [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], 
 [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], 
 [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], 
 [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], 
 [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], 
 [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], 
 [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], 
 [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
We create a 4x4 matrix that owns its own data buffer in a memory-mapped file and
then fill the buffer and print it
usually, the own data buffer is more interesting for storing the results of the
computation and for intermediary evaluations
[[0, 1, 2, 3], 
 [4, 5, 6, 7], 
 [8, 9, 10, 11], 
 [12, 13, 14, 15]]
now we create a 4x4 matrix with data in a separate vector
[[2, 2, 2, 2], 
 [2, 2, 2, 2], 
 [2, 2, 2, 2], 
 [2, 2, 2, 2]]
now we make a shallow copy of the first matrix on the second
[[0, 1, 2, 3], 
 [4, 5, 6, 7], 
 [8, 9, 10, 11], 
 [12, 13, 14, 15]]
We test the shallow copy by setting the first element of the first matrix to 42
and then print the first and second matrix
[[42, 1, 2, 3], 
 [4, 5, 6, 7], 
 [8, 9, 10, 11], 
 [12, 13, 14, 15]]
[[42, 1, 2, 3], 
 [4, 5, 6, 7], 
 [8, 9, 10, 11], 
 [12, 13, 14, 15]]
Now we test more advanced algorithms




Now a cholesky decomposition on CPU
The result is put in an mdspan_data, which allocates its own resources with the
dataset
[[210, -92, 68, -33, -34, -4, 118, -6], 
 [-92, 318, -100, 130, -153, -64, 160, 33], 
 [68, -100, 204, -96, 41, -69, -16, -26], 
 [-33, 130, -96, 338, -152, -51, 12, 22], 
 [-34, -153, 41, -152, 346, 11, -30, -25], 
 [-4, -64, -69, -51, 11, 175, -79, 5], 
 [118, 160, -16, 12, -30, -79, 320, 7], 
 [-6, 33, -26, 22, -25, 5, 7, 239]]
[[14.4914, 0, 0, 0, 0, 0, 0, 0], 
 [-6.3486, 16.6642, 0, 0, 0, 0, 0, 0], 
 [4.69245, -4.2132, 12.8152, 0, 0, 0, 0, 0], 
 [-2.27722, 6.9336, -4.37774, 16.2965, 0, 0, 0, 0], 
 [-2.34622, -10.0752, 0.74604, -5.16795, 14.5506, 0, 0, 0], 
 [-0.276026, -3.94573, -6.58037, -3.257, -2.84005, 9.86812, 0, 0], 
 [8.14277, 12.7036, -0.0535879, -3.54515, 6.79111, -1.94966, 5.46098, 0], 
 [-0.414039, 1.82256, -1.27804, 0.173372, -0.395814, 0.314913, -1.63587, 15.1958]]
we can verify the cholesky decomposition by multiplication
We can create a transpose with the base class DataBlock, but also with mdspan
[[210, -92, 68, -33, -34, -4, 118, -6], 
 [-92, 318, -100, 130, -153, -64, 160, 33], 
 [68, -100, 204, -96, 41, -69, -16, -26], 
 [-33, 130, -96, 338, -152, -51, 12, 22], 
 [-34, -153, 41, -152, 346, 11, -30, -25], 
 [-4, -64, -69, -51, 11, 175, -79, 5], 
 [118, 160, -16, 12, -30, -79, 320, 7], 
 [-6, 33, -26, 22, -25, 5, 7, 239]]
Now the cholesky decomposition is entirely done on GPU
[[14.4914, 0, 0, 0, 0, 0, 0, 0], 
 [-6.3486, 16.6642, 0, 0, 0, 0, 0, 0], 
 [4.69245, -4.2132, 12.8152, 0, 0, 0, 0, 0], 
 [-2.27722, 6.9336, -4.37774, 16.2965, 0, 0, 0, 0], 
 [-2.34622, -10.0752, 0.74604, -5.16795, 14.5506, 0, 0, 0], 
 [-0.276026, -3.94573, -6.58037, -3.257, -2.84005, 9.86812, 0, 0], 
 [8.14277, 12.7036, -0.0535879, -3.54515, 6.79111, -1.94966, 5.46098, 0], 
 [-0.414039, 1.82256, -1.27804, 0.173372, -0.395814, 0.314913, -1.63587, 15.1958]]
we can verify the cholesky decomposition by multiplication
Here we create the transpose with mdspan
[[210, -92, 68, -33, -34, -4, 118, -6], 
 [-92, 318, -100, 130, -153, -64, 160, 33], 
 [68, -100, 204, -96, 41, -69, -16, -26], 
 [-33, 130, -96, 338, -152, -51, 12, 22], 
 [-34, -153, 41, -152, 346, 11, -30, -25], 
 [-4, -64, -69, -51, 11, 175, -79, 5], 
 [118, 160, -16, 12, -30, -79, 320, 7], 
 [-6, 33, -26, 22, -25, 5, 7, 239]]
With the advanced algorithms on GPU
[[210, -92, 68, -33, -34, -4, 118, -6], 
 [-92, 318, -100, 130, -153, -64, 160, 33], 
 [68, -100, 204, -96, 41, -69, -16, -26], 
 [-33, 130, -96, 338, -152, -51, 12, 22], 
 [-34, -153, 41, -152, 346, 11, -30, -25], 
 [-4, -64, -69, -51, 11, 175, -79, 5], 
 [118, 160, -16, 12, -30, -79, 320, 7], 
 [-6, 33, -26, 22, -25, 5, 7, 239]]
[[14.4914, 0, 0, 0, 0, 0, 0, 0], 
 [-6.3486, 16.6642, 0, 0, 0, 0, 0, 0], 
 [4.69245, -4.2132, 12.8152, 0, 0, 0, 0, 0], 
 [-2.27722, 6.9336, -4.37774, 16.2965, 0, 0, 0, 0], 
 [-2.34622, -10.0752, 0.74604, -5.16795, 14.5506, 0, 0, 0], 
 [-0.276026, -3.94573, -6.58037, -3.257, -2.84005, 9.86812, 0, 0], 
 [8.14277, 12.7036, -0.0535879, -3.54515, 6.79111, -1.94966, 5.46098, 0], 
 [-0.414039, 1.82256, -1.27804, 0.173372, -0.395814, 0.314913, -1.63587, 15.1958]]
we can verify the cholesky decomposition by multiplication
[[210, -92, 68, -33, -34, -4, 118, -6], 
 [-92, 318, -100, 130, -153, -64, 160, 33], 
 [68, -100, 204, -96, 41, -69, -16, -26], 
 [-33, 130, -96, 338, -152, -51, 12, 22], 
 [-34, -153, 41, -152, 346, 11, -30, -25], 
 [-4, -64, -69, -51, 11, 175, -79, 5], 
 [118, 160, -16, 12, -30, -79, 320, 7], 
 [-6, 33, -26, 22, -25, 5, 7, 239]]
Now we do the same with the lu decomposition of
[[-3, 3, -3, 5, 2, 7, 4, 2], 
 [-2, 4, 2, -10, -4, -2, -10, 1], 
 [-3, 0, 8, 6, -3, -8, -8, -10], 
 [-6, -1, -4, -2, -4, -2, -3, 1], 
 [-9, -10, 5, -6, -8, 1, -3, -8], 
 [-10, -8, -6, 4, 3, -8, -10, -6], 
 [3, -4, -2, 4, 4, -1, 2, 8], 
 [-4, 6, 9, -7, -6, -4, 2, 4]]
on CPU
[[1, 0, 0, 0, 0, 0, 0, 0], 
 [0.666667, 1, 0, 0, 0, 0, 0, 0], 
 [1, -1.5, 1, 0, 0, 0, 0, 0], 
 [2, -3.5, 0.941176, 1, 0, 0, 0, 0], 
 [3, -9.5, 3.05882, 2.19567, 1, 0, 0, 0], 
 [3.33333, -9, 2.35294, 2.15673, 1.48073, 1, 0, 0], 
 [-1, -0.5, -0.176471, 0.025, 0.206349, 0.178946, 1, 0], 
 [1.33333, 1, 0.529412, -0.238462, 0.015873, -0.0594817, -6.32693, 1]]
[[-3, 3, -3, 5, 2, 7, 4, 2], 
 [0, 2, 4, -13.3333, -5.33333, -6.66667, -12.6667, -0.333333], 
 [0, 0, 17, -19, -13, -25, -31, -12.5], 
 [0, 0, 0, -40.7843, -14.4314, -15.8039, -26.1569, 7.59804], 
 [0, 0, 0, 0, 6.78462, 27.8375, 16.9221, 4.38582], 
 [0, 0, 0, 0, 0, -39.6447, -33.0359, -9.13602], 
 [0, 0, 0, 0, 0, 0, -2.73024, 8.16734], 
 [0, 0, 0, 0, 0, 0, 0, 61.1573]]
we can verify the lu decomposition by multiplication
[[-3, 3, -3, 5, 2, 7, 4, 2], 
 [-2, 4, 2, -10, -4, -2, -10, 1], 
 [-3, 0, 8, 6, -3, -8, -8, -10], 
 [-6, -1, -4, -2, -4, -2, -3, 1], 
 [-9, -10, 5, -6, -8, 1, -3, -8], 
 [-10, -8, -6, 4, 3, -8, -10, -6], 
 [3, -4, -2, 4, 4, -1, 2, 8], 
 [-4, 6, 9, -7, -6, -4, 2, 4]]
Entirely on gpu
[[1, 0, 0, 0, 0, 0, 0, 0], 
 [0.666667, 1, 0, 0, 0, 0, 0, 0], 
 [1, -1.5, 1, 0, 0, 0, 0, 0], 
 [2, -3.5, 0.941176, 1, 0, 0, 0, 0], 
 [3, -9.5, 3.05882, 2.19567, 1, 0, 0, 0], 
 [3.33333, -9, 2.35294, 2.15673, 1.48073, 1, 0, 0], 
 [-1, -0.5, -0.176471, 0.025, 0.206349, 0.178946, 1, 0], 
 [1.33333, 1, 0.529412, -0.238462, 0.015873, -0.0594817, -6.32693, 1]]
[[-3, 3, -3, 5, 2, 7, 4, 2], 
 [0, 2, 4, -13.3333, -5.33333, -6.66667, -12.6667, -0.333333], 
 [0, 0, 17, -19, -13, -25, -31, -12.5], 
 [0, 0, 0, -40.7843, -14.4314, -15.8039, -26.1569, 7.59804], 
 [0, 0, 0, 0, 6.78462, 27.8375, 16.9221, 4.38582], 
 [0, 0, 0, 0, 0, -39.6447, -33.0359, -9.13602], 
 [0, 0, 0, 0, 0, 0, -2.73024, 8.16734], 
 [0, 0, 0, 0, 0, 0, 0, 61.1573]]
we can verify the lu decomposition by multiplication
[[-3, 3, -3, 5, 2, 7, 4, 2], 
 [-2, 4, 2, -10, -4, -2, -10, 1], 
 [-3, 0, 8, 6, -3, -8, -8, -10], 
 [-6, -1, -4, -2, -4, -2, -3, 1], 
 [-9, -10, 5, -6, -8, 1, -3, -8], 
 [-10, -8, -6, 4, 3, -8, -10, -6], 
 [3, -4, -2, 4, 4, -1, 2, 8], 
 [-4, 6, 9, -7, -6, -4, 2, 4]]
With the advanced algorithms on GPU
[[1, 0, 0, 0, 0, 0, 0, 0], 
 [0.666667, 1, 0, 0, 0, 0, 0, 0], 
 [1, -1.5, 1, 0, 0, 0, 0, 0], 
 [2, -3.5, 0.941176, 1, 0, 0, 0, 0], 
 [3, -9.5, 3.05882, 2.19567, 1, 0, 0, 0], 
 [3.33333, -9, 2.35294, 2.15673, 1.48073, 1, 0, 0], 
 [-1, -0.5, -0.176471, 0.025, 0.206349, 0.178946, 1, 0], 
 [1.33333, 1, 0.529412, -0.238462, 0.015873, -0.0594817, -6.32693, 1]]
we can verify the lu decomposition by multiplication
[[-3, 3, -3, 5, 2, 7, 4, 2], 
 [-2, 4, 2, -10, -4, -2, -10, 1], 
 [-3, 0, 8, 6, -3, -8, -8, -10], 
 [-6, -1, -4, -2, -4, -2, -3, 1], 
 [-9, -10, 5, -6, -8, 1, -3, -8], 
 [-10, -8, -6, 4, 3, -8, -10, -6], 
 [3, -4, -2, 4, 4, -1, 2, 8], 
 [-4, 6, 9, -7, -6, -4, 2, 4]]
Now we do the same with the qr decomposition
[[-4, 9, 4, 0, -3, -4, 8, 0], 
 [0, -7, -3, -8, -9, 1, -5, -9], 
 [-10, 1, 1, 6, -1, 5, 4, 4], 
 [8, 1, 9, -8, -6, 8, -4, -2], 
 [-4, 7, -7, 3, 7, -2, -9, 9], 
 [4, -4, 1, -3, 4, -8, 3, 6], 
 [-7, 7, -3, -7, -9, -5, -1, -7], 
 [7, 1, -9, -1, -7, 3, 5, 4]]
On cpu
[[-0.227185, 0.526694, 0.290362, -0.0520073, -0.176765, -0.456512, -0.0592619, 0.583785],
 [0, -0.498224, -0.196168, -0.61069, -0.117755, 0.128255, -0.106856, 0.546457],
 [-0.567962, -0.213524, 0.137142, 0.194743, -0.395699, 0.23837, 0.599478, 0.0481964],
 [0.454369, 0.298934, 0.520144, -0.313417, -0.0917011, 0.511886, 0.254197, 0.0188181],
 [-0.227185, 0.384344, -0.41661, 0.0823878, 0.528548, 0.471375, 0.125855, 0.320809],
 [0.227185, -0.17082, 0.0331166, -0.118499, 0.476134, -0.4619, 0.677855, 0.0672778],
 [-0.397573, 0.298934, -0.138174, -0.682534, -0.00122032, -0.11488, 0.0681924, -0.49978],
 [0.397573, 0.270464, -0.627772, 0.0389353, -0.53276, -0.0869285, 0.284785, 0.0260316]]
[[17.6068, -7.04273, 2.04466, -6.0204, -1.36311, 3.52136, -0.795147, 0.511166], 
 [-8.88178e-16, 14.0499, -0.11388, -0.384344, -1.25268, -1.36656, 1.73667, 4.45554],
 [-8.88178e-16, -1.11022e-16, 15.5822, -1.52315, 0.490587, 2.86508, 2.61988, -3.82087],
 [-3.88578e-16, -1.06859e-15, -1.72085e-15, 13.9028, 13.311, 2.37641, 4.45025, 11.866],
 [-8.88178e-16, 1.77636e-15, 8.88178e-16, 1.88738e-15, 11.8807, -8.58115, -8.03244, 5.15164],
 [-1.11022e-15, 1.9984e-15, 8.88178e-16, 1.77636e-15, 2.22045e-15, 10.3073, -11.3353, 0.702835],
 [-6.21725e-15, 1.19349e-14, 8.43769e-15, 1.11577e-14, 4.66294e-15, 1.14353e-14, 3.69791, 8.71283],
 [2.77001e-14, -5.00294e-14, -4.15779e-14, -4.54498e-14, -2.14828e-14, -4.61714e-14, -4.66294e-15, 2.13057]]
we can verify the qr decomposition by multiplication
[[-4, 9, 4, -2.83387e-14, -3, -4, 8, -1.86517e-14], 
 [1.66175e-14, -7, -3, -8, -9, 1, -5, -9], 
 [-10, 1, 1, 6, -1, 5, 4, 4], 
 [8, 1, 9, -8, -6, 8, -4, -2], 
 [-4, 7, -7, 3, 7, -2, -9, 9], 
 [4, -4, 1, -3, 4, -8, 3, 6], 
 [-7, 7, -3, -7, -9, -5, -1, -7], 
 [7, 1, -9, -1, -7, 3, 5, 4]]
On gpu
[[-0.227185, 0.526694, 0.290362, -0.0520073, -0.176765, -0.456512, -0.0592619, 0.583785],
 [0, -0.498224, -0.196168, -0.61069, -0.117755, 0.128255, -0.106856, 0.546457],
 [-0.567962, -0.213524, 0.137142, 0.194743, -0.395699, 0.23837, 0.599478, 0.0481964],
 [0.454369, 0.298934, 0.520144, -0.313417, -0.0917011, 0.511886, 0.254197, 0.0188181],
 [-0.227185, 0.384344, -0.41661, 0.0823878, 0.528548, 0.471375, 0.125855, 0.320809],
 [0.227185, -0.17082, 0.0331166, -0.118499, 0.476134, -0.4619, 0.677855, 0.0672778],
 [-0.397573, 0.298934, -0.138174, -0.682534, -0.00122032, -0.11488, 0.0681924, -0.49978],
 [0.397573, 0.270464, -0.627772, 0.0389353, -0.53276, -0.0869285, 0.284785, 0.0260316]]
[[17.6068, -7.04273, 2.04466, -6.0204, -1.36311, 3.52136, -0.795147, 0.511166], 
 [-8.88178e-16, 14.0499, -0.11388, -0.384344, -1.25268, -1.36656, 1.73667, 4.45554],
 [-8.88178e-16, -1.11022e-16, 15.5822, -1.52315, 0.490587, 2.86508, 2.61988, -3.82087],
 [-3.88578e-16, -1.06859e-15, -1.72085e-15, 13.9028, 13.311, 2.37641, 4.45025, 11.866],
 [-8.88178e-16, 1.77636e-15, 8.88178e-16, 1.88738e-15, 11.8807, -8.58115, -8.03244, 5.15164],
 [-1.11022e-15, 1.9984e-15, 8.88178e-16, 1.77636e-15, 2.22045e-15, 10.3073, -11.3353, 0.702835],
 [-6.21725e-15, 1.19349e-14, 8.43769e-15, 1.11577e-14, 4.66294e-15, 1.14353e-14, 3.69791, 8.71283],
 [2.77001e-14, -5.00294e-14, -4.15779e-14, -4.54498e-14, -2.14828e-14, -4.61714e-14, -4.66294e-15, 2.13057]]
we can verify the qr decomposition by multiplication
[[-4, 9, 4, -2.83387e-14, -3, -4, 8, -1.86517e-14], 
 [1.66175e-14, -7, -3, -8, -9, 1, -5, -9], 
 [-10, 1, 1, 6, -1, 5, 4, 4], 
 [8, 1, 9, -8, -6, 8, -4, -2], 
 [-4, 7, -7, 3, 7, -2, -9, 9], 
 [4, -4, 1, -3, 4, -8, 3, 6], 
 [-7, 7, -3, -7, -9, -5, -1, -7], 
 [7, 1, -9, -1, -7, 3, 5, 4]]
with the advanced algorithms on gpu 
[[-0.227185, 0.526694, 0.290362, -0.0520073, -0.176765, -0.456512, -0.0592619, 0.583785],
 [0, -0.498224, -0.196168, -0.61069, -0.117755, 0.128255, -0.106856, 0.546457],
 [-0.567962, -0.213524, 0.137142, 0.194743, -0.395699, 0.23837, 0.599478, 0.0481964],
 [0.454369, 0.298934, 0.520144, -0.313417, -0.0917011, 0.511886, 0.254197, 0.0188181],
 [-0.227185, 0.384344, -0.41661, 0.0823878, 0.528548, 0.471375, 0.125855, 0.320809],
 [0.227185, -0.17082, 0.0331166, -0.118499, 0.476134, -0.4619, 0.677855, 0.0672778],
 [-0.397573, 0.298934, -0.138174, -0.682534, -0.00122032, -0.11488, 0.0681924, -0.49978],
 [0.397573, 0.270464, -0.627772, 0.0389353, -0.53276, -0.0869285, 0.284785, 0.0260316]]
[[17.6068, -7.04273, 2.04466, -6.0204, -1.36311, 3.52136, -0.795147, 0.511166], 
 [1.11022e-16, 14.0499, -0.11388, -0.384344, -1.25268, -1.36656, 1.73667, 4.45554],
 [-1.55431e-15, 1.11022e-16, 15.5822, -1.52315, 0.490587, 2.86508, 2.61988, -3.82087],
 [-2.05391e-15, 1.60982e-15, -2.05391e-15, 13.9028, 13.311, 2.37641, 4.45025, 11.866],
 [1.66533e-15, -1.77636e-15, 3.33067e-16, -3.21965e-15, 11.8807, -8.58115, -8.03244, 5.15164],
 [4.08007e-15, -2.31759e-15, 1.97065e-15, -4.34375e-15, -1.85962e-15, 10.3073, -11.3353, 0.702835],
 [1.83742e-14, -1.39888e-14, 4.16334e-15, -2.16493e-14, -1.19349e-14, -5.4956e-15, 3.69791, 8.71283],
 [-7.1762e-14, 5.78843e-14, -1.53627e-14, 8.20038e-14, 4.60049e-14, 2.19963e-14, 6.09235e-15, 2.13057]]
we can verify the qr decomposition by multiplication
[[-4, 9, 4, 5.15967e-14, -3, -4, 8, 3.37508e-14], 
 [-3.93472e-14, -7, -3, -8, -9, 1, -5, -9], 
 [-10, 1, 1, 6, -1, 5, 4, 4], 
 [8, 1, 9, -8, -6, 8, -4, -2], 
 [-4, 7, -7, 3, 7, -2, -9, 9], 
 [4, -4, 1, -3, 4, -8, 3, 6], 
 [-7, 7, -3, -7, -9, -5, -1, -7], 
 [7, 1, -9, -1, -7, 3, 5, 4]]
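
The entries such as -2.83387e-14 in the reconstructed QR matrices are
floating-point rounding residue at positions where the original matrix holds
an exact 0, so a check that the product matches the input needs a tolerance
rather than ==. A minimal sketch of such a check follows; the helper name
approx_equal and the default tolerance of 1e-12 are assumptions for
illustration and are not part of the library.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: element-wise comparison of two flat matrices
// with an absolute tolerance. Returns false on size mismatch or NaN.
bool approx_equal(const std::vector<double>& X,
                  const std::vector<double>& Y,
                  double tol = 1e-12)
{
    if (X.size() != Y.size())
        return false;
    for (std::size_t i = 0; i < X.size(); ++i) {
        // Written as !(diff <= tol) so that a NaN difference fails the check.
        if (!(std::fabs(X[i] - Y[i]) <= tol))
            return false;
    }
    return true;
}
```

With this helper, the first row of the input, [-4, 9, 4, 0, -3, -4, 8, 0],
compares equal to the first row of the reconstructed product even though an
exact comparison fails.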
