https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280
--- Comment #6 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
With clang, I get the following output. Clang is sadly not able to do simd on the device, but at least this is the output. Obviously, the collapse(2) clause can, and should, be attached without problems to the gpu loops of a matrix multiplication...

This demonstrates basic mathematical abilities of the library on gpu, cpu and with the message passing interface.
We can also use a more simplified interface for writing expressions, although evaluations of more than one operator are not yet supported.
define A
[[1, 2, 3], [4, 5, 6]]
define B
[[6, 5, 4], [3, 2, 1]]
addition of A and B
[[7, 7, 7], [7, 7, 7]]
multiplication of A and transpose of B
[[28, 10], [73, 28]]
Subtraction of B from A. One can also assign the type later, as in this example, but E=A-B would also work here.
But here we set a policy to do this on gpu
[[-5, -3, -1], [1, 3, 5]]
two vectors
[1, 2, 3]
[6, 5, 4]
a scalar product between two vectors
28
28
We define two matrices.
The same code base can have the strides and extents on the heap (vector) or on the stack (array).
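To make the remark about strides and extents concrete, here is a minimal sketch in plain Python. It is not the library's API: the helper names row_major_strides, col_major_strides, offset and view are hypothetical and only illustrate how one flat buffer plus extents and strides describes the 2x3 matrix A from the output above in either layout.

```python
def row_major_strides(extents):
    """Strides in elements for row-major layout (last index varies fastest)."""
    strides = [1] * len(extents)
    for d in range(len(extents) - 2, -1, -1):
        strides[d] = strides[d + 1] * extents[d + 1]
    return tuple(strides)


def col_major_strides(extents):
    """Strides in elements for column-major layout (first index varies fastest)."""
    strides = [1] * len(extents)
    for d in range(1, len(extents)):
        strides[d] = strides[d - 1] * extents[d - 1]
    return tuple(strides)


def offset(index, strides):
    """Position of a multi-index inside the flat buffer."""
    return sum(i * s for i, s in zip(index, strides))


def view(buf, extents, strides):
    """Materialise a 2D view of a flat buffer as nested lists."""
    rows, cols = extents
    return [[buf[offset((i, j), strides)] for j in range(cols)]
            for i in range(rows)]


extents = (2, 3)
row_buf = [1, 2, 3, 4, 5, 6]  # A stored row by row
col_buf = [1, 4, 2, 5, 3, 6]  # the same A stored column by column

rs = row_major_strides(extents)  # (3, 1)
cs = col_major_strides(extents)  # (1, 2)

# Both buffers describe the same logical matrix A.
print(view(row_buf, extents, rs))
print(view(col_buf, extents, cs))
```

Only the strides differ between the two layouts, so code that indexes through offset() does not need to know which one is in use.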
The library works as well with column-major data, but in this example we define row-major data.
Ordinary matrix multiplication, forced on gpu with a policy object
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1], [2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11], [11, 9, 7, 5, 3, 1, 12, 10, 8, 6, 4, 2], [3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10], [10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3], [4, 8, 12, 3, 7, 11, 2, 6, 10, 1, 5, 9], [9, 5, 1, 7, 3, 11, 8, 4, 12, 6, 2, 10], [5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 12], [12, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5], [6, 1, 8, 3, 10, 5, 12, 7, 2, 9, 4, 11], [11, 2, 9, 4, 12, 7, 3, 10, 5, 1, 8, 6]]
[[12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], [3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10], [10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3], [5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 12], [12, 9, 6, 3, 10, 7, 4, 1, 8, 5, 2, 11], [2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11], [11, 8, 5, 2, 9, 6, 3, 12, 7, 4, 1, 10], [3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10], [10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3], [4, 8, 12, 3, 7, 11, 2, 6, 10, 1, 5, 9], [9, 5, 1, 7, 3, 11, 8, 4, 12, 6, 2, 10]]
The header In_Kernel_mathfunctions executes math functions either on the host or can run them in parallel. Abbreviations: v means with simd only, s means without parallel loops.
Per default, update_host is set to true.
If one has several calculations on gpu, this may not be desired and can be switched to false.
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
The header In_Kernel_mathfunctions executes math functions either on the host or can run them in parallel. The abbreviation w means with parallel for.
Per default, update_host is set to true. If one has several calculations on gpu, this may not be desired and can be switched to false.
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
CPU_ONLY lets it multiply on CPU. GPU_ONLY executes on gpu.
AUTO lets the library decide based on whether the data is already on gpu, the algorithm, and the data size.
Supplying nullptr instead of a pointer to Math_Functions_Policy lets the library use a global default that can be configured.
Per default, update_host is set to true. If one has several calculations on gpu, this may not be desired and can be switched to false.
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
We can also use the Strassen algorithm or its Winograd variant for the multiplication.
It may offload on gpu. With the Message Passing Interface enabled, it can do so in parallel. Otherwise it offloads sequentially.
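For readers unfamiliar with the Strassen algorithm, the following is a minimal sketch in plain Python of the classic seven-product recursion with a cutoff at which it falls back to ordinary multiplication. It is an illustration only, not the library's implementation: the function names and the cutoff parameter are hypothetical, the sketch assumes square matrices, and falling back to the naive product for odd sizes is an assumption made here for simplicity. It contains no offloading or MPI.

```python
def matmul_naive(a, b):
    """Ordinary triple-loop matrix product on nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m):
    """Split an even-sized square matrix into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, cutoff=64):
    """Strassen product of two square matrices of equal size.

    Recursion stops at size <= cutoff (hypothetical parameter) or when
    the size is odd; in both cases the naive product is used.
    """
    n = len(a)
    if n <= cutoff or n % 2:
        return matmul_naive(a, b)
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    m1 = strassen(add(a11, a22), add(b11, b22), cutoff)
    m2 = strassen(add(a21, a22), b11, cutoff)
    m3 = strassen(a11, sub(b12, b22), cutoff)
    m4 = strassen(a22, sub(b21, b11), cutoff)
    m5 = strassen(add(a11, a12), b22, cutoff)
    m6 = strassen(sub(a21, a11), add(b11, b12), cutoff)
    m7 = strassen(sub(a12, a22), add(b21, b22), cutoff)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


x = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
y = [[2, 0, 1, 3], [1, 4, 0, 2], [3, 1, 2, 0], [0, 2, 3, 1]]

# cutoff=1 forces the recursion all the way down to 1x1 blocks.
print(strassen(x, y, cutoff=1) == matmul_naive(x, y))
```

A small cutoff exercises the recursion in tests, while a larger one avoids the overhead of the extra additions on small blocks.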
The algorithm can also work entirely on device with device pointers to the data.
In auto mode, the following default thresholds are set in mathfunctions.h and can be changed for convenience:
max_problem_size_for_gpu; This is the size of the gpu memory, data larger than this is not offloaded
default_cubic_treshold = 256; The default number of elements at which matrices are auto offloaded in multiplication
default_square_treshold = 1000; The default number of elements at which matrices are auto offloaded for addition
default_linear_treshold = 1000000; The default number of elements at which vectors are auto offloaded for addition
We now set it on gpu and set the size at which recursion stops to 2. Per default, this is 64.
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689], [525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637], [575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689], [491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637], [557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663], [509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663], [500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754], [587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641], [485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697], [561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617], [549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677], [572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
We create a 4x4 matrix that owns its own data buffer in a memory-mapped file, then fill the buffer and print it.
Usually, the owned data buffer is more interesting for storing the results of the computation and for intermediary evaluations.
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
Now we create a 4x4 matrix with data in a separate vector
[[2, 2, 2, 2], [2, 2, 2, 2], [2, 2, 2, 2], [2, 2, 2, 2]]
Now we make a shallow copy of the first matrix on the second
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14,
15]]
We test the shallow copy by setting the first element of the first matrix to 42 and then print the first and second matrix
[[42, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
[[42, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
Now we test more advanced algorithms
Now a Cholesky decomposition on CPU
The result is put in an mdspan_data, which allocates its own resources, with the dataset
[[210, -92, 68, -33, -34, -4, 118, -6], [-92, 318, -100, 130, -153, -64, 160, 33], [68, -100, 204, -96, 41, -69, -16, -26], [-33, 130, -96, 338, -152, -51, 12, 22], [-34, -153, 41, -152, 346, 11, -30, -25], [-4, -64, -69, -51, 11, 175, -79, 5], [118, 160, -16, 12, -30, -79, 320, 7], [-6, 33, -26, 22, -25, 5, 7, 239]]
[[14.4914, 0, 0, 0, 0, 0, 0, 0], [-6.3486, 16.6642, 0, 0, 0, 0, 0, 0], [4.69245, -4.2132, 12.8152, 0, 0, 0, 0, 0], [-2.27722, 6.9336, -4.37774, 16.2965, 0, 0, 0, 0], [-2.34622, -10.0752, 0.74604, -5.16795, 14.5506, 0, 0, 0], [-0.276026, -3.94573, -6.58037, -3.257, -2.84005, 9.86812, 0, 0], [8.14277, 12.7036, -0.0535879, -3.54515, 6.79111, -1.94966, 5.46098, 0], [-0.414039, 1.82256, -1.27804, 0.173372, -0.395814, 0.314913, -1.63587, 15.1958]]
We can verify the Cholesky decomposition by multiplication
We can create a transpose with the base class DataBlock, but also with mdspan
[[210, -92, 68, -33, -34, -4, 118, -6], [-92, 318, -100, 130, -153, -64, 160, 33], [68, -100, 204, -96, 41, -69, -16, -26], [-33, 130, -96, 338, -152, -51, 12, 22], [-34, -153, 41, -152, 346, 11, -30, -25], [-4, -64, -69, -51, 11, 175, -79, 5], [118, 160, -16, 12, -30, -79, 320, 7], [-6, 33, -26, 22, -25, 5, 7, 239]]
Now the Cholesky decomposition is entirely done on GPU
[[14.4914, 0, 0, 0, 0, 0, 0, 0], [-6.3486, 16.6642, 0, 0, 0, 0, 0, 0], [4.69245, -4.2132, 12.8152, 0, 0, 0, 0, 0], [-2.27722, 6.9336, -4.37774, 16.2965, 0, 0, 0, 0], [-2.34622, -10.0752, 0.74604, -5.16795, 14.5506, 0, 0, 0], [-0.276026, -3.94573, -6.58037, -3.257, -2.84005, 9.86812, 0, 0], [8.14277,
12.7036, -0.0535879, -3.54515, 6.79111, -1.94966, 5.46098, 0], [-0.414039, 1.82256, -1.27804, 0.173372, -0.395814, 0.314913, -1.63587, 15.1958]] we can verify the cholesky decomposition by multiplication Here we create the transpose with mdspan [[210, -92, 68, -33, -34, -4, 118, -6], [-92, 318, -100, 130, -153, -64, 160, 33], [68, -100, 204, -96, 41, -69, -16, -26], [-33, 130, -96, 338, -152, -51, 12, 22], [-34, -153, 41, -152, 346, 11, -30, -25], [-4, -64, -69, -51, 11, 175, -79, 5], [118, 160, -16, 12, -30, -79, 320, 7], [-6, 33, -26, 22, -25, 5, 7, 239]] With the advanced algorithms on GPU [[210, -92, 68, -33, -34, -4, 118, -6], [-92, 318, -100, 130, -153, -64, 160, 33], [68, -100, 204, -96, 41, -69, -16, -26], [-33, 130, -96, 338, -152, -51, 12, 22], [-34, -153, 41, -152, 346, 11, -30, -25], [-4, -64, -69, -51, 11, 175, -79, 5], [118, 160, -16, 12, -30, -79, 320, 7], [-6, 33, -26, 22, -25, 5, 7, 239]] [[14.4914, 0, 0, 0, 0, 0, 0, 0], [-6.3486, 16.6642, 0, 0, 0, 0, 0, 0], [4.69245, -4.2132, 12.8152, 0, 0, 0, 0, 0], [-2.27722, 6.9336, -4.37774, 16.2965, 0, 0, 0, 0], [-2.34622, -10.0752, 0.74604, -5.16795, 14.5506, 0, 0, 0], [-0.276026, -3.94573, -6.58037, -3.257, -2.84005, 9.86812, 0, 0], [8.14277, 12.7036, -0.0535879, -3.54515, 6.79111, -1.94966, 5.46098, 0], [-0.414039, 1.82256, -1.27804, 0.173372, -0.395814, 0.314913, -1.63587, 15.1958]] we can verify the cholesky decomposition by multiplication [[210, -92, 68, -33, -34, -4, 118, -6], [-92, 318, -100, 130, -153, -64, 160, 33], [68, -100, 204, -96, 41, -69, -16, -26], [-33, 130, -96, 338, -152, -51, 12, 22], [-34, -153, 41, -152, 346, 11, -30, -25], [-4, -64, -69, -51, 11, 175, -79, 5], [118, 160, -16, 12, -30, -79, 320, 7], [-6, 33, -26, 22, -25, 5, 7, 239]] Now we do the same with the lu decomposition of [[-3, 3, -3, 5, 2, 7, 4, 2], [-2, 4, 2, -10, -4, -2, -10, 1], [-3, 0, 8, 6, -3, -8, -8, -10], [-6, -1, -4, -2, -4, -2, -3, 1], [-9, -10, 5, -6, -8, 1, -3, -8], [-10, -8, -6, 4, 3, -8, -10, -6], [3, -4, -2, 4, 
4, -1, 2, 8], [-4, 6, 9, -7, -6, -4, 2, 4]] on CPU [[1, 0, 0, 0, 0, 0, 0, 0], [0.666667, 1, 0, 0, 0, 0, 0, 0], [1, -1.5, 1, 0, 0, 0, 0, 0], [2, -3.5, 0.941176, 1, 0, 0, 0, 0], [3, -9.5, 3.05882, 2.19567, 1, 0, 0, 0], [3.33333, -9, 2.35294, 2.15673, 1.48073, 1, 0, 0], [-1, -0.5, -0.176471, 0.025, 0.206349, 0.178946, 1, 0], [1.33333, 1, 0.529412, -0.238462, 0.015873, -0.0594817, -6.32693, 1]] [[-3, 3, -3, 5, 2, 7, 4, 2], [0, 2, 4, -13.3333, -5.33333, -6.66667, -12.6667, -0.333333], [0, 0, 17, -19, -13, -25, -31, -12.5], [0, 0, 0, -40.7843, -14.4314, -15.8039, -26.1569, 7.59804], [0, 0, 0, 0, 6.78462, 27.8375, 16.9221, 4.38582], [0, 0, 0, 0, 0, -39.6447, -33.0359, -9.13602], [0, 0, 0, 0, 0, 0, -2.73024, 8.16734], [0, 0, 0, 0, 0, 0, 0, 61.1573]] we can verify the lu decomposition by multiplication [[-3, 3, -3, 5, 2, 7, 4, 2], [-2, 4, 2, -10, -4, -2, -10, 1], [-3, 0, 8, 6, -3, -8, -8, -10], [-6, -1, -4, -2, -4, -2, -3, 1], [-9, -10, 5, -6, -8, 1, -3, -8], [-10, -8, -6, 4, 3, -8, -10, -6], [3, -4, -2, 4, 4, -1, 2, 8], [-4, 6, 9, -7, -6, -4, 2, 4]] Entirely on gpu [[1, 0, 0, 0, 0, 0, 0, 0], [0.666667, 1, 0, 0, 0, 0, 0, 0], [1, -1.5, 1, 0, 0, 0, 0, 0], [2, -3.5, 0.941176, 1, 0, 0, 0, 0], [3, -9.5, 3.05882, 2.19567, 1, 0, 0, 0], [3.33333, -9, 2.35294, 2.15673, 1.48073, 1, 0, 0], [-1, -0.5, -0.176471, 0.025, 0.206349, 0.178946, 1, 0], [1.33333, 1, 0.529412, -0.238462, 0.015873, -0.0594817, -6.32693, 1]] [[-3, 3, -3, 5, 2, 7, 4, 2], [0, 2, 4, -13.3333, -5.33333, -6.66667, -12.6667, -0.333333], [0, 0, 17, -19, -13, -25, -31, -12.5], [0, 0, 0, -40.7843, -14.4314, -15.8039, -26.1569, 7.59804], [0, 0, 0, 0, 6.78462, 27.8375, 16.9221, 4.38582], [0, 0, 0, 0, 0, -39.6447, -33.0359, -9.13602], [0, 0, 0, 0, 0, 0, -2.73024, 8.16734], [0, 0, 0, 0, 0, 0, 0, 61.1573]] we can verify the lu decomposition by multiplication [[-3, 3, -3, 5, 2, 7, 4, 2], [-2, 4, 2, -10, -4, -2, -10, 1], [-3, 0, 8, 6, -3, -8, -8, -10], [-6, -1, -4, -2, -4, -2, -3, 1], [-9, -10, 5, -6, -8, 1, -3, -8], [-10, -8, 
-6, 4, 3, -8, -10, -6], [3, -4, -2, 4, 4, -1, 2, 8], [-4, 6, 9, -7, -6, -4, 2, 4]] With the advanced algorithms on GPU [[1, 0, 0, 0, 0, 0, 0, 0], [0.666667, 1, 0, 0, 0, 0, 0, 0], [1, -1.5, 1, 0, 0, 0, 0, 0], [2, -3.5, 0.941176, 1, 0, 0, 0, 0], [3, -9.5, 3.05882, 2.19567, 1, 0, 0, 0], [3.33333, -9, 2.35294, 2.15673, 1.48073, 1, 0, 0], [-1, -0.5, -0.176471, 0.025, 0.206349, 0.178946, 1, 0], [1.33333, 1, 0.529412, -0.238462, 0.015873, -0.0594817, -6.32693, 1]] we can verify the lu decomposition by multiplication [[-3, 3, -3, 5, 2, 7, 4, 2], [-2, 4, 2, -10, -4, -2, -10, 1], [-3, 0, 8, 6, -3, -8, -8, -10], [-6, -1, -4, -2, -4, -2, -3, 1], [-9, -10, 5, -6, -8, 1, -3, -8], [-10, -8, -6, 4, 3, -8, -10, -6], [3, -4, -2, 4, 4, -1, 2, 8], [-4, 6, 9, -7, -6, -4, 2, 4]] Now we do the same with the qr decomposition [[-4, 9, 4, 0, -3, -4, 8, 0], [0, -7, -3, -8, -9, 1, -5, -9], [-10, 1, 1, 6, -1, 5, 4, 4], [8, 1, 9, -8, -6, 8, -4, -2], [-4, 7, -7, 3, 7, -2, -9, 9], [4, -4, 1, -3, 4, -8, 3, 6], [-7, 7, -3, -7, -9, -5, -1, -7], [7, 1, -9, -1, -7, 3, 5, 4]] On cpu [[-0.227185, 0.526694, 0.290362, -0.0520073, -0.176765, -0.456512, -0.0592619, 0.583785], [0, -0.498224, -0.196168, -0.61069, -0.117755, 0.128255, -0.106856, 0.546457], [-0.567962, -0.213524, 0.137142, 0.194743, -0.395699, 0.23837, 0.599478, 0.0481964], [0.454369, 0.298934, 0.520144, -0.313417, -0.0917011, 0.511886, 0.254197, 0.0188181], [-0.227185, 0.384344, -0.41661, 0.0823878, 0.528548, 0.471375, 0.125855, 0.320809], [0.227185, -0.17082, 0.0331166, -0.118499, 0.476134, -0.4619, 0.677855, 0.0672778], [-0.397573, 0.298934, -0.138174, -0.682534, -0.00122032, -0.11488, 0.0681924, -0.49978], [0.397573, 0.270464, -0.627772, 0.0389353, -0.53276, -0.0869285, 0.284785, 0.0260316]] [[17.6068, -7.04273, 2.04466, -6.0204, -1.36311, 3.52136, -0.795147, 0.511166], [-8.88178e-16, 14.0499, -0.11388, -0.384344, -1.25268, -1.36656, 1.73667, 4.45554], [-8.88178e-16, -1.11022e-16, 15.5822, -1.52315, 0.490587, 2.86508, 2.61988, -3.82087], 
[-3.88578e-16, -1.06859e-15, -1.72085e-15, 13.9028, 13.311, 2.37641, 4.45025, 11.866], [-8.88178e-16, 1.77636e-15, 8.88178e-16, 1.88738e-15, 11.8807, -8.58115, -8.03244, 5.15164], [-1.11022e-15, 1.9984e-15, 8.88178e-16, 1.77636e-15, 2.22045e-15, 10.3073, -11.3353, 0.702835], [-6.21725e-15, 1.19349e-14, 8.43769e-15, 1.11577e-14, 4.66294e-15, 1.14353e-14, 3.69791, 8.71283], [2.77001e-14, -5.00294e-14, -4.15779e-14, -4.54498e-14, -2.14828e-14, -4.61714e-14, -4.66294e-15, 2.13057]] we can verify the qr decomposition by multiplication [[-4, 9, 4, -2.83387e-14, -3, -4, 8, -1.86517e-14], [1.66175e-14, -7, -3, -8, -9, 1, -5, -9], [-10, 1, 1, 6, -1, 5, 4, 4], [8, 1, 9, -8, -6, 8, -4, -2], [-4, 7, -7, 3, 7, -2, -9, 9], [4, -4, 1, -3, 4, -8, 3, 6], [-7, 7, -3, -7, -9, -5, -1, -7], [7, 1, -9, -1, -7, 3, 5, 4]] On gpu [[-0.227185, 0.526694, 0.290362, -0.0520073, -0.176765, -0.456512, -0.0592619, 0.583785], [0, -0.498224, -0.196168, -0.61069, -0.117755, 0.128255, -0.106856, 0.546457], [-0.567962, -0.213524, 0.137142, 0.194743, -0.395699, 0.23837, 0.599478, 0.0481964], [0.454369, 0.298934, 0.520144, -0.313417, -0.0917011, 0.511886, 0.254197, 0.0188181], [-0.227185, 0.384344, -0.41661, 0.0823878, 0.528548, 0.471375, 0.125855, 0.320809], [0.227185, -0.17082, 0.0331166, -0.118499, 0.476134, -0.4619, 0.677855, 0.0672778], [-0.397573, 0.298934, -0.138174, -0.682534, -0.00122032, -0.11488, 0.0681924, -0.49978], [0.397573, 0.270464, -0.627772, 0.0389353, -0.53276, -0.0869285, 0.284785, 0.0260316]] [[17.6068, -7.04273, 2.04466, -6.0204, -1.36311, 3.52136, -0.795147, 0.511166], [-8.88178e-16, 14.0499, -0.11388, -0.384344, -1.25268, -1.36656, 1.73667, 4.45554], [-8.88178e-16, -1.11022e-16, 15.5822, -1.52315, 0.490587, 2.86508, 2.61988, -3.82087], [-3.88578e-16, -1.06859e-15, -1.72085e-15, 13.9028, 13.311, 2.37641, 4.45025, 11.866], [-8.88178e-16, 1.77636e-15, 8.88178e-16, 1.88738e-15, 11.8807, -8.58115, -8.03244, 5.15164], [-1.11022e-15, 1.9984e-15, 8.88178e-16, 1.77636e-15, 2.22045e-15, 
10.3073, -11.3353, 0.702835], [-6.21725e-15, 1.19349e-14, 8.43769e-15, 1.11577e-14, 4.66294e-15, 1.14353e-14, 3.69791, 8.71283], [2.77001e-14, -5.00294e-14, -4.15779e-14, -4.54498e-14, -2.14828e-14, -4.61714e-14, -4.66294e-15, 2.13057]] we can verify the qr decomposition by multiplication [[-4, 9, 4, -2.83387e-14, -3, -4, 8, -1.86517e-14], [1.66175e-14, -7, -3, -8, -9, 1, -5, -9], [-10, 1, 1, 6, -1, 5, 4, 4], [8, 1, 9, -8, -6, 8, -4, -2], [-4, 7, -7, 3, 7, -2, -9, 9], [4, -4, 1, -3, 4, -8, 3, 6], [-7, 7, -3, -7, -9, -5, -1, -7], [7, 1, -9, -1, -7, 3, 5, 4]] with the advanced algorithms on gpu [[-0.227185, 0.526694, 0.290362, -0.0520073, -0.176765, -0.456512, -0.0592619, 0.583785], [0, -0.498224, -0.196168, -0.61069, -0.117755, 0.128255, -0.106856, 0.546457], [-0.567962, -0.213524, 0.137142, 0.194743, -0.395699, 0.23837, 0.599478, 0.0481964], [0.454369, 0.298934, 0.520144, -0.313417, -0.0917011, 0.511886, 0.254197, 0.0188181], [-0.227185, 0.384344, -0.41661, 0.0823878, 0.528548, 0.471375, 0.125855, 0.320809], [0.227185, -0.17082, 0.0331166, -0.118499, 0.476134, -0.4619, 0.677855, 0.0672778], [-0.397573, 0.298934, -0.138174, -0.682534, -0.00122032, -0.11488, 0.0681924, -0.49978], [0.397573, 0.270464, -0.627772, 0.0389353, -0.53276, -0.0869285, 0.284785, 0.0260316]] [[17.6068, -7.04273, 2.04466, -6.0204, -1.36311, 3.52136, -0.795147, 0.511166], [1.11022e-16, 14.0499, -0.11388, -0.384344, -1.25268, -1.36656, 1.73667, 4.45554], [-1.55431e-15, 1.11022e-16, 15.5822, -1.52315, 0.490587, 2.86508, 2.61988, -3.82087], [-2.05391e-15, 1.60982e-15, -2.05391e-15, 13.9028, 13.311, 2.37641, 4.45025, 11.866], [1.66533e-15, -1.77636e-15, 3.33067e-16, -3.21965e-15, 11.8807, -8.58115, -8.03244, 5.15164], [4.08007e-15, -2.31759e-15, 1.97065e-15, -4.34375e-15, -1.85962e-15, 10.3073, -11.3353, 0.702835], [1.83742e-14, -1.39888e-14, 4.16334e-15, -2.16493e-14, -1.19349e-14, -5.4956e-15, 3.69791, 8.71283], [-7.1762e-14, 5.78843e-14, -1.53627e-14, 8.20038e-14, 4.60049e-14, 2.19963e-14, 
6.09235e-15, 2.13057]] we can verify the qr decomposition by multiplication [[-4, 9, 4, 5.15967e-14, -3, -4, 8, 3.37508e-14], [-3.93472e-14, -7, -3, -8, -9, 1, -5, -9], [-10, 1, 1, 6, -1, 5, 4, 4], [8, 1, 9, -8, -6, 8, -4, -2], [-4, 7, -7, 3, 7, -2, -9, 9], [4, -4, 1, -3, 4, -8, 3, 6], [-7, 7, -3, -7, -9, -5, -1, -7], [7, 1, -9, -1, -7, 3, 5, 4]]
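
The "verify by multiplication" steps in the output above can be reproduced independently. The following is a minimal sketch in plain Python, not the library's code: it runs a textbook Cholesky factorisation on the 8x8 dataset printed above, multiplies L by its transpose, and compares the product with the input using a tolerance. The tolerance matters because, as the QR check shows (entries such as -2.83387e-14 where the input has 0), the reconstructed matrix only matches up to floating-point rounding. The function names and the tolerance value are assumptions chosen for illustration.

```python
import math

# The symmetric 8x8 dataset from the Cholesky part of the output.
A = [
    [210, -92, 68, -33, -34, -4, 118, -6],
    [-92, 318, -100, 130, -153, -64, 160, 33],
    [68, -100, 204, -96, 41, -69, -16, -26],
    [-33, 130, -96, 338, -152, -51, 12, 22],
    [-34, -153, 41, -152, 346, 11, -30, -25],
    [-4, -64, -69, -51, 11, 175, -79, 5],
    [118, 160, -16, 12, -30, -79, 320, 7],
    [-6, 33, -26, 22, -25, 5, 7, 239],
]


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def times_transpose(l):
    """Compute L * L^T for a square matrix L."""
    n = len(l)
    return [[sum(l[i][k] * l[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


L = cholesky(A)
R = times_transpose(L)

# Largest absolute deviation between the reconstruction and the input.
max_err = max(abs(R[i][j] - A[i][j]) for i in range(8) for j in range(8))
print("max reconstruction error:", max_err)
print("matches within 1e-9:", max_err < 1e-9)
```

The same pattern (factorise, multiply back, compare within a tolerance) applies to the LU and QR results; exact equality would reject the QR reconstruction even though it is correct to rounding.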
