https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120865
--- Comment #7 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Without -O, the program now produces the correct output, shown below.

The internal compiler error occurs at all optimization levels: -O1, -O2, -O3.

Interestingly, -ffast-math works and leads to a considerable speed-up. Just using -fno-math-errno -fno-trapping-math leads to a speed-up as well.

I find it interesting that even with -fno-signed-zeros, a -0 still appears in the output of the LU decomposition.

Ordinary matrix multiplication, on gpu
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16

0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

80 90 100 110
176 202 228 254
272 314 356 398
368 426 484 542

A Cholesky decomposition with the multiplication on gpu
4 12 -16
12 37 -43
-16 -43 98

2 0 0
6 1 0
-8 5 3

Now the cholesky decomposition is entirely done on gpu
2 0 0
6 1 0
-8 5 3

Now we do the same with the lu decomposition
1 -2 -2 -3
3 -9 0 -9
-1 2 4 7
-3 -6 26 2

Just the multiplication on gpu
1 0 0 0
3 1 0 0
-1 -0 1 0
-3 4 -2 1

1 -2 -2 -3
0 -3 6 0
0 0 2 4
0 0 0 1

Entirely on gpu
1 0 0 0
3 1 0 0
-1 -0 1 0
-3 4 -2 1

1 -2 -2 -3
0 -3 6 0
0 0 2 4
0 0 0 1

Now we do the same with the qr decomposition
12 -51 4
6 167 -68
-4 24 -41

Just the multiplication on gpu
0.857143 -0.394286 -0.331429
0.428571 0.902857 0.0342857
-0.285714 0.171429 -0.942857

14 21 -14
-1.11022e-16 175 -70
-3.10862e-15 -4.79616e-14 35

Entirely on gpu
0.857143 -0.394286 -0.331429
0.428571 0.902857 0.0342857
-0.285714 0.171429 -0.942857

14 21 -14
-1.11022e-16 175 -70
-1.33227e-15 -1.95399e-14 35

In order to test the qr decomposition, we can use Strassen's algorithm
12 -51 4
6 167 -68
-4 24 -41

or its Winograd variant, with the smaller matrices computed on gpu
12 -51 4
6 167 -68
-4 24 -41
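
On the -0 in the LU output: the following is only a sketch to show where it plausibly comes from, not the code from this report. It assumes a textbook Doolittle elimination without pivoting and default IEEE 754 semantics; the names Matrix, LU, lu_doolittle and example_matrix are made up for the illustration. For the 4x4 matrix above, the entry below the second pivot is already 0 after the first elimination step, so the multiplier is 0 / -3, which is -0 under IEEE 754. As far as I read the GCC documentation, -fno-signed-zeros only allows optimizations that ignore the sign of zero; it does not promise that no -0 is ever produced or printed.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t N = 4;
using Matrix = std::array<std::array<double, N>, N>;

// Hypothetical result type for this illustration.
struct LU {
    Matrix lower;  // unit lower-triangular factor
    Matrix upper;  // upper-triangular factor
};

// Textbook Doolittle elimination without pivoting.
// Assumption: this is NOT the implementation from the bug report.
LU lu_doolittle(const Matrix& a) {
    LU f{};  // value-initialised, every entry starts as +0.0
    f.upper = a;
    for (std::size_t i = 0; i < N; ++i) {
        f.lower[i][i] = 1.0;
    }
    for (std::size_t k = 0; k < N; ++k) {
        assert(f.upper[k][k] != 0.0);  // no pivoting, so a zero pivot is fatal
        for (std::size_t i = k + 1; i < N; ++i) {
            // For k = 1, i = 2 this is 0.0 / -3.0, which is -0.0.
            const double m = f.upper[i][k] / f.upper[k][k];
            f.lower[i][k] = m;
            for (std::size_t j = k; j < N; ++j) {
                f.upper[i][j] -= m * f.upper[k][j];
            }
        }
    }
    return f;
}

// The 4x4 input matrix from the LU section of the output above.
Matrix example_matrix() {
    return {{{1.0, -2.0, -2.0, -3.0},
             {3.0, -9.0, 0.0, -9.0},
             {-1.0, 2.0, 4.0, 7.0},
             {-3.0, -6.0, 26.0, 2.0}}};
}
```

Calling lu_doolittle(example_matrix()) and testing std::signbit on lower[2][1] shows the negative sign, even though the value compares equal to 0.0. If the -0 is only a cosmetic problem, adding 0.0 to a value before printing turns -0 into +0 in the default rounding mode, provided the compiler does not fold the addition away (which it is allowed to do under -fno-signed-zeros).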