https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120865

--- Comment #7 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Without -O, the program now produces the correct output shown below.

The internal compiler error occurs at every optimization level: -O1, -O2, -O3.

Interestingly, -ffast-math works and gives a considerable speed-up.
Using only -fno-math-errno -fno-trapping-math gives a speed-up as well.

I find it interesting that even with -fno-signed-zeros, -0 still appears
in the output of the LU decomposition.
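
A possible explanation, as far as I understand the option: -fno-signed-zeros only allows the compiler to ignore the sign of zero when it optimizes. It does not promise that a negative zero never reaches the output. The sketch below assumes the L factor is built by plain elimination without pivoting, so that the entry in row 3, column 2 is (2 - (-1)*(-2)) / (-3) = 0 / (-3). The function names are hypothetical and not taken from the library under test.

```cpp
#include <cmath>
#include <sstream>
#include <string>

// One elimination multiplier: (a - l * u) / pivot.
// Hypothetical helper, written only to reproduce the arithmetic by hand.
double lu_multiplier(double a, double l, double u, double pivot) {
    return (a - l * u) / pivot;
}

// Format a value the way a default-configured output stream prints it.
std::string show(double x) {
    std::ostringstream os;
    os << x;
    return os.str();
}

// True if x is a zero whose sign bit is set.
bool is_negative_zero(double x) {
    return x == 0.0 && std::signbit(x);
}
```

Calling lu_multiplier(2.0, -1.0, -2.0, -3.0) divides +0 by -3, which under IEEE 754 rules is a zero with the sign bit set, and a stream prints that as -0. The value still compares equal to 0, so the factorization itself is unaffected.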


Ordinary matrix multiplication, on gpu
1 2 3 4 
5 6 7 8 
9 10 11 12 
13 14 15 16 

0 1 2 3 
4 5 6 7 
8 9 10 11 
12 13 14 15 

80 90 100 110 
176 202 228 254 
272 314 356 398 
368 426 484 542 

A Cholesky decomposition with the multiplication on gpu
4 12 -16 
12 37 -43 
-16 -43 98 

2 0 0 
6 1 0 
-8 5 3 

Now the cholesky decomposition is entirely done on gpu
2 0 0 
6 1 0 
-8 5 3 

Now we do the same with the lu decomposition
1 -2 -2 -3 
3 -9 0 -9 
-1 2 4 7 
-3 -6 26 2 

Just the multiplication on gpu
1 0 0 0 
3 1 0 0 
-1 -0 1 0 
-3 4 -2 1 

1 -2 -2 -3 
0 -3 6 0 
0 0 2 4 
0 0 0 1 

Entirely on gpu
1 0 0 0 
3 1 0 0 
-1 -0 1 0 
-3 4 -2 1 

1 -2 -2 -3 
0 -3 6 0 
0 0 2 4 
0 0 0 1 

Now we do the same with the qr decomposition
12 -51 4 
6 167 -68 
-4 24 -41 

Just the multiplication on gpu
0.857143 -0.394286 -0.331429 
0.428571 0.902857 0.0342857 
-0.285714 0.171429 -0.942857 

14 21 -14 
-1.11022e-16 175 -70 
-3.10862e-15 -4.79616e-14 35 

Entirely on gpu
0.857143 -0.394286 -0.331429 
0.428571 0.902857 0.0342857 
-0.285714 0.171429 -0.942857 

14 21 -14 
-1.11022e-16 175 -70 
-1.33227e-15 -1.95399e-14 35 

In order to test the qr decomposition, we can use Strassen's algorithm
12 -51 4 
6 167 -68 
-4 24 -41 

or its Winograd variant, with the smaller matrices computed on gpu
12 -51 4 
6 167 -68 
-4 24 -41
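
The same check can be repeated by hand from the printed values. The sketch below is not the Strassen or Winograd code from the test program; it uses an ordinary triple loop on the host, and all names in it are hypothetical. Q is copied from the six-digit output above, R is taken with its sub-diagonal entries set to zero (an assumption, since the printed values there are at the 1e-14 level or below), so the product can only match the input to roughly 1e-4.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using Mat3 = std::array<std::array<double, 3>, 3>;

// Q as printed above (rounded to six significant digits).
const Mat3 kQ = {{ {{0.857143, -0.394286, -0.331429}},
                   {{0.428571, 0.902857, 0.0342857}},
                   {{-0.285714, 0.171429, -0.942857}} }};

// R as printed above, with the tiny sub-diagonal entries replaced by 0.
const Mat3 kR = {{ {{14.0, 21.0, -14.0}},
                   {{0.0, 175.0, -70.0}},
                   {{0.0, 0.0, 35.0}} }};

// The input matrix of the QR example.
const Mat3 kA = {{ {{12.0, -51.0, 4.0}},
                   {{6.0, 167.0, -68.0}},
                   {{-4.0, 24.0, -41.0}} }};

// Ordinary O(n^3) matrix product.
Mat3 multiply(const Mat3& a, const Mat3& b) {
    Mat3 c{};  // value-initialized to all zeros
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            for (std::size_t k = 0; k < 3; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Largest absolute element-wise difference between two matrices.
double max_abs_diff(const Mat3& a, const Mat3& b) {
    double worst = 0.0;
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            worst = std::fmax(worst, std::fabs(a[i][j] - b[i][j]));
    return worst;
}
```

With these values, max_abs_diff(multiply(kQ, kR), kA) stays below 1e-3; the remaining difference comes from the rounding of the printed Q, not from the decomposition.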
