------- Comment #4 from bonzini at gnu dot org  2006-08-11 10:22 -------
Except that PPC uses 12 registers: f0, f6, f7, f8, f9, f10, f11, f12, f13,
f29, f30, f31.  Not that we can blame GCC for using 12, but it is not a fair
comparison. :-)

In fact, 8 registers are enough, but it is quite tricky to get down to that.
The problem is that v3[xyz] is live across multiple BBs, which makes the
register allocator's task quite a bit harder.  Even if we change v3[xyz] in
the printf to v2[xyz], cfg-cleanup (between vrp1 and dce2) substitutes v3[xyz]
back and, in doing so, extends its lifetime again.
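
For reference, here is roughly the shape of testcase I have in mind; the loop
and the variable names are my reconstruction from the copy sequence further
down, not necessarily the actual PR source:

#include <stdio.h>

int main (void)
{
  float v1x = 1, v1y = 2, v1z = 3;
  float v2x = 4, v2y = 5, v2z = 6;
  float v3x = 0, v3y = 0, v3z = 0;
  int i;

  for (i = 0; i < 1000; i++)
    {
      /* The cross product under discussion.  */
      v3x = v1y * v2z - v1z * v2y;
      v3y = v1z * v2x - v1x * v2z;
      v3z = v1x * v2y - v1y * v2x;

      /* v1 = v2; v2 = v3 -- this is what the final moves implement.  */
      v1x = v2x; v1y = v2y; v1z = v2z;
      v2x = v3x; v2y = v3y; v2z = v3z;
    }

  /* v3[xyz] is still live here, across basic-block boundaries; changing
     these uses to v2[xyz] does not help because cfg-cleanup substitutes
     v3[xyz] back.  */
  printf ("%f %f %f\n", v3x, v3y, v3z);
  return 0;
}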

(Since it's all about having short lifetimes, CCing [EMAIL PROTECTED])

BTW, here is the optimal code (if it works...):

ENTER basic block: v1[xyz], v2[xyz] are live (6 registers)

      v3x = v1y * v2z - v1z * v2y;

v3x is now live, and computing this statement takes 2 temporary registers, so
here we hit a maximum of 8 live registers.  After the statement, 7 registers
are live (the 6 inputs plus v3x).

      v3y = v1z * v2x - v1x * v2z;

v1z dies here, so we need only one additional register for this statement and
again hit a maximum of 8 live registers.  At the end of the statement, 7
registers are live (7 - 1 for v1z, which dies, + 1 for v3y).

      v3z = v1x * v2y - v1y * v2x;

Likewise, v1x and v1y die, so we need at most 7 registers here and, at the end
of the statement, only 6 registers are live.
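
To make the counting concrete, here is the same sequence again, annotated with
the live set at each point (using the scalar names from the sketch above;
"temps" are the scratch registers needed inside the statement):

/* live: v1x v1y v1z v2x v2y v2z                      -> 6 live */
v3x = v1y * v2z - v1z * v2y;   /* needs 2 temps       -> peak 8 */
/* live: v1x v1y v1z v2x v2y v2z v3x                  -> 7 live */
v3y = v1z * v2x - v1x * v2z;   /* 1 temp, v1z dies    -> peak 8 */
/* live: v1x v1y v2x v2y v2z v3x v3y                  -> 7 live */
v3z = v1x * v2y - v1y * v2x;   /* v1x and v1y die     -> peak 7 */
/* live: v2x v2y v2z v3x v3y v3z                      -> 6 live */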

Optimal code would look like this (%xmm0..2 = v1[xyz], %xmm3..5 = v2[xyz]):

v3x = v1y * v2z - v1z * v2y
      movss %xmm1, %xmm6
      mulss %xmm5, %xmm6 ;; v1y * v2z in %xmm6
      movss %xmm2, %xmm7
      mulss %xmm4, %xmm7 ;; v1z * v2y in %xmm7
      subss %xmm7, %xmm6 ;; v3x in %xmm6

v3y = v1z * v2x - v1x * v2z
      mulss %xmm3, %xmm2 ;; v1z dies, v1z * v2x in %xmm2
      movss %xmm0, %xmm7
      mulss %xmm5, %xmm7 ;; v1x * v2z in %xmm7
      subss %xmm7, %xmm2 ;; v3y in %xmm2

v3z = v1x * v2y - v1y * v2x
      mulss %xmm4, %xmm0 ;; v1x dies, v1x * v2y in %xmm0
      mulss %xmm3, %xmm1 ;; v1y dies, v1y * v2x in %xmm1
      subss %xmm1, %xmm0 ;; v3z in %xmm0

Note now how we should reorder the final moves to obtain optimal code!

      movss %xmm0, %xmm7 ;; save v3z... alternatively, do it before the subss

      movss %xmm3, %xmm0 ;; v1x = v2x
      movss %xmm6, %xmm3 ;; v2x = v3x (in %xmm6)
      movss %xmm4, %xmm1 ;; v1y = v2y
      movss %xmm2, %xmm4 ;; v2y = v3y (in %xmm2)
      movss %xmm5, %xmm2 ;; v1z = v2z
      movss %xmm7, %xmm5 ;; v2z = v3z (saved in %xmm7)

(Note that doing the reordering manually does not help...) :-(  Out of
curiosity, can somebody check out yara-branch to see how it fares?
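
For what it's worth, the move sequence above is just the end-of-iteration copy
(v1 = v2; v2 = v3) from the sketch near the top; a rough source-level mapping
follows.  The scratch is needed only because v3z was computed into v1x's
register (%xmm0), which must receive v2x:

float tmp = v3z;   /* movss %xmm0, %xmm7  (park v3z, which sits in %xmm0) */
v1x = v2x;         /* movss %xmm3, %xmm0 */
v2x = v3x;         /* movss %xmm6, %xmm3 */
v1y = v2y;         /* movss %xmm4, %xmm1 */
v2y = v3y;         /* movss %xmm2, %xmm4 */
v1z = v2z;         /* movss %xmm5, %xmm2 */
v2z = tmp;         /* movss %xmm7, %xmm5 */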


---

By comparison, the x87 is relatively easy, because we never have more than 8
values live at once and fxch makes it much easier to write the compensation
code:

v3x = v1y * v2z - v1z * v2y
                            ;; v1x v1y v1z v2x v2y v2z
       fld %st(1)           ;; v1y v1x v1y v1z v2x v2y v2z
       fmul %st(6), %st(0)  ;; v1y*v2z v1x v1y v1z v2x v2y v2z
       fld %st(3)           ;; v1z v1y*v2z v1x v1y v1z v2x v2y v2z
       fmul %st(6), %st(0)  ;; v1z*v2y v1y*v2z v1x v1y v1z v2x v2y v2z
       fsubp %st(0), %st(1) ;; v3x v1x v1y v1z v2x v2y v2z

v3y = v1z * v2x - v1x * v2z
       fld %st(4)           ;; v2x v3x v1x v1y v1z v2x v2y v2z
       fmulp %st(0), %st(4) ;; v3x v1x v1y v1z*v2x v2x v2y v2z
       fld %st(1)           ;; v1x v3x v1x v1y v1z*v2x v2x v2y v2z
       fmul %st(7), %st(0)  ;; v1x*v2z v3x v1x v1y v1z*v2x v2x v2y v2z
       fsubp %st(0), %st(4) ;; v3x v1x v1y v3y v2x v2y v2z

v3z = v1x * v2y - v1y * v2x
       fld %st(5)           ;; v2y v3x v1x v1y v3y v2x v2y v2z
       fmulp %st(0), %st(2) ;; v3x v1x*v2y v1y v3y v2x v2y v2z
       fld %st(4)           ;; v2x v3x v1x*v2y v1y v3y v2x v2y v2z
       fmul %st(3), %st(0)  ;; v1y*v2x v3x v1x*v2y v1y v3y v2x v2y v2z
       fsubp %st(0), %st(2) ;; v3x v3z v1y v3y v2x v2y v2z
       fstp %st(2)          ;; v3z v3x v3y v2x v2y v2z

       fxch %st(5)          ;; v2z v3x v3y v2x v2y v3z
       fxch %st(2)          ;; v3y v3x v2z v2x v2y v3z
       fxch %st(4)          ;; v2y v3x v2z v2x v3y v3z
       fxch %st(1)          ;; v3x v2y v2z v2x v3y v3z
       fxch %st(3)          ;; v2x v2y v2z v3x v3y v3z

(well, the fxch instructions should be scheduled, but it is still possible to
do it without spilling).

Paolo


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amacleod at redhat dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780
