------- Comment #4 from bonzini at gnu dot org  2006-08-11 10:22 -------
Except that PPC uses 12 registers: f0, f6, f7, f8, f9, f10, f11, f12, f13,
f29, f30, f31.  Not that we can blame GCC for using 12, but it is not a fair
comparison. :-)
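(For reference, the computation in question is a 3-D cross product,
v3 = v1 x v2.  The following C is only a sketch reconstructed from the
operand names used in this comment, not the actual testcase attached to the
PR, but it shows the three cross-product statements and the final printf
that the walkthrough below steps through.)

#include <stdio.h>

/* Sketch of the assumed testcase shape: compute v3 = v1 x v2
   componentwise in single precision, then print v3, so that
   v3[xyz] stays live up to the printf call.  */
int
main (void)
{
  float v1x, v1y, v1z, v2x, v2y, v2z;
  float v3x, v3y, v3z;

  /* Read the inputs so the computation cannot be constant-folded.  */
  if (scanf ("%f %f %f %f %f %f", &v1x, &v1y, &v1z, &v2x, &v2y, &v2z) != 6)
    return 1;

  v3x = v1y * v2z - v1z * v2y;
  v3y = v1z * v2x - v1x * v2z;
  v3z = v1x * v2y - v1y * v2x;

  printf ("%f %f %f\n", v3x, v3y, v3z);
  return 0;
}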
In fact, 8 registers are enough, but it is quite tricky to obtain them.  The
problem is that v3[xyz] is live across multiple BBs, which makes the register
allocator's task quite a bit harder.  Even if we change v3[xyz] in the printf
to v2[xyz], cfg-cleanup (between vrp1 and dce2) replaces it and, in doing so,
extends the lifetime of v3[xyz].  (Since it's all about having short
lifetimes, CCing [EMAIL PROTECTED].)

BTW, here is the optimal code (if it works...):

ENTER basic block: v1[xyz] and v2[xyz] are live (6 registers).

v3x = v1y * v2z - v1z * v2y;
    v3x becomes live, and it takes 2 extra registers to compute this
    statement, so we hit a maximum of 8 live registers.  After the statement,
    7 registers are live.

v3y = v1z * v2x - v1x * v2z;
    v1z dies here, so we need only one additional register for this statement
    and again hit a maximum of 8 live registers.  At the end of the statement,
    7 registers are live (7 - 1 for v1z, which dies, + 1 for v3y).

v3z = v1x * v2y - v1y * v2x;
    Likewise, v1x and v1y die, so we need 7 registers and, at the end of the
    statement, 6 registers are live.

Optimal code would look like this (%xmm0..2 = v1[xyz], %xmm3..5 = v2[xyz]):

v3x = v1y * v2z - v1z * v2y
        movss %xmm1, %xmm6
        mulss %xmm5, %xmm6      ;; v1y * v2z in %xmm6
        movss %xmm2, %xmm7
        mulss %xmm4, %xmm7      ;; v1z * v2y in %xmm7
        subss %xmm7, %xmm6      ;; v3x in %xmm6

v3y = v1z * v2x - v1x * v2z
        mulss %xmm3, %xmm2      ;; v1z dies, v1z * v2x in %xmm2
        movss %xmm0, %xmm7
        mulss %xmm5, %xmm7      ;; v1x * v2z in %xmm7
        subss %xmm7, %xmm2      ;; v3y in %xmm2

v3z = v1x * v2y - v1y * v2x
        mulss %xmm4, %xmm0      ;; v1x dies, v1x * v2y in %xmm0
        mulss %xmm3, %xmm1      ;; v1y dies, v1y * v2x in %xmm1
        subss %xmm1, %xmm0      ;; v3z in %xmm0

Note now how we should reorder the final moves to obtain optimal code!

        movss %xmm0, %xmm7      ;; save v3z... alternatively, do it before the subss
        movss %xmm3, %xmm0      ;; v1x = v2x
        movss %xmm6, %xmm3      ;; v2x = v3x (in %xmm6)
        movss %xmm4, %xmm1      ;; v1y = v2y
        movss %xmm2, %xmm4      ;; v2y = v3y (in %xmm2)
        movss %xmm5, %xmm2      ;; v1z = v2z
        movss %xmm7, %xmm5      ;; v2z = v3z (saved in %xmm7)

(Note that doing the reordering manually does not help...)  :-(

Out of curiosity, can somebody check out yara-branch to see how it fares?
---

By comparison, the x87 is relatively easier, because there are only 8
registers to keep track of and fxch makes it much easier to write the
compensation code:

v3x = v1y * v2z - v1z * v2y
                                ;; v1x v1y v1z v2x v2y v2z
        fld   %st(1)            ;; v1y v1x v1y v1z v2x v2y v2z
        fmul  %st(6), %st(0)    ;; v1y*v2z v1x v1y v1z v2x v2y v2z
        fld   %st(3)            ;; v1z v1y*v2z v1x v1y v1z v2x v2y v2z
        fmul  %st(6), %st(0)    ;; v1z*v2y v1y*v2z v1x v1y v1z v2x v2y v2z
        fsubp %st(0), %st(1)    ;; v3x v1x v1y v1z v2x v2y v2z

v3y = v1z * v2x - v1x * v2z
        fld   %st(4)            ;; v2x v3x v1x v1y v1z v2x v2y v2z
        fmulp %st(0), %st(4)    ;; v3x v1x v1y v1z*v2x v2x v2y v2z
        fld   %st(1)            ;; v1x v3x v1x v1y v1z*v2x v2x v2y v2z
        fmul  %st(7), %st(0)    ;; v1x*v2z v3x v1x v1y v1z*v2x v2x v2y v2z
        fsubp %st(0), %st(4)    ;; v3x v1x v1y v3y v2x v2y v2z

v3z = v1x * v2y - v1y * v2x
        fld   %st(5)            ;; v2y v3x v1x v1y v3y v2x v2y v2z
        fmulp %st(0), %st(2)    ;; v3x v1x*v2y v1y v3y v2x v2y v2z
        fld   %st(4)            ;; v2x v3x v1x*v2y v1y v3y v2x v2y v2z
        fmul  %st(3), %st(0)    ;; v1y*v2x v3x v1x*v2y v1y v3y v2x v2y v2z
        fsubp %st(0), %st(2)    ;; v3x v3z v1y v3y v2x v2y v2z
        fstp  %st(2)            ;; v3z v3x v3y v2x v2y v2z
        fxch  %st(5)            ;; v2z v3x v3y v2x v2y v3z
        fxch  %st(2)            ;; v3y v3x v2z v2x v2y v3z
        fxch  %st(4)            ;; v2y v3x v2z v2x v3y v3z
        fxch  %st(1)            ;; v3x v2y v2z v2x v3y v3z
        fxch  %st(3)            ;; v2x v2y v2z v3x v3y v3z

(Well, the fxch instructions should still be scheduled, but it is possible to
do it without spilling.)

Paolo

-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amacleod at redhat dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780