https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81478

--- Comment #4 from Sean McAllister <smcallis at gmail dot com> ---
Looking at the assembly for the __mulsc3 function:

 <+0>:  movaps  %xmm0,%xmm10
 <+4>:  movaps  %xmm2,%xmm11
 <+8>:  movaps  %xmm0,%xmm5
<+11>:  mulss   %xmm3,%xmm10
<+16>:  movaps  %xmm1,%xmm6
<+19>:  mulss   %xmm1,%xmm11
<+24>:  mulss   %xmm2,%xmm5
<+28>:  mulss   %xmm3,%xmm6
<+32>:  movaps  %xmm10,%xmm4
<+36>:  addss   %xmm11,%xmm4
<+41>:  movaps  %xmm5,%xmm9
<+45>:  subss   %xmm6,%xmm9
<+50>:  ucomiss %xmm4,%xmm4
<+53>:  setp    %al
<+56>:  ucomiss %xmm9,%xmm9
<+60>:  setp    %dl
<+63>:  and     %dl,%al
<+65>:  jne     0x7ffff7530a27 <__mulsc3+87>
<snip>
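
For context, here is a minimal C sketch (my own reconstruction, not the actual
libgcc2.c source; the recovery branch is elided) of what this prologue
computes: the four partial products, the real and imaginary results, and the
NaN check on both.

#include <math.h>

/* Sketch of the single-precision complex multiply the assembly above
   implements; variable names are mine.  */
_Complex float
mulsc3_sketch (float a, float b, float c, float d)
{
  float ac = a * c, bd = b * d, ad = a * d, bc = b * c;
  float x = ac - bd;        /* real part      (%xmm9 above) */
  float y = ad + bc;        /* imaginary part (%xmm4 above) */

  if (isnan (x) && isnan (y))
    {
      /* Annex G recovery: patch up the operands and recompute so that
         infinities still produce a usable result.  Elided here; the
         point is only the non-short-circuited NaN check.  */
    }

  _Complex float res;
  __real__ res = x;
  __imag__ res = y;
  return res;
}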

The isnan() && isnan() check on the real and imaginary results isn't
short-circuited.  Since ucomiss of a value against itself sets ZF either way
and raises PF only for a NaN, the second check could be skipped with a jnp as
soon as the first result is known not to be a NaN.  It'd be possible to write
something like this:

<snip>
<+50>:  ucomiss %xmm4,%xmm4
<+53>:  setp    %al
<+XX>:  jnp     good_cxmultiply
<+XX>:  ucomiss %xmm9,%xmm9
<+XX>:  setp    %dl
<+XX>:  and     %dl,%al
<+XX>:  jne     0x7ffff7530a27 <__mulsc3+XX>
good_cxmultiply:
<snip>

This cuts the overhead in the general case from six pretty cheap instructions
to three; someone smarter than me will have to decide whether that's a net
win. Emitting the code inline instead of calling __mulsc3 every time would
also benefit the register allocator and give it options for shuffling things
around (I do a lot of complex arithmetic, so I'm interested in this being
fast =D).  It'd be cool if the vectorizer still had a shot at it, but I don't
immediately see an easy way to achieve that.
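
For anyone who wants to poke at this, here is a one-liner that should go
through __mulsc3 under the default Annex G handling, and that GCC can expand
inline once the NaN/Inf recovery is waived with -fcx-limited-range (or
-ffast-math, which turns it on):

/* cmul.c */
_Complex float
cmul (_Complex float a, _Complex float b)
{
  return a * b;
}

Comparing "gcc -O2 -S cmul.c" against "gcc -O2 -fcx-limited-range -S cmul.c"
should show the difference between the library call and the inline multiply.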
