https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81478
--- Comment #4 from Sean McAllister <smcallis at gmail dot com> ---
Looking at the assembly for the __mulsc3 function:

   <+0>:  movaps  %xmm0,%xmm10
   <+4>:  movaps  %xmm2,%xmm11
   <+8>:  movaps  %xmm0,%xmm5
   <+11>: mulss   %xmm3,%xmm10
   <+16>: movaps  %xmm1,%xmm6
   <+19>: mulss   %xmm1,%xmm11
   <+24>: mulss   %xmm2,%xmm5
   <+28>: mulss   %xmm3,%xmm6
   <+32>: movaps  %xmm10,%xmm4
   <+36>: addss   %xmm11,%xmm4
   <+41>: movaps  %xmm5,%xmm9
   <+45>: subss   %xmm6,%xmm9
   <+50>: ucomiss %xmm4,%xmm4
   <+53>: setp    %al
   <+56>: ucomiss %xmm9,%xmm9
   <+60>: setp    %dl
   <+63>: and     %dl,%al
   <+65>: jne     0x7ffff7530a27 <__mulsc3+87>
   <snip>

The isnan(a) && isnan(b) check isn't short-circuited: both NaN tests always run, even when the first one already shows the result is fine. It'd be possible to write something like this:

   <snip>
   <+50>: ucomiss %xmm4,%xmm4
   <+53>: setp    %al
   <+XX>: jnp     good_cxmultiply
   <+XX>: ucomiss %xmm9,%xmm9
   <+XX>: setp    %dl
   <+XX>: and     %dl,%al
   <+XX>: jne     0x7ffff7530a27 <__mulsc3+XX>
   <+XX> good_cxmultiply:
   <snip>

(The early exit has to branch on the parity flag, since ucomiss sets ZF for both the equal and the unordered case; setp leaves the flags untouched.)

This makes the overhead in the general case three pretty cheap instructions instead of six (which are also very cheap). Someone smarter than me will have to decide whether that's a net win or not.

Emitting the code inline instead of calling __mulsc3 every time would also benefit the register allocator and give it options for shuffling things around. (I do a lot of complex arithmetic, so I'm interested in this being fast =D.)

It'd be cool if the vectorizer still had a shot at it, but I don't immediately see an easy way to achieve that.
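
For illustration, here is roughly what the short-circuited check with an inlined fast path might look like at the C level. This is only a sketch: cmul_fast is a hypothetical helper name, not anything in libgcc, and it assumes a C11 compiler (for CMPLXF) with default, non-fast-math floating-point settings. The slow path simply defers to the compiler's ordinary complex multiply, which on GCC typically ends up calling __mulsc3.

```c
#include <complex.h>
#include <math.h>

/* Hypothetical helper (not part of libgcc): do the four multiplies
   inline and only fall back to the full multiply when BOTH parts of
   the result are NaN, which is the case that needs Annex G recovery. */
static inline float complex cmul_fast(float complex x, float complex y)
{
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);

    float re = a * c - b * d;
    float im = a * d + b * c;

    /* && short-circuits: if re is not NaN, im is never tested. */
    if (isnan(re) && isnan(im))
        return x * y;   /* rare slow path, left to the compiler/runtime */

    return CMPLXF(re, im);
}
```

In the common case this costs one NaN test and one branch after the arithmetic, and because the multiplies are visible at the call site the register allocator is free to schedule them with the surrounding code.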