https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63599
--- Comment #2 from vincenzo Innocente <vincenzo.innocente at cern dot ch> ---
I agree that the code produces correct results, but it looks sub-optimal to me.
As I understand it, with -Ofast the sequence below is always executed:
andps %xmm5, %xmm8
rcpps %xmm3, %xmm0
mulps %xmm0, %xmm3
mulps %xmm0, %xmm3
addps %xmm0, %xmm0
subps %xmm3, %xmm0
mulps %xmm0, %xmm1
movaps %xmm2, %xmm0
cmpleps %xmm4, %xmm0
blendvps %xmm0, %xmm2, %xmm1
while with -O2 it is not.
This generates a performance penalty for samples where the test is often
false.
(I tried adding __builtin_expect(x, false), with no effect.)
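
To make the point concrete, here is a minimal scalar sketch in C of what I read the listing as doing: a reciprocal estimate (rcpps), one Newton-Raphson refinement (mulps, mulps, addps, subps), a multiply by the numerator, and finally a select (cmpleps + blendvps). All names here (approx_rcp, refined_rcp, guarded_div_branch, guarded_div_select, num, den, v, lim) are hypothetical, and the mapping of registers to source variables is my assumption, not taken from the original test case. approx_rcp only models a coarse estimate; it is not what the hardware computes.

```c
/* Hypothetical stand-in for rcpps: a reciprocal with a deliberate
   relative error of about 5e-4, roughly the order of the hardware
   estimate. Assumption for illustration only. */
static float approx_rcp(float d)
{
    return (1.0f / d) * 1.0005f;
}

/* One Newton-Raphson step, in the same operation order as the listing:
   x1 = (x0 + x0) - (d * x0) * x0 */
static float refined_rcp(float d)
{
    float x0 = approx_rcp(d);  /* rcpps */
    float t = d * x0;          /* mulps */
    t = t * x0;                /* mulps */
    return (x0 + x0) - t;      /* addps, subps */
}

/* Branchy form: the quotient is computed only when the test fails,
   and the hint marks that path as unlikely. */
static float guarded_div_branch(float num, float den, float v, float lim)
{
    if (__builtin_expect(v > lim, 0))
        return num / den;
    return v;
}

/* If-converted form: the quotient is always computed, then one of the
   two values is selected, as cmpleps + blendvps would do. */
static float guarded_div_select(float num, float den, float v, float lim)
{
    float q = num * refined_rcp(den);  /* executed unconditionally */
    return (v <= lim) ? v : q;
}
```

Both forms return the same value up to the accuracy of the refined reciprocal. The difference is only in cost: the select form pays for the reciprocal and the refinement on every element, which is the penalty I describe when v <= lim holds most of the time.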