https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118505
--- Comment #5 from Dhruv Chawla <dhruvc at nvidia dot com> ---
(In reply to Andrew Pinski from comment #3)
> Note there is also a fma forming missing:
> _69 = s_64 + 1.0e+0;
> ...
> _71 = _69 * _70;
>
> which is:
> `(s_64 + 1.0) * _70` which can be rewritten as `s_64 * _70 + _70`
>
> That might alone get the performance back up. I should note that LLVM also
> does the fcsel but with changing of the 2 instructions `(a+1) * b` into one
> fma instruction `a*b + b`.

I tried doing this; the resulting codegen is:

        fcmpe   s2, #0.0
        fmul    s1, s30, s30
        fcsel   s31, s1, s31, gt
        fmadd   s0, s31, s0, s30
        str     s0, [x21, x0]
        ldr     s29, [x19, x0]
        fmadd   s29, s31, s29, s29
        str     s29, [x20, x0]

I don't really see a performance impact. Also, it seems that clang's codegen
is still a bit slower than the split paths.
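
For reference, a minimal C sketch of the rewrite Andrew describes (function
names are made up, and this is not the exact testcase from the PR; the two
forms round differently, so the rewrite presumably only applies under
unsafe-math reassociation, e.g. -Ofast / -ffast-math):

        /* Two-instruction form: fadd + fmul. */
        float mul_plus_one(float a, float b)
        {
            return (a + 1.0f) * b;
        }

        /* Rewritten form: contracts to a single fmadd on AArch64. */
        float mul_plus_one_fma(float a, float b)
        {
            return a * b + b;
        }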