https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118505

--- Comment #5 from Dhruv Chawla <dhruvc at nvidia dot com> ---
(In reply to Andrew Pinski from comment #3)
> Note there is also a fma forming missing:
>   _69 = s_64 + 1.0e+0;
>   ...
>   _71 = _69 * _70;
> 
> which is:
>   `(s_64 + 1.0) * _70` which can be rewritten as `s_64 * _70 + _70`
> 
> That might alone get the performance back up. I should note that LLVM also
> does the fcsel, but it changes the two-instruction `(a+1) * b` into a single
> fma instruction `a*b + b`.
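
For reference, a minimal C sketch of the rewrite (single precision assumed from
the s registers; the function names and the explicit fmaf() call are just for
illustration, and since the rewrite changes rounding it presumably needs the
usual unsafe-math flags):

        #include <math.h>

        /* Two instructions: fadd for (s + 1.0f), then fmul.  */
        float mul_plain (float s, float t)
        {
          return (s + 1.0f) * t;
        }

        /* Rewritten form: s * t + t maps onto a single fmadd.  */
        float mul_fma (float s, float t)
        {
          return fmaf (s, t, t);
        }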

I tried doing this; the resulting codegen is:

        fcmpe   s2, #0.0                  // compare s2 against 0.0
        fmul    s1, s30, s30              // s1 = s30 * s30
        fcsel   s31, s1, s31, gt          // s31 = (s2 > 0.0) ? s1 : s31
        fmadd   s0, s31, s0, s30          // s0 = s31 * s0 + s30
        str     s0, [x21, x0]
        ldr     s29, [x19, x0]
        fmadd   s29, s31, s29, s29        // s29 = s31 * s29 + s29, i.e. (s31 + 1) * s29
        str     s29, [x20, x0]

I don't see any real performance impact from this. Also, clang's codegen still
appears to be a bit slower than the split paths.
