https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116630

            Bug ID: 116630
           Summary: Implement spaceshipm3 optab for aarch64
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

For the testcase:
#include <compare>
auto cmp4way(double a, double b)
{
  return a <=> b;
}

auto cmp4wayf(float a, float b)
{
  return a <=> b;
}

with -O2 -std=c++20 GCC currently generates the following for aarch64, which
could be improved:
cmp4way(double, double):
        fcmp    d0, d1
        mov     w0, 0
        beq     .L2
        fcmpe   d0, d1
        mov     w0, -1
        bmi     .L2
        cset    w0, gt
        mov     w1, 2
        sub     w0, w1, w0
.L2:
        ret
cmp4wayf(float, float):
        fcmp    s0, s1
        mov     w0, 0
        beq     .L8
        fcmpe   s0, s1
        mov     w0, -1
        bmi     .L8
        cset    w0, gt
        mov     w1, 2
        sub     w0, w1, w0
.L8:
        ret

LLVM generates:
cmp4way(double, double):
        fcmp    d0, d1
        cset    w8, ne
        lsl     w8, w8, #1
        csinc   w8, w8, wzr, le
        csinv   w0, w8, wzr, pl
        ret

cmp4wayf(float, float):
        fcmp    s0, s1
        cset    w8, ne
        lsl     w8, w8, #1
        csinc   w8, w8, wzr, le
        csinv   w0, w8, wzr, pl
        ret
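
To make the flag logic in the sequence above easier to follow, here is a
minimal C++ sketch that models it one instruction at a time. It is an
illustration only, not compiler code: Flags, fcmp_flags, spaceship_model,
expected and check_model are hypothetical names introduced here, and the
encoding -1/0/1/2 (less/equal/greater/unordered) is simply what both listings
return. The float variant is identical apart from the register width.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// NZCV condition flags as set by an AArch64 floating-point compare.
struct Flags { bool n, z, c, v; };

// Hypothetical helper: the flag values FCMP produces for each outcome.
constexpr Flags fcmp_flags(double a, double b)
{
  if (a != a || b != b) return {false, false, true, true};   // unordered
  if (a == b)           return {false, true, true, false};   // equal
  if (a < b)            return {true, false, false, false};  // less than
  return {false, false, true, false};                        // greater than
}

// Model of the branchless sequence, one statement per instruction.
constexpr int32_t spaceship_model(double a, double b)
{
  const Flags f = fcmp_flags(a, b);        // fcmp  d0, d1
  const int32_t wzr = 0;                   // zero register
  int32_t w8 = f.z ? 0 : 1;                // cset  w8, ne
  w8 <<= 1;                                // lsl   w8, w8, #1
  const bool le = f.z || (f.n != f.v);     // LE: Z set or N != V
  w8 = le ? w8 : wzr + 1;                  // csinc w8, w8, wzr, le
  const bool pl = !f.n;                    // PL: N clear
  return pl ? w8 : ~wzr;                   // csinv w0, w8, wzr, pl
}

// Reference mapping written with ordinary comparisons.
constexpr int32_t expected(double a, double b)
{
  if (a < b)  return -1;
  if (a == b) return 0;
  if (a > b)  return 1;
  return 2;                                // unordered (a NaN operand)
}

// Compare the model against the reference for all four outcomes.
inline void check_model()
{
  const double nan = std::numeric_limits<double>::quiet_NaN();
  const double cases[][2] = {
    {1.0, 2.0}, {2.0, 2.0}, {3.0, 2.0}, {nan, 2.0}, {2.0, nan}};
  for (const auto &c : cases)
    assert(spaceship_model(c[0], c[1]) == expected(c[0], c[1]));
}
```

Calling check_model() exercises the less, equal, greater and unordered cases.
The unordered case falls through both conditional selects unchanged: FCMP sets
V, so LE holds (N != V), and N is clear, so PL holds as well.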

I guess the LLVM sequence executes more instructions on average, but it avoids
conditional branches that risk being mispredicted and has a smaller code size
(6 instructions per function versus 10), so it should at least be preferred
for -Os.
It also executes only one fcmp, which tends to have lower throughput than the
other cheap GP insns, whereas GCC can end up executing two (fcmp followed by
fcmpe).
