https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116630
Bug ID: 116630
Summary: Implement spaceshipm3 optab for aarch64
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

For the testcase:

#include <compare>

auto cmp4way(double a, double b)
{
  return a <=> b;
}

auto cmp4wayf(float a, float b)
{
  return a <=> b;
}

with -O2 -std=c++20 we can generate better code for aarch64 than we do now:

cmp4way(double, double):
        fcmp    d0, d1
        mov     w0, 0
        beq     .L2
        fcmpe   d0, d1
        mov     w0, -1
        bmi     .L2
        cset    w0, gt
        mov     w1, 2
        sub     w0, w1, w0
.L2:
        ret
cmp4wayf(float, float):
        fcmp    s0, s1
        mov     w0, 0
        beq     .L8
        fcmpe   s0, s1
        mov     w0, -1
        bmi     .L8
        cset    w0, gt
        mov     w1, 2
        sub     w0, w1, w0
.L8:
        ret

LLVM generates:

cmp4way(double, double):
        fcmp    d0, d1
        cset    w8, ne
        lsl     w8, w8, #1
        csinc   w8, w8, wzr, le
        csinv   w0, w8, wzr, pl
        ret
cmp4wayf(float, float):
        fcmp    s0, s1
        cset    w8, ne
        lsl     w8, w8, #1
        csinc   w8, w8, wzr, le
        csinv   w0, w8, wzr, pl
        ret

I guess the LLVM sequence executes more instructions on average, but it avoids conditional branches that risk being mispredicted and is lower codesize overall, so it should at least be preferred for -Os.
It also has only one fcmp instruction, which tends to be lower throughput than the other cheap GP insns, whereas GCC can end up executing two fcmp.
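
For reference, here is a rough portable C++ model of what the branchless LLVM sequence above computes, one statement per conditional-select step. It is only a sketch: cmp4way_model and cmp4way_model_self_check are hypothetical names, and the result encoding (-1 less, 0 equal, 1 greater, 2 unordered) is inferred from the two assembly listings rather than taken from any library header.

```cpp
#include <limits>

// Hypothetical model, not GCC or LLVM code. Result encoding inferred from
// the assembly in this report: -1 less, 0 equal, 1 greater, 2 unordered.
int cmp4way_model(double a, double b)
{
    int r = (a != b) ? 2 : 0;  // cset w8, ne ; lsl w8, w8, #1 (ne also holds for NaN)
    r = (a > b) ? 1 : r;       // csinc w8, w8, wzr, le (le fails only when a > b)
    r = (a < b) ? -1 : r;      // csinv w0, w8, wzr, pl (pl fails only when a < b)
    return r;
}

// Checks one input per outcome, including the unordered (NaN) case.
bool cmp4way_model_self_check()
{
    const double nan = std::numeric_limits<double>::quiet_NaN();
    return cmp4way_model(1.0, 2.0) == -1
        && cmp4way_model(2.0, 2.0) == 0
        && cmp4way_model(3.0, 2.0) == 1
        && cmp4way_model(nan, 2.0) == 2;
}
```

Tracing the GCC sequence by hand gives the same four values (0 on beq, -1 on bmi, then 2 - cset gt), so the two code shapes should differ only in branching and in the number of fcmp instructions executed.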