https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67510
--- Comment #3 from Peter Cordes <pcordes at gmail dot com> ---
(In reply to Andrew Pinski from comment #2)
> Fixed by r10-5498 and r11-5429 .

Looks ok for -mtune=generic and non-Intel, but MOV/NEG isn't ideal on some
microarchitectures.  We still aren't using CMOV for -mtune=broadwell or
later, where it's single-uop with 1 cycle latency, same as AMD.
https://uops.info/  (CMOVBE/CMOVA are still 2 uops on Intel P-cores, since
they read CF, and ZF from the SPAZO group, so 4 total inputs.  But absolute
value only uses signed conditions, which don't involve CF.)

Intel E-cores (Silvermont family) also have efficient CMOV: latency is
1 cycle from one input and 2 cycles from the other, but it's a single uop
for the front-end.  (It looks like it uses two execution pipes like INC
does, but the front-end is often the bottleneck.)  So we should be using it
there, too.  The non-CMOV code-gen has even higher latency and worse
throughput.

Earlier Intel SnB-family (Sandy / Ivy Bridge and Haswell) might be better
with CMOV, too, since the uop cache reduces the downside of a 2-uop
instruction only being able to decode in the first decoder (so maybe keep
the current code-gen on P6-family?).  It's 4 uops either way, with smaller
machine-code size, and better on later CPUs.

----

The current CMOV code-gen has MOV on the latency critical path for
microarchitectures without mov-elimination.  (bdver* and earlier AMD, where
we currently use CMOV; and Ice Lake, first-gen Sandy Bridge, and P6-family,
where we currently don't.  I haven't checked on Zhaoxin.)
https://gcc.godbolt.org/z/Mbsqzhbfs

        movl    %edi, %eax
        negl    %eax
        cmovs   %edi, %eax

This could be easily avoided, either with XOR/SUB like I suggested in my
original report, or by NEGating the original value instead of the copy, for
CMOVNS (sketch below).  CMOV would still be overwriting the MOV result, so
it would still free up mov-elimination resources promptly
(https://stackoverflow.com/questions/75204302/how-do-move-elimination-slots-work-in-intel-cpu).
But unlike XOR/SUB, that destroys the original value, so it isn't an option
if the input register is needed later.  And with APX available, NEG into a
new register is optimal.  So it might be easier to only consider patterns
that produce the negated value in a separate register, which should be easy
to peephole-optimize with APX non-destructive operations.

XOR/SUB saves code-size for absolute value of 64-bit integers; the
xor-zeroing doesn't need a REX prefix if the destination is a low reg, e.g.

        # better version for most CPUs, especially for 64-bit operand-size
        xor     %eax, %eax    # off the critical path on all x86-64, eliminated on recent uarches
        sub     %rdi, %rax    # REX prefix
        cmovs   %rdi, %rax    # REX prefix
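A sketch of the NEG-the-original / CMOVNS alternative described above
(hand-written, not current GCC output; only usable when the input register
is dead afterwards, since it clobbers EDI):

        movl    %edi, %eax    # copy of x, off the critical path (and eliminable)
        negl    %edi          # EDI = -x, independent of the MOV; sets SF
        cmovns  %edi, %eax    # use -x when it's non-negative, i.e. when x <= 0

The MOV and NEG are independent here, so the critical path is just NEG +
CMOV even on CPUs without mov-elimination.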
In terms of back-end execution-unit cost:

* Zen 3 and later eliminate both MOV and XOR-zeroing, so both versions
  should perform equally except for the code-size difference with 64-bit
  operands.

* Zen 1 and 2 eliminate MOV but not XOR-zeroing.  (XOR-zeroing is
  dep-breaking so it's not on the critical path, but it still needs an
  execution unit to write the zero, so ~4 per clock throughput.)

* Gracemont E-cores eliminate MOV but not XOR-zeroing, like Zen 1 and 2.
  Their front-end is probably more likely to be the bottleneck, so this is
  even less of a big deal.
  https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/
  is my source for this and for Zen 1/2 vs. Zen 3.
  https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/
  has Zen 4 numbers, although its Sunny Cove (Ice Lake) numbers don't seem
  to reflect the microcode update that disabled mov-elimination for GPRs.

* Tremont and earlier Silvermont family: MOV is not eliminated, so
  XOR-zeroing is better.

* Ice Lake (and first-gen Sandy Bridge) eliminate XOR-zeroing but not
  integer MOV, so the current tune=generic code-gen has worse latency than
  necessary.  (Thanks to Intel's microcode workarounds for CPU errata, there
  will be CPUs with non-zero latency for mov reg,reg in service for many
  more years than previously expected.)

* Ivy Bridge and later, except the Ice Lake family (Ice / Tiger / Rocket),
  eliminate both, so either way is fine; XOR/SUB is preferable because Ice
  Lake exists.

* P6-family doesn't eliminate either MOV or XOR.  P6-family is obsolete
  enough that it's probably not worth anyone's time to update the tuning for
  it, but XOR/SUB is clearly better if we're using CMOV at all.  (2-uop
  instructions like CMOVS can be a problem depending on surrounding code,
  given the lack of a uop cache.)  Early P6-family (up to Pentium III) has a
  false dependency with xor-zeroing: it's special in terms of marking the
  register as EAX=AL upper-bits-zero to avoid partial-register stalls, but
  it isn't independent of the old value of EAX.  -mtune=intel and
  -mtune=generic shouldn't care about this (especially not with -m64, since
  the affected CPUs are 32-bit only).  Despite the false dependency,
  XOR-zeroing is probably still a good bet for -mtune=pentiumpro / pentium2
  / pentium3 if we can pick a register that's unlikely to have been written
  recently.  (The reorder buffer isn't huge on those old CPUs.)

* Netburst (Pentium 4) doesn't eliminate either, so XOR-zeroing is clearly
  better, like Pentium-M and later P6.

* VIA: single-uop CMOV on Nano 3000, 2 uops on Nano 2000.  MOV is not
  eliminated, so XOR-zeroing / SUB should be used.

* Zhaoxin: I don't know, I haven't seen data on these.  Presumably
  XOR/SUB/CMOV is good.

So in summary, -mtune=generic should use this:

        xor     %eax, %eax
        sub     %edi, %eax
        cmovs   %edi, %eax

It's better for critical-path latency on Ice Lake and on old CPUs like
Sandy Bridge and earlier, Bulldozer-family, P6-family, and Tremont and
earlier Silvermont-family.  It's *slightly* worse (needing an extra back-end
uop to execute) on Zen 1 and 2, and on Gracemont, but that back-end uop has
no dependencies so it can execute in any spare cycle on that execution port.
And for int64_t, this saves a byte of code-size in the xor-zeroing if the
destination isn't R8-R15.

Also, -march=broadwell and later should be using CMOV.  Current code-gen
from -march=icelake has 4-cycle critical-path latency vs. 2 for
XOR/SUB/CMOV:

        movl    %edi, %eax    # not eliminated on Ice Lake
        cltd                  # 1 uop, 1 cycle latency to write EDX (sign mask = x>>31)
        xorl    %edx, %eax    # x ^ mask
        subl    %edx, %eax    # (x ^ mask) - mask = abs(x)
        ret

Even worse, -march=goldmont avoids CDQ (AT&T cltd), costing an extra uop.
It still has 4-cycle latency:

        movl    %edi, %edx
        sarl    $31, %edx     # could have been cltd after the next MOV
        movl    %edi, %eax
        xorl    %edx, %eax
        subl    %edx, %eax
        ret

CPUs like Goldmont and Gracemont have CMOV with 1-cycle latency from the
destination operand and 2-cycle latency from the source and FLAGS.  So
unfortunately there's no way to get latency down to 2 cycles (we need a
cycle to generate the FLAGS input however we arrange the GPR operands).
Either XOR/SUB/CMOV or MOV/NEG-original/CMOV has 3-cycle latency, vs. 4 for
the current tune=generic code-gen (except on Gracemont, where zero-latency
MOV keeps it down to 3 cycles).
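Spelling out that 3-cycle figure for the MOV / NEG-the-original / CMOVNS
variant on a Goldmont-like core (hand-annotated cycle counts, assuming EDI
is ready at cycle 0 and MOV is not eliminated; see the uops.info
measurements linked below for the per-input CMOV latencies):

        movl    %edi, %eax    # copy of x ready at cycle 1
        negl    %edi          # -x and FLAGS ready at cycle 1, independent of the MOV
        cmovns  %edi, %eax    # 1c from the destination, 2c from source/FLAGS: result at cycle 3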
https://uops.info/table.html?search=cmovs%20r32)&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SNB=on&cb_CFL=on&cb_ICL=on&cb_ADLP=on&cb_GLM=on&cb_ADLE=on&cb_ZENp=on&cb_ZEN4=on&cb_measurements=on&cb_base=on
https://uops.info/html-lat/GLM/CMOVS_R32_R32-Measurements.html

Same goes for Intel P-cores before Broadwell: XOR/SUB/CMOV is a total of
3 cycles of latency, vs. 3 or 4 for mov/cltd/xor/sub depending on
mov-elimination for the initial mov.
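Annotated version of that pre-Broadwell comparison (hand-counted cycles,
assuming Sandy Bridge-style behavior: xor-zeroing eliminated, 2-uop /
2-cycle CMOV, and no mov-elimination, which Ivy Bridge / Haswell add):

        # proposed XOR/SUB/CMOV
        xor     %eax, %eax    # eliminated, off the critical path
        sub     %edi, %eax    # cycle 1
        cmovs   %edi, %eax    # cycles 2-3: 3 cycles total

        # current mov/cltd/xor/sub
        movl    %edi, %eax    # cycle 1 (0 on Ivy Bridge / Haswell with mov-elimination)
        cltd                  # cycle 2
        xorl    %edx, %eax    # cycle 3
        subl    %edx, %eax    # cycle 4 total (3 with mov-elimination)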