https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81614
Bug ID: 81614 Summary: x86 optimizer combines results of comparisons in a way that risks partial register stalls Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: cody at codygray dot com Target Milestone: --- Target: i?86-*-* Consider the following code: bool foo(int a, int b, int c) { // It doesn't matter if this short-circuits ('||' vs. '|') // because the optimizer treats them as equivalent. return (a == c || b == c); } All versions of GCC (going back to at least 4.4.7 and forward to the current 8.0 preview) translate this to the following optimized assembly on x86 targets: foo(int, int, int): movl 12(%esp), %edx cmpl %edx, 4(%esp) sete %al cmpl 8(%esp), %edx sete %dl orl %edx, %eax ret The problem here is the second-to-last instruction. It ORs together two full 32-bit registers, even though the preceding SETE instructions only set the low 8 bits of each register. This results in a speed-zapping phenomenon on virtually all x86 processors called a *partial register stall*. (See http://www.agner.org/optimize/microarchitecture.pdf for details on exactly how this is a performance problem on various implementations of x86. Although there are differences in exactly *why* it is a speed penalty, it virtually always is and *certainly* should be considered one when the output is tuned for a generic x86 target.) You get the same results at all optimization levels, including -Os (at least, the relevant portion of the code is the same). You also see this for x86-64 targets: foo(int, int, int): cmpl %edx, %edi sete %al cmpl %esi, %edx sete %dl orl %edx, %eax ret One of two things should be done instead: either (A) perform the bitwise operation *only* on the low bytes, or (B) pre-zero the entire 32-bit register *before* setting its low byte to break dependencies. Proposed Resolution A (use only low bytes): foo(int, int, int): movl 12(%esp), %edx cmpl %edx, 4(%esp) sete %al cmpl 8(%esp), %edx sete %dl orl %dl, %al ret Proposed Resolution B (pre-zero to break dependencies): foo(int, int, int): movl 12(%esp), %edx xorl %eax, %eax cmpl %edx, 4(%esp) sete %al xorl %ecx, %ecx cmpl 8(%esp), %edx sete %cl orl %ecx, %eax ret Approach A is the one used by Clang and MSVC. It solves the problem of partial register stalls while avoiding the need for a third register as in Approach B. The disadvantage of Approach A is that it creates only a byte-sized (8-bit) result. This is perfectly fine if the function returns a bool, but doesn't work if the function returns an integer type. There are two ways to solve that. What GCC currently does if you change foo() to return int is add a MOVZBL instruction between the OR and RET: foo(int, int, int): movl 12(%esp), %edx cmpl %edx, 4(%esp) sete %al cmpl 8(%esp), %edx sete %dl orl %edx, %eax movzbl %al, %eax ret This zero-extends the result in AL into EAX. (Notice that the partial register stall hazard is still there.) This existing behavior could simply be maintained. However, it would be more optimal to pre-zero as shown in Approach B. (For details on why this would be more optimal on all x86 microarchitectures, see here: https://stackoverflow.com/a/33668295).