https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846
Bug ID: 96846
Summary: [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: andysem at mail dot ru
Target Milestone: ---

This has already been reported in bug 86352, but that bug also describes a few
other issues, so I decided to create a separate bug focused on this one
particular issue.

When performing boolean operations, gcc often prefers the following pattern:

1. A test instruction (test/cmp).
2. "setcc %al" (where al can be any 8-bit register).
3. If the bool needs to be further processed, "movzx %al, %eax" (where al and
   eax are the 8- and 32-bit versions of the register picked in #2).

The example code can be seen here: https://gcc.godbolt.org/z/z3WGbq

For convenience I'm posting the code below:

bool narrow_boolean(int a)
{
    return a != 5;
}

unsigned int countmatch(unsigned int arr[], unsigned int key)
{
    unsigned count = 0;
    for (int i = 0; i < 1024; i++) {
        count += ((arr[i] & key) != 5);
    }
    return count;
}

narrow_boolean(int):
        cmp     edi, 5
        setne   al
        ret

countmatch(unsigned int*, unsigned int):
        lea     rcx, [rdi+4096]
        xor     eax, eax
.L4:
        mov     edx, DWORD PTR [rdi]
        and     edx, esi
        cmp     edx, 5
        setne   dl
        movzx   edx, dl
        add     rdi, 4
        add     eax, edx
        cmp     rcx, rdi
        jne     .L4
        ret

Command line parameters: -Wall -O3 -mtune=skylake -fno-tree-vectorize

The problems with the generated code are as follows:

- The setcc instruction only modifies the lower 8 bits of the full
  architectural register. The upper bits remain unmodified and potentially
  "dirty", meaning that subsequent instructions taking this register as input
  may require a merge of the full register value, with a performance penalty.

- Because setcc preserves the upper bits, subsequent instructions consuming
  the output register become dependent on whatever previously wrote that
  register. This creates an unnecessary (false) dependency, which is
  especially a problem in the case of narrow_boolean above.

- On Haswell and later, "movzx %al, %eax" is not eliminated at register
  rename: it consumes an ALU uop and has non-zero latency. "movzx %al, %ebx"
  (i.e. when the output register is different from the input) is eliminated at
  the rename stage but requires an additional register. Yet gcc seems to
  prefer the former form unless -frename-registers is specified.

See this excellent StackOverflow question and its answers for details:
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to
(BTW, the code in this bug originated from that question.)

A better instruction pattern would be this:

1. Zero the target register beforehand with "xor %eax, %eax".
2. Perform the test.
3. "setcc %al" to set the boolean result.

If this pattern occurs in a loop body, the initial xor can be hoisted out of
the loop. The important part is that the xor breaks the dependency and does
not cost an ALU uop, since xor-zeroing is handled at register rename.
Arguably, it also produces smaller code, since xor is only 2 bytes vs. 3 bytes
for movzx.

The initial zeroing requires an additional register, meaning the sequence
cannot reuse a register involved in the test in #2. However, that is often not
a problem, as in the examples above. I guess the compiler could estimate
whether there are spare registers and use the xor/test/setcc sequence if there
are, and test/setcc/movzx if not. Hand-written sketches of the proposed code
for both examples follow below.
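
For reference, here is what the proposed sequence could look like for
narrow_boolean. This is a hand-written sketch, not actual compiler output:

narrow_boolean(int):
        xor     eax, eax        ; zero the full register; handled at rename,
                                ; no ALU uop, breaks the dependency on old rax
        cmp     edi, 5          ; xor must come first, since it clobbers EFLAGS
        setne   al              ; eax now holds a properly zero-extended bool
        ret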
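
And a hand-written sketch of the countmatch loop with the zeroing xor hoisted
out of the loop. The register choice is illustrative only: the bool now lives
in edx, so the sketch assumes the caller-clobbered r8 is free for the load:

countmatch(unsigned int*, unsigned int):
        lea     rcx, [rdi+4096]
        xor     eax, eax        ; count = 0
        xor     edx, edx        ; hoisted zeroing of the bool register
.L4:
        mov     r8d, DWORD PTR [rdi]
        and     r8d, esi
        cmp     r8d, 5
        setne   dl              ; only the low byte is rewritten each iteration;
                                ; the upper 24 bits of edx stay zero
        add     rdi, 4
        add     eax, edx        ; no movzx needed
        cmp     rcx, rdi
        jne     .L4
        ret

This illustrates the extra-register cost mentioned above: the loop now uses
r8d in addition to the registers in the original code.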