https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846
Bug ID: 96846
Summary: [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: andysem at mail dot ru
Target Milestone: ---

This has already been reported in bug 86352, but that bug also describes a few
other issues, so I decided to create a separate bug focused on this one
particular issue.

When performing boolean operations, gcc often prefers the following pattern:

1. A test instruction (test/cmp).
2. "setcc %al" (where al can be any 8-bit register).
3. If the bool needs to be further processed, "movzx %al, %eax" (where al and
   eax are the 8- and 32-bit versions of the register picked in #2).

The example code can be seen here: https://gcc.godbolt.org/z/z3WGbq

For convenience I'm posting the code below:

bool narrow_boolean(int a)
{
    return a != 5;
}

unsigned int countmatch(unsigned int arr[], unsigned int key)
{
    unsigned count = 0;
    for (int i = 0; i < 1024; i++) {
        count += ((arr[i] & key) != 5);
    }
    return count;
}

narrow_boolean(int):
        cmp     edi, 5
        setne   al
        ret

countmatch(unsigned int*, unsigned int):
        lea     rcx, [rdi+4096]
        xor     eax, eax
.L4:
        mov     edx, DWORD PTR [rdi]
        and     edx, esi
        cmp     edx, 5
        setne   dl
        movzx   edx, dl
        add     rdi, 4
        add     eax, edx
        cmp     rcx, rdi
        jne     .L4
        ret

Command line parameters: -Wall -O3 -mtune=skylake -fno-tree-vectorize

The problems with the generated code are as follows:

- The setcc instruction only modifies the lower 8 bits of the full
  architectural register. The upper bits remain unmodified and potentially
  "dirty", meaning that subsequent instructions taking this register as input
  may require a merge of the full register value, with a performance penalty.

- Because setcc preserves the upper bits, subsequent instructions consuming
  the output register become dependent on whatever previously wrote that
  register. This creates an unnecessary (false) dependency, which is
  especially a problem in the case of narrow_boolean above.

- On Haswell and later, "movzx %al, %eax" is not eliminated at register
  rename: it consumes an ALU uop and has non-zero latency. "movzx %al, %ebx"
  (i.e. when the output register is different from the input) is eliminated at
  the rename stage but requires an additional register. Yet gcc seems to
  prefer the former form unless -frename-registers is specified.

See this excellent StackOverflow question and its answers for details:
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to
(BTW, the code in this bug originated from that question.)

A better instruction pattern would be this:

1. Zero the target register beforehand with "xor %eax, %eax".
2. Perform the test.
3. "setcc %al" to set the boolean result.

If this pattern occurs in a loop body, the initial xor can be hoisted out of
the loop. The important part is that the xor breaks the dependency and does
not cost an ALU uop, since xor-zeroing is handled at register rename.
Arguably, it also produces smaller code, since xor is only 2 bytes vs. 3 bytes
for movzx.

The initial zeroing requires an additional register, meaning the sequence
cannot reuse a register involved in the test in #2. However, that is often not
a problem, as in the examples above. I guess the compiler could estimate
whether there are spare registers and use the xor/test/setcc sequence if there
are, and test/setcc/movzx if not. Hand-written sketches of the proposed code
for both examples follow below.
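
For reference, here is what the proposed sequence could look like for
narrow_boolean. This is a hand-written sketch, not actual compiler output:

narrow_boolean(int):
        xor     eax, eax        ; zero the full register; handled at rename,
                                ; no ALU uop, breaks the dependency on old rax
        cmp     edi, 5          ; xor must come first, since it clobbers EFLAGS
        setne   al              ; eax now holds a properly zero-extended bool
        ret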
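
And a hand-written sketch of the countmatch loop with the zeroing xor hoisted
out of the loop. The register choice is illustrative only: the bool now lives
in edx, so the sketch assumes the caller-clobbered r8 is free for the load:

countmatch(unsigned int*, unsigned int):
        lea     rcx, [rdi+4096]
        xor     eax, eax        ; count = 0
        xor     edx, edx        ; hoisted zeroing of the bool register
.L4:
        mov     r8d, DWORD PTR [rdi]
        and     r8d, esi
        cmp     r8d, 5
        setne   dl              ; only the low byte is rewritten each iteration;
                                ; the upper 24 bits of edx stay zero
        add     rdi, 4
        add     eax, edx        ; no movzx needed
        cmp     rcx, rdi
        jne     .L4
        ret

This illustrates the extra-register cost mentioned above: the loop now uses
r8d in addition to the registers in the original code.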