https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67780
Bug ID: 67780
Summary: Excess instructions when returning an int from testing a bit in a
         uint16_t table, including slow-decode-on-Intel length-changing
         prefix. Seen with ctype (isalnum)
Product: gcc
Version: 5.2.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86, x86-64

Tags and other fields may be incomplete; I'm not sure if "x86, x86-64" is
correct for Target.

Many of glibc's ctype functions (isalnum, for example) compile to a lookup in
a table of 16-bit type masks, followed by returning one bit of that mask. All
the glibc macros boil down to something essentially like this:

unsigned short *get_table(void);  // one less layer of indirection than __ctype_b_loc

int foo(int c) { return get_table()[c] & 1U<<10; }
int bar(int c) { return get_table()[c] & 1U<<3;  }

Testing on godbolt (https://goo.gl/SWJEyP) shows:

g++ 4.4.7 through 4.6.4 makes sub-optimal x86-64 code that does the mov and
the sign extension (movslq) as separate instructions.

g++ 4.7.3 through 5.2.0 makes terrible x86-64 and x86 code that uses
AND $1024, %ax. This is a case where the operand-size prefix changes the
length of the *rest* of the instruction, stalling the decoders in Intel CPUs
for multiple cycles. gcc avoids the LCP with -m32 -mtune=pentium3, but not
with -mtune=core2. Core2 may avoid the LCP stall if the instruction is part
of a very small (64-byte) loop, since Core2 can use its instruction-fetch
queue as a loop buffer (before the decoders, but maybe with instruction
lengths already marked). Intel CPUs with a uop cache (SnB and later) are hurt
less often by LCP stalls, but in this case writing the partial register is a
bad idea anyway.

Clang produces an optimal sequence:

foo(int):                              # @foo(int)
        pushq   %rbx
        movslq  %edi, %rbx
        callq   get_table()
        movzwl  (%rax,%rbx,2), %eax
        andl    $1024, %eax            # imm = 0x400
        popq    %rbx
        retq

g++ 5.2.0 -O3 -march=native -mtune=native on godbolt (where native=haswell)
does this:

foo(int):
        pushq   %rbx                   #
        movslq  %edi, %rbx             # c,
        call    get_table()            #
        movzwl  (%rax,%rbx,2), %eax    # *_7, *_7
        popq    %rbx                   #
        andw    $1024, %ax             #, D.2584
        movzwl  %ax, %eax              # D.2584, D.2585
        ret

The last movzwl is there, I guess, to avoid the partial-register penalty from
writing %ax instead of %eax. The 3 instruction bytes for movzwl take more
space than was saved by using andw with an imm16 (4 bytes including the
operand-size prefix) instead of andl with an imm32 (5 bytes). Since we're
tuning for Haswell, we should omit the movzwl: Agner Fog says that Haswell
has no visible perf penalties for writing partial registers. SnB/IvB still
need an extra uop, but maybe have a smaller penalty than earlier CPUs. It
sounds like Core2 takes 2-3 extra cycles to insert the extra uop. Maybe
that's not why gcc was using movzwl?

bar compiles to:

        ...
        movzwl  (%rax,%rbx,2), %eax    # *_7, *_7
        popq    %rbx                   #
        andl    $8, %eax               #, D.2604
        movzwl  %ax, %eax              # D.2604, D.2605
        ret

That's AND r32, imm8, so gcc doesn't try to save space by using the
LCP-penalty-inducing imm16 version this time. However, it zero-extends twice:
once as part of the load, and again after the AND. Since the high bit of the
16-bit word can't be set after the AND, the 3-byte movzwl could be replaced
with the one-byte CWDE to sign-extend %ax into %eax, but either way it just
shouldn't be there, wasting space in the uop cache. This is probably still a
target bug, since ARM and ARM64 don't have an extra instruction after the AND.
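For reference, here is a rough sketch of the extra indirection that the
simplified get_table() test case strips out. This is only my recollection of
the shape of glibc's <ctype.h> internals: the __ctype_b_loc() prototype
matches what ctype.h declares, but the exact macro body and the _IS* mask
values are endian-dependent and vary between glibc versions, so treat the
constant as illustrative rather than authoritative:

/* __ctype_b_loc() returns a pointer to a per-thread table pointer,
   so the real macros do one more load than get_table()[c] does.  */
extern const unsigned short int **__ctype_b_loc (void);

int isalpha_like (int c)
{
    /* 1U<<10 matches foo() above and the 0x400 mask seen in the libc
       disassembly further down; glibc derives its real mask constants
       from an endian-dependent _ISbit() macro.  */
    return (*__ctype_b_loc ())[c] & (1U << 10);
}

Compiling this sketch instead of foo() should show the same andw + movzwl
tail plus the extra pointer load, though I haven't re-checked it on every
affected gcc version.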
This is a regression from older g++; 4.4.7 produced:

foo(int):
        pushq   %rbx                   #
        movl    %edi, %ebx             # c, c
        call    get_table()            #
        movslq  %ebx, %rbx             # c, c
        movzwl  (%rax,%rbx,2), %eax    #* D.2216, tmp61
        popq    %rbx                   #
        andl    $1024, %eax            #, tmp61
        ret

Note the 32-bit mov before the call, then the sign extension after. Other
than that, it's optimal.

This bug affects the C front end, not just C++ (which is all godbolt gives
easy access to for testing):

objdump -d /lib/x86_64-linux-gnu/libc-2.21.so   # on x86-64 Ubuntu 15.04:

000000000002e190 <isalpha>:
   2e190:  48 8b 05 69 5c 39 00   mov    0x395c69(%rip),%rax  # 3c3e00 <_IO_file_jumps+0x5c0>
   2e197:  48 63 ff               movslq %edi,%rdi
   2e19a:  64 48 8b 00            mov    %fs:(%rax),%rax
   2e19e:  0f b7 04 78            movzwl (%rax,%rdi,2),%eax
   2e1a2:  66 25 00 04            and    $0x400,%ax
   2e1a6:  0f b7 c0               movzwl %ax,%eax
   2e1a9:  c3                     retq

Note that the call to __ctype_b_loc is inlined into the libc function, but we
still see the LCP-penalty-inducing AND imm16 and a redundant movzwl %ax, %eax.
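To make it clearer what the inlined __ctype_b_loc amounts to in the objdump
above: the mov %fs:(%rax),%rax is a thread-local load of the table pointer,
and the movzwl then indexes that table. A minimal sketch, using a made-up TLS
variable name (glibc's real TLS layout and symbol names differ):

/* hypothetical stand-in for glibc's per-thread ctype table pointer */
static __thread const unsigned short int *ctype_b_ptr;

int isalpha_sketch (int c)
{
    /* the thread-local pointer load becomes a %fs-relative mov,
       like the mov %fs:(%rax),%rax in the isalpha disassembly */
    return ctype_b_ptr[c] & 0x400;
}

The interesting part is the tail: ideally the & 0x400 becomes a single andl
on a register that the movzwl load already zero-extended, as in the clang
output earlier, rather than an andw followed by a second movzwl.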