https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67780
Bug ID: 67780
Summary: Excess instructions when returning an int from testing a bit in a
         uint16_t table, including slow-decode-on-Intel length-changing
         prefix. Seen with ctype (isalnum)
Product: gcc
Version: 5.2.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86, x86-64

Tags and other fields may be incomplete; I'm not sure if "x86, x86-64" is
correct for Target.

Many of glibc's ctype functions (isalnum, for example) compile to a lookup in
a table of 16-bit type masks, followed by returning one bit of that mask. All
the glibc macros boil down to something essentially like this:

unsigned short *get_table(void);  // one less layer of indirection than __ctype_b_loc

int foo(int c) { return get_table()[c] & 1U<<10; }
int bar(int c) { return get_table()[c] & 1U<<3;  }

Testing on godbolt (https://goo.gl/SWJEyP) shows:

g++ 4.4.7 through 4.6.4 makes sub-optimal x86-64 code that does the mov and
the sign extension (movslq) as separate instructions.

g++ 4.7.3 through 5.2.0 makes terrible x86-64 and x86 code that uses
AND $1024, %ax. This is a case where the operand-size prefix changes the
length of the *rest* of the instruction, stalling the decoders in Intel CPUs
for multiple cycles. gcc avoids the LCP with -m32 -mtune=pentium3, but not
with -mtune=core2. Core2 may avoid the LCP stall if the instruction is part
of a very small (64-byte) loop, since Core2 can use its instruction-fetch
queue as a loop buffer (before the decoders, but maybe with instruction
lengths already marked). Intel CPUs with a uop cache (SnB and later) are hurt
less often by LCP stalls, but in this case writing the partial register is a
bad idea anyway.

Clang produces an optimal sequence:

foo(int):                              # @foo(int)
        pushq   %rbx
        movslq  %edi, %rbx
        callq   get_table()
        movzwl  (%rax,%rbx,2), %eax
        andl    $1024, %eax            # imm = 0x400
        popq    %rbx
        retq

g++ 5.2.0 -O3 -march=native -mtune=native on godbolt (where native=haswell)
does this:

foo(int):
        pushq   %rbx                   #
        movslq  %edi, %rbx             # c,
        call    get_table()            #
        movzwl  (%rax,%rbx,2), %eax    # *_7, *_7
        popq    %rbx                   #
        andw    $1024, %ax             #, D.2584
        movzwl  %ax, %eax              # D.2584, D.2585
        ret

The last movzwl is there, I guess, to avoid the partial-register penalty from
writing %ax instead of %eax. The 3 instruction bytes for movzwl take more
space than was saved by using andw with an imm16 (4 bytes including the
operand-size prefix) instead of andl with an imm32 (5 bytes). Since we're
tuning for Haswell, we should omit the movzwl: Agner Fog says that Haswell
has no visible perf penalties for writing partial registers. SnB/IvB still
need an extra uop, but maybe have a smaller penalty than earlier CPUs. It
sounds like Core2 takes 2-3 extra cycles to insert the extra uop. Maybe
that's not why gcc was using movzwl?

bar compiles to:

        ...
        movzwl  (%rax,%rbx,2), %eax    # *_7, *_7
        popq    %rbx                   #
        andl    $8, %eax               #, D.2604
        movzwl  %ax, %eax              # D.2604, D.2605
        ret

That's AND r32, imm8, so gcc doesn't try to save space by using the
LCP-penalty-inducing imm16 version this time. However, it zero-extends twice:
once as part of the load, and again after the AND. Since the high bit of the
16-bit word can't be set after the AND, the 3-byte movzwl could be replaced
with the one-byte CWDE to sign-extend %ax into %eax, but either way it just
shouldn't be there, wasting space in the uop cache. This is probably still a
target bug, since ARM and ARM64 don't have an extra instruction after the AND.
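For reference, here is a rough sketch of the extra indirection that the
simplified get_table() test case strips out. This is only my recollection of
the shape of glibc's <ctype.h> internals: the __ctype_b_loc() prototype
matches what ctype.h declares, but the exact macro body and the _IS* mask
values are endian-dependent and vary between glibc versions, so treat the
constant as illustrative rather than authoritative:

/* __ctype_b_loc() returns a pointer to a per-thread table pointer,
   so the real macros do one more load than get_table()[c] does.  */
extern const unsigned short int **__ctype_b_loc (void);

int isalpha_like (int c)
{
    /* 1U<<10 matches foo() above and the 0x400 mask seen in the libc
       disassembly further down; glibc derives its real mask constants
       from an endian-dependent _ISbit() macro.  */
    return (*__ctype_b_loc ())[c] & (1U << 10);
}

Compiling this sketch instead of foo() should show the same andw + movzwl
tail plus the extra pointer load, though I haven't re-checked it on every
affected gcc version.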
This is a regression from older g++; 4.4.7 produced:

foo(int):
        pushq   %rbx                   #
        movl    %edi, %ebx             # c, c
        call    get_table()            #
        movslq  %ebx, %rbx             # c, c
        movzwl  (%rax,%rbx,2), %eax    #* D.2216, tmp61
        popq    %rbx                   #
        andl    $1024, %eax            #, tmp61
        ret

Note the 32-bit mov before the call, then the sign extension after. Other
than that, it's optimal.

This bug affects the C front end, not just C++ (which is all godbolt gives
easy access to for testing):

objdump -d /lib/x86_64-linux-gnu/libc-2.21.so   # on x86-64 Ubuntu 15.04:

000000000002e190 <isalpha>:
   2e190:  48 8b 05 69 5c 39 00   mov    0x395c69(%rip),%rax  # 3c3e00 <_IO_file_jumps+0x5c0>
   2e197:  48 63 ff               movslq %edi,%rdi
   2e19a:  64 48 8b 00            mov    %fs:(%rax),%rax
   2e19e:  0f b7 04 78            movzwl (%rax,%rdi,2),%eax
   2e1a2:  66 25 00 04            and    $0x400,%ax
   2e1a6:  0f b7 c0               movzwl %ax,%eax
   2e1a9:  c3                     retq

Note that the call to __ctype_b_loc is inlined into the libc function, but we
still see the LCP-penalty-inducing AND imm16 and a redundant movzwl %ax, %eax.
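To make it clearer what the inlined __ctype_b_loc amounts to in the objdump
above: the mov %fs:(%rax),%rax is a thread-local load of the table pointer,
and the movzwl then indexes that table. A minimal sketch, using a made-up TLS
variable name (glibc's real TLS layout and symbol names differ):

/* hypothetical stand-in for glibc's per-thread ctype table pointer */
static __thread const unsigned short int *ctype_b_ptr;

int isalpha_sketch (int c)
{
    /* the thread-local pointer load becomes a %fs-relative mov,
       like the mov %fs:(%rax),%rax in the isalpha disassembly */
    return ctype_b_ptr[c] & 0x400;
}

The interesting part is the tail: ideally the & 0x400 becomes a single andl
on a register that the movzwl load already zero-extended, as in the clang
output earlier, rather than an andw followed by a second movzwl.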