https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67780
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Excess instructions for     |Excess instructions for
                   |when returning an int from  |returning an int from a
                   |testing a bit in a uint16_t |bit-test in a uint16_t
                   |table, including            |table, including
                   |slow-decode-on-Intel        |slow-decode-on-Intel
                   |length-changing prefix.     |length-changing prefix.
                   |Seen with ctype (isalnum)   |Seen with ctype (isalnum)

--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
title cleanup

side note:  I searched glibc for other cases of length-changing prefixes, with
a search pattern of   \$.*,%(.[xip]|r[0-9]+w)$

That regex looks for a $ sign for immediate data with a 16bit register
destination (a two-character reg name ending in [xip], or a numbered one
ending in w).  16bit immediate constants in ALU instructions with a memory
operand are LCP stalls, too.  SnB and newer don't stall on mov r16, imm16,
just on ALU.

Agner Fog's objconv disassembler / converter (http://agner.org/optimize/) by
default inserts comments about length-changing-prefix instructions, so that's
a more reliable way to look for LCPs.

There are quite a few cases of gcc saving one instruction byte at the cost of
a length-changing prefix (imm32 vs. operand-size-prefix + imm16).  This is a
terrible choice for code that should run well on Intel, esp. pre-Sandybridge
(no uop cache).

e.g. (IDK what source lines generated this, but it's from Ubuntu 15.04's
glibc):

   291dc:       0f b7 41 fe             movzwl -0x2(%rcx),%eax
   291e0:       66 c1 c8 08             ror    $0x8,%ax
   291e4:       8d b8 00 28 00 00       lea    0x2800(%rax),%edi
   291ea:       66 81 ff ff 07          cmp    $0x7ff,%di
   291ef:       77 2f                   ja     29220 <__gconv_get_alias_db+0x7020>
   ...
   29220:       0f b7 c0                movzwl %ax,%eax   # control can only reach here from the preceding ja
   29223:       89 45 00                mov    %eax,0x0(%rbp)

I think this is another example of a redundant movzwl: the lea that computes
edi=0x2800+rax already used the full width of rax (leading to a
partial-register extra uop on Intel pre-Haswell).  The cmp only tested the low
16 bits of the result, so it would still work even if the movzwl load hadn't
zeroed the upper bits.

I'm not sure if edi is just a scratch reg at this point, or if it's useful to
have edi=rax+0x2800.  If not, then unless I'm making a logic error, it would
be a *lot* better to do:

        ror    $0x8,%ax
        cmp    $0x7ff-0x2800, %eax
        jg

(jg instead of ja, because we're now comparing against a negative constant.  I
think we can get away with this because we're using a wider type than the
original data.  I'm not sure if this is always equivalent when adding 0x2800
would overflow the low 16 bits that the original code checks.)

Even if the lea is needed, it's probably best to just use cmp imm32
regardless, unless tuning for CPUs that don't have LCP decoder stalls, or
unless this code is very cold.  (Pre-Sandybridge, a 16B fetch block containing
any LCPs stalls for ~6 cycles.  SnB and later stall for 2-3 cycles per LCP, so
multiple LCP instructions are worse than one, even in the same 16B block.)

----

I don't see a good alternative to the ror by 8, since bswap is only available
for r32 and r64 operands, not r16.  movbe is only available on Atom / Haswell,
and doesn't zero-extend.

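For what it's worth, here is a rough C sketch of what the original
movzwl / ror / lea / cmp sequence appears to compute: a 16-bit load, a byte
swap, and a wrapping range check.  The function and variable names are
invented for illustration; this is a guess at the shape of the operation, not
the actual glibc source:

#include <stdint.h>
#include <string.h>

/* Sketch only, not the real glibc code.  In 16-bit arithmetic,
 * (x + 0x2800) <= 0x7ff is the same test as 0xd800 <= x && x <= 0xdfff,
 * because 0x2800 == 0x10000 - 0xd800. */
int swapped_in_range(const unsigned char *p)
{
    uint16_t x;
    memcpy(&x, p, sizeof x);                 /* the movzwl load                    */
    x = (uint16_t)((x >> 8) | (x << 8));     /* byte swap: what ror $0x8,%ax does  */
    return (uint16_t)(x + 0x2800) <= 0x7ff;  /* lea 0x2800(%rax),%edi / cmp $0x7ff,%di;
                                                the ja branches on the not-in-range case */
}

All the asm sequences in this comment are just different ways of doing that
check.
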
Still, the best sequence for CPUs that do have movbe might be:

        movbe  -2(%rcx), %ax
        movzwl %ax, %eax
        cmp    $0x7ff-0x2800, %eax
        jg

That also avoids reading a 32bit reg after writing a 16bit part of it, which
makes pre-Haswell CPUs insert an extra uop to merge the results, possibly with
a penalty larger than you'd expect for just one uop, according to Agner Fog.
(I haven't tested this.)

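Also, the "is this always equivalent?" question above is quick to settle by
brute force over all 2^16 possible values of %ax after the ror.  This little
standalone program (mine, just for checking, not part of the bug) compares the
original 16-bit branch condition against the proposed wider cmp/jg one:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* x models the zero-extended value sitting in %eax after the ror. */
    unsigned mismatches = 0;
    for (uint32_t x = 0; x <= 0xffff; x++) {
        int orig = (uint16_t)(x + 0x2800) > 0x7ff;  /* lea ; cmp $0x7ff,%di ; is ja taken?   */
        int prop = (int32_t)x > 0x7ff - 0x2800;     /* cmp $0x7ff-0x2800,%eax ; is jg taken? */
        if (orig != prop)
            mismatches++;
    }
    printf("%u of 65536 inputs disagree\n", mismatches);
    return 0;
}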