https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67780
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Excess instructions for     |Excess instructions for
                   |when returning an int from  |returning an int from a
                   |testing a bit in a uint16_t |bit-test in a uint16_t
                   |table, including            |table, including
                   |slow-decode-on-Intel        |slow-decode-on-Intel
                   |length-changing prefix.     |length-changing prefix.
                   |Seen with ctype (isalnum)   |Seen with ctype (isalnum)

--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
title cleanup

side note:  I searched glibc for other cases of length-changing prefixes, with
a search pattern of   \$.*,%(.[xip]|r[0-9]+w)$

That regex looks for a $ sign for immediate data with a 16bit register
destination (a two-character reg name ending in [xip], or a numbered one
ending in w).  16bit immediate constants in ALU instructions with a memory
operand are LCP stalls, too.  SnB and newer don't stall on mov r16, imm16,
just on ALU.

Agner Fog's objconv disassembler / converter (http://agner.org/optimize/) by
default inserts comments about length-changing-prefix instructions, so that's
a more reliable way to look for LCPs.

There are quite a few cases of gcc saving one instruction byte at the cost of
a length-changing prefix (imm32 vs. operand-size-prefix + imm16).  This is a
terrible choice for code that should run well on Intel, esp. pre-Sandybridge
(no uop cache).

e.g. (IDK what source lines generated this, but it's from Ubuntu 15.04's
glibc):

   291dc:       0f b7 41 fe             movzwl -0x2(%rcx),%eax
   291e0:       66 c1 c8 08             ror    $0x8,%ax
   291e4:       8d b8 00 28 00 00       lea    0x2800(%rax),%edi
   291ea:       66 81 ff ff 07          cmp    $0x7ff,%di
   291ef:       77 2f                   ja     29220 <__gconv_get_alias_db+0x7020>
   ...
   29220:       0f b7 c0                movzwl %ax,%eax   # control can only reach here from the preceding ja
   29223:       89 45 00                mov    %eax,0x0(%rbp)

I think this is another example of a redundant movzwl: the lea that computes
edi=0x2800+rax already used the full width of rax (leading to a
partial-register extra uop on Intel pre-Haswell).  The cmp only tested the low
16 bits of the result, so it would still work even if the movzwl load hadn't
zeroed the upper bits.

I'm not sure if edi is just a scratch reg at this point, or if it's useful to
have edi=rax+0x2800.  If not, then unless I'm making a logic error, it would
be a *lot* better to do:

        ror    $0x8,%ax
        cmp    $0x7ff-0x2800, %eax
        jg

(jg instead of ja, because we're now comparing against a negative constant.  I
think we can get away with this because we're using a wider type than the
original data.  I'm not sure if this is always equivalent when adding 0x2800
would overflow the low 16 bits that the original code checks.)

Even if the lea is needed, it's probably best to just use cmp imm32
regardless, unless tuning for CPUs that don't have LCP decoder stalls, or
unless this code is very cold.  (Pre-Sandybridge, a 16B fetch block containing
any LCPs stalls for ~6 cycles.  SnB and later stall for 2-3 cycles per LCP, so
multiple LCP instructions are worse than one, even in the same 16B block.)

----

I don't see a good alternative to the ror by 8, since bswap is only available
for r32 and r64 operands, not r16.  movbe is only available on Atom / Haswell,
and doesn't zero-extend.

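For what it's worth, here is a rough C sketch of what the original
movzwl / ror / lea / cmp sequence appears to compute: a 16-bit load, a byte
swap, and a wrapping range check.  The function and variable names are
invented for illustration; this is a guess at the shape of the operation, not
the actual glibc source:

#include <stdint.h>
#include <string.h>

/* Sketch only, not the real glibc code.  In 16-bit arithmetic,
 * (x + 0x2800) <= 0x7ff is the same test as 0xd800 <= x && x <= 0xdfff,
 * because 0x2800 == 0x10000 - 0xd800. */
int swapped_in_range(const unsigned char *p)
{
    uint16_t x;
    memcpy(&x, p, sizeof x);                 /* the movzwl load                    */
    x = (uint16_t)((x >> 8) | (x << 8));     /* byte swap: what ror $0x8,%ax does  */
    return (uint16_t)(x + 0x2800) <= 0x7ff;  /* lea 0x2800(%rax),%edi / cmp $0x7ff,%di;
                                                the ja branches on the not-in-range case */
}

All the asm sequences in this comment are just different ways of doing that
check.
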
Still, the best sequence for CPUs that do have movbe might be:

        movbe  -2(%rcx), %ax
        movzwl %ax, %eax
        cmp    $0x7ff-0x2800, %eax
        jg

That also avoids reading a 32bit reg after writing a 16bit part of it, which
makes pre-Haswell CPUs insert an extra uop to merge the results, possibly with
a penalty larger than you'd expect for just one uop, according to Agner Fog.
(I haven't tested this.)

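Also, the "is this always equivalent?" question above is quick to settle by
brute force over all 2^16 possible values of %ax after the ror.  This little
standalone program (mine, just for checking, not part of the bug) compares the
original 16-bit branch condition against the proposed wider cmp/jg one:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* x models the zero-extended value sitting in %eax after the ror. */
    unsigned mismatches = 0;
    for (uint32_t x = 0; x <= 0xffff; x++) {
        int orig = (uint16_t)(x + 0x2800) > 0x7ff;  /* lea ; cmp $0x7ff,%di ; is ja taken?   */
        int prop = (int32_t)x > 0x7ff - 0x2800;     /* cmp $0x7ff-0x2800,%eax ; is jg taken? */
        if (orig != prop)
            mismatches++;
    }
    printf("%u of 65536 inputs disagree\n", mismatches);
    return 0;
}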