https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267
H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|hjl at gcc dot gnu.org      |hjl.tools at gmail dot com

--- Comment #2 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Peter Cordes from comment #0)
> x32 defaults to using 32-bit address-size everywhere, it seems. (Apparently
> introduced by rev 185396 for bug 50797, which introduced
> -maddress-mode=short and made it the default.)
>
> This takes an extra 1-byte prefix on every instruction with a memory
> operand. It's not just code-size; this is potentially a big throughput
> problem on Intel Silvermont where more than 3 prefixes (including mandatory
> prefixes and 0F escape bytes for SSE and other instructions) cause a stall.
> These are exactly the systems where a memory-saving ABI might be most
> useful. (I'm not building one, I just think x32 is a good idea if
> implemented optimally.)
>
> long long doublederef(long long **p){
>     return **p;
> }
>   // https://godbolt.org/g/NHbURq
>   gcc8 -mx32 -O3
>         movl    (%edi), %eax    # 0x67 prefix
>         movq    (%eax), %rax    # 0x67 prefix
>         ret
>
> The second instruction is 1 byte longer for no reason: it needs a 0x67
> address-size prefix to encode.
> But we know for certain that the address is already zero-extended into %rax,
> because we just put it there. Also, the ABI requires p to be zero-extended
> to 64 bits, so it would be safe to use `movl (%rdi), %eax` as the first
> instruction.
>
> Even (%rsp) is avoided for some reason, even though -mx32 still uses
> push/pop/call/ret which use the full %rsp, so it has to be valid.
>
> int stackuse(void) {
>     volatile int foo = 2;
>     return foo * 3;
> }
>         movl    $2, -4(%esp)    # 0x67 prefix
>         movl    -4(%esp), %eax  # 0x67 prefix

We can encode (%esp) as (%rsp) since the upper bits of RSP are zero.

>         leal    (%rax,%rax,2), %eax   # no prefixes
>         ret
>
>
> Compiling with -maddress-mode=long appears to generate optimal code for all
> the simple test cases I looked at, e.g.
>
>         movl    $2, -4(%rsp)    # no prefixes
>         movl    -4(%rsp), %eax  # no prefixes
>         leal    (%rax,%rax,2), %eax   # no prefixes
>         ret
>
> -maddress-mode=long still uses an address-size prefix instead of an LEA to
> make sure addresses wrap at 4G, and to ignore high garbage in registers:
>
> long long fooi(long long *arr, int offset){
>     return arr[offset];
> }
>         movq    (%edi,%esi,8), %rax   # same for mode=short or long.
>         ret
>
> Are there still cases where -maddress-mode=long makes worse code?

Yes, there are more places where -maddress-mode=long needs to zero-extend
the address to 64 bits, which the 0x67 prefix does for you.

> ----
>
> Is it really necessary for an unsigned offset to wrap at 4G? Does ISO C
> or GNU C guarantee that large unsigned values work like negative signed
> integers when used for pointer arithmetic?
>
> // 64-bit offset so it won't have high garbage
> long long fooull(long long *arr, unsigned long long offset){
>     return arr[offset];
> }
>
>         movq    (%edi,%esi,8), %rax   # but couldn't this be (%rdi,%rsi,8)
>         ret
>
> Allowing 64-bit addressing modes with unsigned indexes could potentially
> save significant code-size, couldn't it?
>
> address-mode=long already allows constant offsets to go outside 4G, for
> example:
>
> foo_constant:   # return arr[123456];
>         movq    987648(%rdi), %rax
>         ret
>
> But it does treat the offset as signed, so 0xffffffffULL will movq
> -8(%rdi), %rax.
>
> The ABI doc (https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI) doesn't
> specify anything about C pointer-wrapping semantics, and I don't know where
> else to look to find out what behaviour is required/guaranteed and what is
> just how the current implementation happens to work.
>
> Anyway, this is a side-track from the issue of not using address-size
> prefixes in single-pointer cases where it's already zero extended.
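
To make the wrap-at-4G question concrete, here is a minimal C sketch (not taken from the bug report or from GCC) that models how the hardware forms an effective address with and without the 0x67 address-size prefix in 64-bit mode. The function names `ea32` and `ea64` are hypothetical, chosen for illustration only. With the prefix, the sum is computed modulo 2^32 and zero-extended; without it, the full 64-bit registers are used, so high garbage or a large unsigned index changes the result.

```c
#include <stdint.h>

/* Hypothetical model of base + index*scale + disp with 64-bit address size
 * (no 0x67 prefix): the full 64-bit register values take part in the sum. */
uint64_t ea64(uint64_t base, uint64_t index, unsigned scale, int32_t disp)
{
    /* The displacement is sign-extended to 64 bits before being added. */
    return base + index * scale + (uint64_t)(int64_t)disp;
}

/* Hypothetical model of the same addressing mode with the 0x67 prefix:
 * only the low 32 bits of each register are used, the sum wraps at 4G,
 * and the result is zero-extended to 64 bits. */
uint64_t ea32(uint64_t base, uint64_t index, unsigned scale, int32_t disp)
{
    uint32_t sum = (uint32_t)base + (uint32_t)index * scale + (uint32_t)disp;
    return (uint64_t)sum;
}
```

In this model the two forms agree for a zero-extended base and a small non-negative index, which is the single-pointer case the report is about. They differ once the index carries high garbage or is large enough that the 32-bit sum wraps, which is why the prefix (or an explicit zero-extension) is still needed in the indexed cases.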
>
> ---------
>
> SSSE3 and later instructions need 66 0F 3A/38 before the opcode, so an
> address-size or REX prefix will cause a decode stall on Silvermont. With

That is true.

> the default x32 behaviour, even SSE2 instructions (66 0F opcode) will cause
> decode stalls with a REX and address-size prefix. e.g. paddb (%r8d), %xmm8
> or even movdqa (but not movaps or other SSE1 instructions). Fortunately KNL
> isn't really affected: VEX/EVEX is fine unless there's a segment prefix
> before it, but Agner Fog seems to be saying that other prefixes are fine.
>
> In integer code, REX + operand-size + address-size + a 0F escape byte would
> be a problem for Silvermont/KNL, e.g. imul (%edi), %r10w needs all 4.
> movbe %ax, (%edi) has 4 prefixes, including the 2 mandatory escape bytes: 67
> 66 0f 38 f1 07.
>
>
> In-order Atom also has "severe delays" (according to
> http://agner.org/optimize/) with more than 3 prefixes, but unlike
> Silvermont, that apparently doesn't include mandatory prefixes for SSE
> instructions. Similarly, Bulldozer-family has a 3-prefix limit, but doesn't
> count escape bytes, and VEX only counts as 0 or 1 (for 2/3 byte VEX).

But 0x67 prefix is still better.
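
The prefix arithmetic quoted above can be checked with a small C sketch. It is a simplified model, not a decoder: it assumes a well-formed legacy-encoded instruction (no VEX/EVEX) and counts legacy prefixes, one REX byte, and the 0F / 0F 38 / 0F 3A escape bytes, following the counting rule the report attributes to Silvermont. The function name `count_prefix_bytes` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Legacy prefixes: operand-size, address-size, LOCK, REP/REPNE, segments. */
static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0x66: case 0x67: case 0xF0: case 0xF2: case 0xF3:
    case 0x2E: case 0x36: case 0x3E: case 0x26: case 0x64: case 0x65:
        return 1;
    default:
        return 0;
    }
}

/* Hypothetical helper: count prefix and escape bytes in front of the opcode
 * of a 64-bit-mode instruction. Assumes a valid non-VEX encoding in which
 * legacy prefixes come first, then an optional REX, then escape bytes. */
int count_prefix_bytes(const uint8_t *insn, size_t len)
{
    size_t i = 0;
    int n = 0;

    while (i < len && is_legacy_prefix(insn[i])) {
        n++;
        i++;
    }
    if (i < len && (insn[i] & 0xF0) == 0x40) {   /* REX is 0x40..0x4F */
        n++;
        i++;
    }
    if (i < len && insn[i] == 0x0F) {            /* first escape byte */
        n++;
        i++;
        if (i < len && (insn[i] == 0x38 || insn[i] == 0x3A)) {
            n++;                                 /* second escape byte */
        }
    }
    return n;
}
```

For the `movbe %ax, (%edi)` encoding given in the report (67 66 0f 38 f1 07) this model counts 4, and dropping the 0x67 byte, as `(%rdi)` would allow, brings it down to 3, which is within the limit described above.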