https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267
Bug ID: 82267
Summary: x32: unnecessary address-size prefixes. Why isn't -maddress-mode=long the default?
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: ABI, missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*

x32 defaults to using 32-bit address-size everywhere, it seems. (Apparently introduced by rev 185396 for bug 50797, which introduced -maddress-mode=short and made it the default.)

This takes an extra 1-byte prefix on every instruction with a memory operand. It's not just code-size; this is potentially a big throughput problem on Intel Silvermont, where more than 3 prefixes (including mandatory prefixes and 0F escape bytes for SSE and other instructions) cause a stall. These are exactly the systems where a memory-saving ABI might be most useful. (I'm not building one, I just think x32 is a good idea if implemented optimally.)

    long long doublederef(long long **p){
        return **p;
    }
    // https://godbolt.org/g/NHbURq  gcc8 -mx32 -O3

        movl    (%edi), %eax    # 0x67 prefix
        movq    (%eax), %rax    # 0x67 prefix
        ret

The second instruction is 1 byte longer for no reason: it needs a 0x67 address-size prefix to encode. But we know for certain that the address is already zero-extended into %rax, because we just put it there.

Also, the ABI requires p to be zero-extended to 64 bits, so it would be safe to use `movl (%rdi), %eax` as the first instruction.

Even (%rsp) is avoided for some reason, even though -mx32 still uses push/pop/call/ret, which use the full %rsp, so it has to be valid.

    int stackuse(void) {
        volatile int foo = 2;
        return foo * 3;
    }

        movl    $2, -4(%esp)            # 0x67 prefix
        movl    -4(%esp), %eax          # 0x67 prefix
        leal    (%rax,%rax,2), %eax     # no prefixes
        ret

Compiling with -maddress-mode=long appears to generate optimal code for all the simple test cases I looked at, e.g.

        movl    $2, -4(%rsp)            # no prefixes
        movl    -4(%rsp), %eax          # no prefixes
        leal    (%rax,%rax,2), %eax     # no prefixes
        ret

-maddress-mode=long still uses an address-size prefix instead of an LEA to make sure addresses wrap at 4G, and to ignore high garbage in registers:

    long long fooi(long long *arr, int offset){
        return arr[offset];
    }

        movq    (%edi,%esi,8), %rax     # same for mode=short or long.
        ret

Are there still cases where -maddress-mode=long makes worse code?

----

Is it really necessary for an unsigned offset to wrap at 4G? Does ISO C or GNU C guarantee that large unsigned values work like negative signed integers when used for pointer arithmetic?

    // 64-bit offset so it won't have high garbage
    long long fooull(long long *arr, unsigned long long offset){
        return arr[offset];
    }

        movq    (%edi,%esi,8), %rax     # but couldn't this be (%rdi,%rsi,8)
        ret

Allowing 64-bit addressing modes with unsigned indexes could potentially save significant code-size, couldn't it?

address-mode=long already allows constant offsets to go outside 4G, for example:

    foo_constant:                       # return arr[123456];
        movq    987648(%rdi), %rax
        ret

But it does treat the offset as signed, so 0xffffffffULL will movq -8(%rdi), %rax.

The ABI doc (https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI) doesn't specify anything about C pointer-wrapping semantics, and I don't know where else to look to find out what behaviour is required/guaranteed and what is just how the current implementation happens to work.

Anyway, this is a side-track from the issue of not using address-size prefixes in single-pointer cases where it's already zero extended.

---------

SSSE3 and later instructions need 66 0F 3A/38 before the opcode, so an address-size or REX prefix will cause a decode stall on Silvermont. With the default x32 behaviour, even SSE2 instructions (66 0F opcode) will cause decode stalls with a REX and address-size prefix, e.g. paddb (%r8d), %xmm8 or even movdqa (but not movaps or other SSE1 instructions).
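In case it helps to check specific encodings against the Silvermont rule described above, here is a rough sketch of the counting in C. It is not a decoder: it only walks the leading bytes of a single 64-bit-mode instruction, assumes legacy prefixes come first, then an optional REX, then the 0F / 0F 38 / 0F 3A escape bytes, and it ignores VEX/EVEX entirely. The names (count_prefix_bytes, SILVERMONT_PREFIX_LIMIT, the byte arrays) are made up for illustration, and the paddb and movl byte sequences are my own hand encodings rather than assembler output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption taken from the report text: more than 3 such bytes stall decode. */
#define SILVERMONT_PREFIX_LIMIT 3

/* Hand-assembled encodings (illustrative, not copied from assembler output). */
static const uint8_t MOVL_EDI_EAX[]   = {0x67, 0x8b, 0x07};       /* movl (%edi), %eax */
static const uint8_t MOVL_RDI_EAX[]   = {0x8b, 0x07};             /* movl (%rdi), %eax */
static const uint8_t PADDB_R8D_XMM8[] = {0x67, 0x66, 0x45, 0x0f, 0xfc, 0x00};
                                                                  /* paddb (%r8d), %xmm8 */
static const uint8_t MOVBE_AX_EDI[]   = {0x67, 0x66, 0x0f, 0x38, 0xf1, 0x07};
                                                                  /* movbe %ax, (%edi) */

/* Legacy prefixes: operand/address size, lock/rep, segment overrides. */
static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0x66: case 0x67:
    case 0xF0: case 0xF2: case 0xF3:
    case 0x26: case 0x2E: case 0x36: case 0x3E: case 0x64: case 0x65:
        return 1;
    default:
        return 0;
    }
}

/* Count legacy prefixes + REX + escape bytes at the start of one instruction. */
int count_prefix_bytes(const uint8_t *insn, size_t len)
{
    size_t i = 0;
    int count = 0;

    assert(insn != NULL || len == 0);

    while (i < len && is_legacy_prefix(insn[i])) {
        i++;
        count++;
    }
    if (i < len && (insn[i] & 0xF0) == 0x40) {   /* REX, 64-bit mode only */
        i++;
        count++;
    }
    if (i < len && insn[i] == 0x0F) {            /* first escape byte */
        i++;
        count++;
        if (i < len && (insn[i] == 0x38 || insn[i] == 0x3A))
            count++;                             /* second escape byte */
    }
    return count;
}

/* Nonzero if the instruction would exceed the assumed 3-prefix limit. */
int exceeds_silvermont_limit(const uint8_t *insn, size_t len)
{
    return count_prefix_bytes(insn, len) > SILVERMONT_PREFIX_LIMIT;
}
```

With these inputs, count_prefix_bytes gives 1 for the 0x67-prefixed movl, 0 for the (%rdi) form, and 4 for both the paddb and the movbe encodings, so only the last two would be over the limit under this counting rule.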
Fortunately KNL isn't really affected: VEX/EVEX is fine unless there's a segment prefix before it, but Agner Fog seems to be saying that other prefixes are fine.

In integer code, REX + operand-size + address-size + a 0F escape byte would be a problem for Silvermont/KNL, e.g. imul (%edi), %r10w needs all 4. movbe %ax, (%edi) has 4 prefixes, including the 2 mandatory escape bytes: 67 66 0f 38 f1 07.

In-order Atom also has "severe delays" (according to http://agner.org/optimize/) with more than 3 prefixes, but unlike Silvermont, that apparently doesn't include mandatory prefixes for SSE instructions. Similarly, Bulldozer-family has a 3-prefix limit, but doesn't count escape bytes, and VEX only counts as 0 or 1 (for 2/3 byte VEX).
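Going back to the wrap-at-4G side-track for a moment, the difference between the two address sizes can be modelled in a few lines of C. This is only a sketch of the arithmetic: ea_addr32 and ea_addr64 are hypothetical helper names, and the model assumes that a 0x67-prefixed addressing mode computes base + index*scale + disp modulo 2^32 and zero-extends the result, while the unprefixed form computes the same sum modulo 2^64.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address with the 0x67 address-size prefix:
   the sum is truncated to 32 bits and zero-extended. */
uint64_t ea_addr32(uint64_t base, uint64_t index, unsigned scale, int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    uint64_t sum = base + index * scale + (uint64_t)(int64_t)disp;
    return (uint32_t)sum;
}

/* Effective address with 64-bit address size: no truncation,
   disp is sign-extended from 32 bits. */
uint64_t ea_addr64(uint64_t base, uint64_t index, unsigned scale, int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

Under this model, with arr at 0x1000, ea_addr32(0x1000, 0xffffffff, 8, 0) lands at 0xff8, the same place as -8(%rdi), while ea_addr64 with the same inputs lands at 0x800000ff8, outside the low 4G. High garbage in the index register (e.g. 0xdeadbeef00000001) only disappears in the 32-bit form. So whether (%rdi,%rsi,8) is a valid replacement depends on whether the wrapped result is something C code is allowed to rely on.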