[Bug target/82267] New: x32: unnecessary address-size prefixes. Why isn't -maddress-mode=long the default?

peter at cordes dot ca Tue, 19 Sep 2017 22:07:17 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267


            Bug ID: 82267
           Summary: x32: unnecessary address-size prefixes.  Why isn't
                    -maddress-mode=long the default?
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: ABI, missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*

x32 defaults to using 32-bit address-size everywhere, it seems.  (Apparently
introduced by rev 185396 for bug 50797, which introduced -maddress-mode=short
and made it the default.)

This takes an extra 1-byte prefix on every instruction with a memory operand. 
It's not just code-size; this is potentially a big throughput problem on Intel
Silvermont where more than 3 prefixes (including mandatory prefixes and 0F
escape bytes for SSE and other instructions) cause a stall.  These are exactly
the systems where a memory-saving ABI might be most useful.  (I'm not building
one, I just think x32 is a good idea if implemented optimally.)

long long doublederef(long long **p){
        return **p;
}
//  https://godbolt.org/g/NHbURq
gcc8 -mx32 -O3
        movl    (%edi), %eax          # 0x67 prefix
        movq    (%eax), %rax          # 0x67 prefix
        ret

The second instruction is 1 byte longer for no reason: the (%eax) addressing
mode needs a 0x67 address-size prefix to encode.
But we know for certain that the address is already zero-extended into %rax,
because the 32-bit load into %eax just put it there, so (%rax) would work
without a prefix.  Also, the ABI requires p to be zero-extended to 64 bits, so
it would be safe to use `movl (%rdi), %eax` as the first instruction.

Even (%rsp) is avoided for some reason, even though -mx32 still uses
push/pop/call/ret which use the full %rsp, so it has to be valid.

int stackuse(void) {
        volatile int foo = 2;
        return foo * 3;
}
        movl    $2, -4(%esp)            # 0x67 prefix
        movl    -4(%esp), %eax          # 0x67 prefix
        leal    (%rax,%rax,2), %eax     # no prefixes
        ret


Compiling with -maddress-mode=long appears to generate optimal code for all the
simple test cases I looked at, e.g.

        movl    $2, -4(%rsp)            # no prefixes
        movl    -4(%rsp), %eax          # no prefixes
        leal    (%rax,%rax,2), %eax     # no prefixes
        ret

-maddress-mode=long still uses an address-size prefix instead of an LEA to make
sure addresses wrap at 4G, and to ignore high garbage in registers:

long long fooi(long long *arr, int offset){
        return arr[offset];
}
        movq    (%edi,%esi,8), %rax    # same for mode=short or long.
        ret
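
To make the wrap-at-4G point concrete, here is a minimal C model of the two
effective-address calculations.  The names ea32 and ea64 are made up for this
illustration; they only model the hardware arithmetic (a 0x67 prefix truncates
the computed address to 32 bits and zero-extends it) and are not anything gcc
emits.

```c
#include <stdint.h>

/* Hypothetical helper: address as computed WITH a 0x67 address-size prefix.
 * The sum is truncated to 32 bits, so it wraps at 4G and any high garbage
 * in the base or index register is ignored. */
uint64_t ea32(uint64_t base, uint64_t index, unsigned scale, int32_t disp)
{
    return (uint32_t)(base + index * scale + (uint64_t)(int64_t)disp);
}

/* Hypothetical helper: address as computed WITHOUT the prefix.
 * The full 64-bit registers are used and nothing wraps at 4G. */
uint64_t ea64(uint64_t base, uint64_t index, unsigned scale, int32_t disp)
{
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

With arr at 0x1000 and offset = -1 sitting zero-extended in the index register
(0xFFFFFFFF), ea32 gives 0xFF8 (arr - 8), while ea64 gives 0x800000FF8, which
is outside the 4G address space.  That is why the indexed case keeps the
prefix, unlike the single-pointer cases above.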

Are there still cases where -maddress-mode=long makes worse code?

----

Is it really necessary for an unsigned offset to wrap at 4G?  Does ISO C or
GNU C guarantee that large unsigned values work like negative signed integers
when used for pointer arithmetic?

// 64-bit offset so it won't have high garbage
long long fooull(long long *arr, unsigned long long offset){
        return arr[offset];
}

        movq    (%edi,%esi,8), %rax    # but couldn't this be (%rdi,%rsi,8)
        ret

Allowing 64-bit addressing modes with unsigned indexes could potentially save
significant code-size, couldn't it?

address-mode=long already allows constant offsets to go outside 4G, for
example:

foo_constant:         #    return arr[123456];
        movq    987648(%rdi), %rax
        ret

But it does treat the offset as signed, so 0xffffffffULL produces
`movq -8(%rdi), %rax`.
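
The displacement arithmetic behind those two results can be checked with a
small C sketch.  This is only a model that reproduces the observed output
(scale the index by the element size, keep the low 32 bits, read them as
signed); it is an assumption about the effect, not a description of gcc's
internals, and disp32 is a made-up name.

```c
#include <stdint.h>

/* Hypothetical helper: the 32-bit displacement for arr[index] with elements
 * of elem_size bytes, modelled as the scaled index truncated to 32 bits and
 * reinterpreted as a signed value. */
int32_t disp32(uint64_t index, uint64_t elem_size)
{
    uint32_t u = (uint32_t)(index * elem_size);   /* keep the low 32 bits */

    /* Portable two's-complement reinterpretation of u as signed. */
    return (u & 0x80000000u) ? -(int32_t)(~u) - 1 : (int32_t)u;
}
```

For long long elements (8 bytes), disp32(123456, 8) is 987648, matching the
listing above, and disp32(0xffffffffULL, 8) is -8.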

The ABI doc (https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI) doesn't
specify anything about C pointer-wrapping semantics, and I don't know where
else to look to find out what behaviour is required/guaranteed and what is just
how the current implementation happens to work.

Anyway, this is a side-track from the issue of not using address-size prefixes
in single-pointer cases where it's already zero extended.

---------

SSSE3 and later instructions need 66 0F 3A/38 before the opcode, so an
address-size or REX prefix will cause a decode stall on Silvermont.  With the
default x32 behaviour, even SSE2 instructions (66 0F opcode) will cause decode
stalls with a REX and address-size prefix.  e.g. paddb (%r8d), %xmm8   or even
movdqa (but not movaps or other SSE1 instructions).  Fortunately KNL isn't
really affected: VEX/EVEX is fine unless there's a segment prefix before it,
and Agner Fog seems to be saying that other prefixes are fine.

In integer code, REX + operand-size + address-size + a 0F escape byte would be
a problem for Silvermont/KNL, e.g. imul (%edi), %r10w needs all 4.   movbe %ax,
(%edi) has 4 prefixes, including the 2 mandatory escape bytes: 67 66 0f 38 f1
07.
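
For checking encodings by hand, here is a rough C sketch that counts prefixes
the way the Silvermont rule is described above: legacy prefixes, a REX byte,
and the 0F / 0F 38 / 0F 3A escape bytes all count.  The function name is made
up, it assumes 64-bit mode (where 40..4F are REX), and it does not handle
VEX/EVEX.

```c
#include <stddef.h>

/* Legacy prefixes: operand-size, address-size, lock, rep/repne, segments. */
static int is_legacy_prefix(unsigned char b)
{
    switch (b) {
    case 0x66: case 0x67: case 0xF0: case 0xF2: case 0xF3:
    case 0x26: case 0x2E: case 0x36: case 0x3E: case 0x64: case 0x65:
        return 1;
    default:
        return 0;
    }
}

/* Hypothetical helper: count legacy prefixes + REX + escape bytes at the
 * start of one instruction's encoding. */
int count_prefix_bytes(const unsigned char *insn, size_t len)
{
    size_t i = 0;
    int n = 0;

    while (i < len && is_legacy_prefix(insn[i])) { n++; i++; }
    if (i < len && (insn[i] & 0xF0) == 0x40) { n++; i++; }   /* REX */
    if (i < len && insn[i] == 0x0F) {                         /* escape */
        n++; i++;
        if (i < len && (insn[i] == 0x38 || insn[i] == 0x3A))
            n++;                                              /* 2nd escape */
    }
    return n;
}
```

On the movbe bytes above (67 66 0f 38 f1 07) this returns 4.  My
hand-assembled encoding of imul (%edi), %r10w is 67 66 44 0f af 17 (an
assumption, not assembler output), which also counts as 4.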


In-order Atom also has "severe delays" (according to
http://agner.org/optimize/) with more than 3 prefixes, but unlike Silvermont,
that apparently doesn't include mandatory prefixes for SSE instructions. 
Similarly, Bulldozer-family has a 3-prefix limit, but doesn't count escape
bytes, and VEX only counts as 0 or 1 (for 2/3 byte VEX).
