https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87812

            Bug ID: 87812
           Summary: X86-64 Vector __m256 return ABI needs clarification
                    (discrepancy between implementations).
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: iains at gcc dot gnu.org
  Target Milestone: ---

For a processor that supports SSE, but not AVX.

the following code:

typedef int __attribute__((mode(QI))) qi;
typedef qi __attribute__((vector_size (32))) v32qi;

v32qi foo (int x)
{
 v32qi y = {'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f',
         '0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'};
 return y;
}

produces a warning " warning: AVX vector return without AVX enabled changes the
ABI [-Wpsabi]”.

so - the question is what is the resultant ABI in the changed case (since _m256
is supported for such processors)

====

Looking at the psABI v1.0 

* pp24 Returning of Values

The returning of values is done according to the following algorithm:

        • Classify the return type with the classification algorithm.

…
        • If the class is SSE, the next available vector register of the
sequence %xmm0, %xmm1 is used.

        • If the class is SSEUP, the eight byte is returned in the next
available eightbyte chunk of the last used vector register.

...

* classification algorithm : pp20

        • Arguments of type __m256 are split into four eightbyte chunks. The
least significant one belongs to class SSE and all the others to class SSEUP.

        • Arguments of type __m512 are split into eight eightbyte chunks. The
least significant one belongs to class SSE and all the others to class SSEUP.

*  footnote on pp21

12 The post merger clean up described later ensures that, for the processors
that do not support the __m256 type, if the size of an object is larger than
two eightbytes and the first eightbyte is not SSE or any other eightbyte is not
SSEUP, it still has class MEMORY. 

This in turn ensures that for processors that do support the __m256 type, if
the size of an object is four eightbytes and the first eightbyte is SSE and all
other eightbytes are SSEUP, it can be passed in a register. This also applies
to the __m512 type. That is for processors that support the __m512 type, if the
size of an object is eight eightbytes and the first eightbyte is SSE and all
other eightbytes are SSEUP, it can be passed in a register, otherwise, it will
be passed in memory.

---

However : the case where the processor does *not* support __m256 but the first
eightbyte *is* SSE and the following eighbytes *are* SSEUP is not clarified.

The intent for SSE seems clear - use a reg
The intent for following SSEUP is less clear.

Nevertheless, it seems to imply that the intent for processors with SSE that
the __m256 (and __m512) returns should be passed in xmm0:1(:3, maybe).

figure 3.4 pp23 does not clarify xmm* use for vector return at all - only
mentioning floating point.

===== status

In any event, GCC passes the vec32 return in memory,
LLVM conversely passes it in xmm0:1 (at least for the versions I’ve tried).

which leads to an ABI discrepancy when GCC is used to build code on systems
based on LLVM.

Please could the X86 maintainers clarify the intent (and maybe consider
enhancing the footnote classification notes to make things clearer)?

- and then we can figure out how to deal with the systems that are already
implemented - and how to move forward.

(as an aside, in any event, it seems inefficient to pass through memory when at
least xmm0:1 are already set aside for return value use).

Reply via email to