On 14/10/11 11:42, Julian Brown wrote:
> On Thu, 13 Oct 2011 16:12:17 +0100
> Richard Earnshaw <rearn...@arm.com> wrote:
> 
>> On 13/10/11 15:56, Joseph S. Myers wrote:
>>> Indeed, vector initializers are part of the target-independent GNU
>>> C language and have target-independent semantics that the elements
>>> go in memory order, corresponding to the target-independent
>>> semantics of lane numbers where they appear in GENERIC, GIMPLE and
>>> (non-UNSPEC) RTL and any target-independent built-in functions that
>>> use such numbers.  (The issue here being, as you saw, that the lane
>>> numbers used in ARM-specific NEON intrinsics are for big-endian not
>>> the same as those used in target-independent features of GNU C and
>>> target-independent internal representations in GCC - hence various
>>> code to translate them between the two conventions when processing
>>> intrinsics into non-UNSPEC RTL, and to translate back when
>>> generating assembly instructions that encode lane numbers with the
>>> ARM conventions, as expounded at greater length at
>>> <http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html>.)
>>>
>>
>> This is all rather horrible, and leads to THREE different layouts for
>> a 128-bit vector for big-endian Neon.
>>
>> GCC format
>> 'VLD1.n' format
>> 'ABI' format
>>
>> GCC format and 'ABI' format differ in that the 64-bit words of the
>> 128-bit vector are swapped.
>>
>> All this and they are all expected to share a single machine mode.
>>
>> Furthermore, the definitions in GCC are broken, in that the types
>> defined in arm_neon.h (eg int8x16_t) are supposed to be ABI format,
>> not GCC format.
>>
>> Eukkkkkk! :-(
> 
> FWIW, I thought long and hard about this problem, and eventually gave
> up trying to solve it. Note that many operations which depend on the
> ordering of vectors are now disabled entirely (at least for Q regs) in
> neon.md in big-endian mode to try and limit the damage. NEON is
> basically only supported properly in little-endian mode, IMO.
> 
> I'd love to see this resolved properly. Some random observations:
> 
>  * The vectorizer can use whatever layout it wants for vectors in
>    either endianness. Vectorizer vectors never interact with either
>    GCC generic (source-level) vectors, nor the NEON intrinsics. Also
>    they never cross ABI boundaries.
> 
>  * GCC generic vectors aren't specified very formally, particularly wrt.
>    their interaction with NEON intrinsics. If you stick *entirely* to
>    accessing vectors via NEON intrinsics, the problems in big-endian
>    mode (I think) don't ever materialise. This includes not using
>    indirection to load/store vectors, and (of course) not constructing
>    vectors using { x, y, z... } syntax. One possibility might be to
>    detect and *disallow* code which attempts to mix vector operations
>    like that.
> 
> I don't quite understand your comment about the GCC definitions of
> int8x16_t etc. being broken, tbh...
> 

The 128-bit vectors are loaded as a pair of D regs, with D<n> holding
the lower-addressed D-word and D<n+1> holding the higher-addressed
D-word; but in a Q reg these are treated as {D<n+1>:D<n>}. On a
big-endian machine the lower address is the more significant one, so
D<n> contains the most significant lanes of the vector and D<n+1> the
least significant lanes, which is the opposite of how the Q reg orders
them.  For a big-endian view we really need to see these as
{D<n>:D<n+1>} (read {a:b} as bit-wise concatenation of a and b).
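
To make the two concatenation orders concrete, here is a small portable C
sketch. It does not use NEON at all: load_d_be, concat_bytes and demo_views
are hypothetical helper names, and the 64-bit integers merely model what a
D reg would hold after a big-endian load of memory bytes 0..15.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of a D-reg load on a big-endian target: the byte at the lowest
   address becomes the most significant byte of the 64-bit value. */
static uint64_t load_d_be(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Write the 128-bit value {hi:lo} to out[], most significant byte first. */
static void concat_bytes(uint64_t hi, uint64_t lo, uint8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[i]     = (uint8_t)(hi >> (56 - 8 * i));
        out[8 + i] = (uint8_t)(lo >> (56 - 8 * i));
    }
}

/* Compare the Q-reg view {D<n+1>:D<n>} with the view {D<n>:D<n+1>}. */
static void demo_views(void)
{
    uint8_t mem[16], q_view[16], be_view[16];

    for (int i = 0; i < 16; i++)
        mem[i] = (uint8_t)i;

    uint64_t d_n  = load_d_be(mem);      /* lower-addressed D-word  */
    uint64_t d_n1 = load_d_be(mem + 8);  /* higher-addressed D-word */

    concat_bytes(d_n1, d_n, q_view);     /* {D<n+1>:D<n>} */
    concat_bytes(d_n, d_n1, be_view);    /* {D<n>:D<n+1>} */

    /* {D<n>:D<n+1>} reproduces memory order exactly. */
    assert(memcmp(be_view, mem, 16) == 0);
    /* {D<n+1>:D<n>} has the two 64-bit halves swapped. */
    assert(memcmp(q_view, mem + 8, 8) == 0);
    assert(memcmp(q_view + 8, mem, 8) == 0);
}
```

Calling demo_views() from a main() should pass silently; the swapped
halves in q_view are the 64-bit word swap discussed above.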

One way we might address this is to redefine our 128-bit vector types as
structs of low/high Dwords.  Each Dword remains a vector (apart from
64-bit lane types), but the Dword order then matches the ABI
specification correctly.  For example, the definition of uint8x16_t becomes

        typedef struct { uint8x8_t _val[2]; } uint8x16_t;

that is, we consider this to be a pair of 64-bit vectors.  Obviously
there would be similar definitions for the other vector types.  This
then gives the correct view of the world because D<n> is always _val[0]
and D<n+1> is always _val[1].
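
As a rough illustration of why the struct form pins down the D-word order,
the sketch below uses plain byte arrays in place of the real vector types.
sketch_u8x8, sketch_u8x16, sketch_load and demo_layout are hypothetical
stand-ins (the real uint8x8_t is a GCC vector type, not a struct), so this
only shows the layout argument, not how arm_neon.h would be written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for uint8x8_t and the proposed uint8x16_t. */
typedef struct { uint8_t lane[8]; } sketch_u8x8;
typedef struct { sketch_u8x8 _val[2]; } sketch_u8x16;

/* Load 16 bytes; array element order follows memory order, so _val[0]
   always receives the lower-addressed D-word. */
static sketch_u8x16 sketch_load(const uint8_t *p)
{
    sketch_u8x16 v;
    memcpy(&v, p, sizeof v);
    return v;
}

static void demo_layout(void)
{
    uint8_t mem[16];

    for (int i = 0; i < 16; i++)
        mem[i] = (uint8_t)i;

    sketch_u8x16 v = sketch_load(mem);

    assert(sizeof(sketch_u8x16) == 16);
    assert(offsetof(sketch_u8x16, _val) == 0);
    assert(v._val[0].lane[0] == 0);  /* would live in D<n>   */
    assert(v._val[1].lane[0] == 8);  /* would live in D<n+1> */
}
```

Because C array elements are laid out in increasing address order, this
holds on either endianness without any lane renumbering.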

Secondly, all vector loads/stores should really be changed to use
vld1.64 (with {d<n>, d<n+1>} as the register list for 128-bit accesses)
rather than vldm; this then sorts out any issues with unaligned accesses
without changing the memory format.
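
A portable model of the 128-bit access described here, as a sketch only:
sketch_dpair, sketch_vld1_64 and demo_unaligned are hypothetical names, and
the function just mimics a big-endian load of two 64-bit elements into
{d<n>, d<n+1>} from an arbitrary byte address. It says nothing about what
the hardware does on a misaligned vldm.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* d[0] models D<n>, d[1] models D<n+1>. */
typedef struct { uint64_t d[2]; } sketch_dpair;

/* Models a big-endian load of two 64-bit elements from any byte
   address; each element is read most significant byte first. */
static sketch_dpair sketch_vld1_64(const uint8_t *p)
{
    sketch_dpair r = { {0, 0} };

    for (int e = 0; e < 2; e++)
        for (int i = 0; i < 8; i++)
            r.d[e] = (r.d[e] << 8) | p[8 * e + i];
    return r;
}

static void demo_unaligned(void)
{
    uint8_t buf[17];

    for (int i = 0; i < 17; i++)
        buf[i] = (uint8_t)i;

    sketch_dpair a = sketch_vld1_64(buf);      /* start of the buffer      */
    sketch_dpair b = sketch_vld1_64(buf + 1);  /* one byte in, not aligned */

    assert(a.d[0] == 0x0001020304050607ULL);
    assert(a.d[1] == 0x08090a0b0c0d0e0fULL);
    assert(b.d[0] == 0x0102030405060708ULL);
    assert(b.d[1] == 0x090a0b0c0d0e0f10ULL);
}
```

In both cases d[0] holds the lower-addressed D-word, so the memory format
is the same whether or not the address is aligned.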

> Cheers,
> 
> Julian
> 

