On 14/10/11 11:42, Julian Brown wrote: > On Thu, 13 Oct 2011 16:12:17 +0100 > Richard Earnshaw <rearn...@arm.com> wrote: > >> On 13/10/11 15:56, Joseph S. Myers wrote: >>> Indeed, vector initializers are part of the target-independent GNU >>> C language and have target-independent semantics that the elements >>> go in memory order, corresponding to the target-independent >>> semantics of lane numbers where they appear in GENERIC, GIMPLE and >>> (non-UNSPEC) RTL and any target-independent built-in functions that >>> use such numbers. (The issue here being, as you saw, that the lane >>> numbers used in ARM-specific NEON intrinsics are for big-endian not >>> the same as those used in target-independent features of GNU C and >>> target-independent internal representations in GCC - hence various >>> code to translate them between the two conventions when processing >>> intrinsics into non-UNSPEC RTL, and to translate back when >>> generating assembly instructions that encode lane numbers with the >>> ARM conventions, as expounded at greater length at >>> <http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html>.) >>> >> >> This is all rather horrible, and leads to THREE different layouts for >> a 128-bit vector for big-endian Neon. >> >> GCC format >> 'VLD1.n' format >> 'ABI' format >> >> GCC format and 'ABI' format differ in that the 64-bit words of the >> 128-bit vector are swapped. >> >> All this and they are all expected to share a single machine mode. >> >> Furthermore, the definitions in GCC are broken, in that the types >> defined in arm_neon.h (eg int8x16_t) are supposed to be ABI format, >> not GCC format. >> >> Eukkkkkk! :-( > > FWIW, I thought long and hard about this problem, and eventually gave > up trying to solve it. Note that many operations which depend on the > ordering of vectors are now disabled entirely (at least for Q regs) in > neon.md in big-endian mode to try and limit the damage. NEON is > basically only supported properly in little-endian mode, IMO. > > I'd love to see this resolved properly. Some random observations: > > * The vectorizer can use whatever layout it wants for vectors in > either endianness. Vectorizer vectors never interact with either > GCC generic (source-level) vectors, nor the NEON intrinsics. Also > they never cross ABI boundaries. > > * GCC generic vectors aren't specified very formally, particularly wrt. > their interaction with NEON intrinsics. If you stick *entirely* to > accessing vectors via NEON intrinsics, the problems in big-endian > mode (I think) don't ever materialise. This includes not using > indirection to load/store vectors, and (of course) not constructing > vectors using { x, y, z... } syntax. One possibility might be to > detect and *disallow* code which attempts to mix vector operations > like that. > > I don't quite understand your comment about the GCC definitions of > int8x16_t etc. being broken, tbh... >
the 128-bit vectors are loaded as a pair of D regs, with D<n> holding the lower addressed D-word and D<n+1> holding the higher addressed D-word; but these are treated in a Q reg as {D<n+1>:D<n>}. On a big-endian machine that means D<n> contains the most significant lanes of the vector and D<n+1> the least significant lanes. For a big-endian view we really need to see these as {D<n>:D<n+1>} (read {a:b} as bit-wise concatenation of a and b). One way we might address this is to redefine our 128-bit vector types as structs of low/high Dwords. Each Dword remains a vector (apart from 64-bit lane types), but the Dword order then matches the ABI specification correctly. For example, the definition of uint8x16_t becomes typedef struct { uint8x8_t _val[2]; } uint8x16_t; that is we consider this to be a pair of 64-bit vectors. Obviously there would be similar definitions for the other vector types. This then gives the correct view on the world because D<n> is always _val[0] and D<n+1> is always _val[1]. Secondly, all vector loads/stores should really be changed to use vld1.64 (with {d<n>, d<n+1>} as the register list for 128-bit accesses) rather than vldm; this then sorts out any issues with unaligned accesses without changing the memory format. > Cheers, > > Julian >