I meant to send this to the "external" Linaro toolchain mailing list,
not the internal CS one. Apologies to those who receive it twice!
In a follow-up message, Joseph Myers pointed out a post he'd written
previously on the same subject:
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html
In further follow-ups (at the risk of misrepresenting Joseph's and Paul
Brook's opinions!), there seemed to be general agreement that a scheme
something like the one outlined below, with "permuting" loads/stores
and some way of handling multiple in-register layouts for vectors, will
be a necessary addition to the vectorizer going forward.
Julian
Begin forwarded message:
Date: Thu, 7 Oct 2010 16:45:17 +0100
From: Julian Brown
To: Ira Rosen
Cc: Tejas Belagod, Linaro List
Subject: [gnu-linaro-tools] NEON vectorization: use of specialized load/store instructions
Hi,
We're having some system issues, so I thought I'd take the chance to
write down some things I've been thinking about re: utilising the NEON
load/store instructions more effectively. I've also attempted to
summarize the problems with big-endian mode. All unverified as of yet,
so please take with a pinch of salt :-). Comments appreciated. It's
been a while since I last thought about some of this stuff...
Cheers,
Julian
Use of specialized load instructions
GCC currently lacks a couple of key features needed to provide good
support for NEON's element and structure load/store instructions:
1. A good way of representing a set of two, three or four vector
registers (either D- or Q-sized), possibly with non-unit stride.
2. A generalised mapping between memory locations and lane numbers.
To start with point 1: currently the element and structure load/store
instructions are only supported via intrinsics. These are specified to
load and store as if going via an array embedded in a union, i.e.:
typedef struct int8x8x2_t
{
  int8x8_t val[2];
} int8x8x2_t;

__extension__ static __inline int8x8x2_t __attribute__ ((__always_inline__))
vld2_s8 (const int8_t * __a)
{
  union { int8x8x2_t __i; __builtin_neon_ti __o; } __rv;
  __rv.__o = __builtin_neon_vld2v8qi ((const __builtin_neon_qi *) __a);
  return __rv.__i;
}
Even for a trivial test program, e.g.:
#include <arm_neon.h>

int foo (int8_t *x)
{
  int8x8x2_t result = vld2_s8 (x);
  return vget_lane_s8 (result.val[0], 1);
}
We will generate code like so:
sub sp, sp, #32
vld2.8 {d16-d17}, [r0]
mov r3, sp
vstmia sp, {d16-d17}
add ip, sp, #16
ldmia r3, {r0, r1, r2, r3}
stmia ip, {r0, r1, r2, r3}
fldd d16, [sp, #16]
vmov.s8 r0, d16[1]
add sp, sp, #32
bx lr
I.e., rather than being used directly, the registers loaded by vld2
will always be spilled to the stack then reloaded. This obviously
reduces the usefulness of these intrinsics by a large factor. With some
planning, it'd be good to find a powerful enough solution to this
problem so that the same representation for multiple registers can be
used by the autovectorizer as well as the intrinsic-handling code.
(One difficulty is that the "foo.val[X]" interface should still be
available to user code. There's probably no need for "val" to literally
be an array, though other representations would require front-end
changes).
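For reference, the de-interleaving that vld2_s8 performs, and that any
replacement representation has to preserve, can be modelled in portable
C. This is only a sketch: the model_* names are made up for
illustration and are not part of arm_neon.h, and plain arrays stand in
for the D registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable stand-ins for int8x8_t and int8x8x2_t. */
typedef struct { int8_t lane[8]; } model_int8x8_t;
typedef struct { model_int8x8_t val[2]; } model_int8x8x2_t;

/* Model of vld2_s8: read 16 consecutive bytes and de-interleave them,
   so val[0] gets the even-indexed elements and val[1] the odd ones.  */
model_int8x8x2_t model_vld2_s8 (const int8_t *a)
{
  model_int8x8x2_t r;
  for (int i = 0; i < 8; i++)
    {
      r.val[0].lane[i] = a[2 * i];
      r.val[1].lane[i] = a[2 * i + 1];
    }
  return r;
}

/* Model of foo () from the test program: lane 1 of val[0] is x[2].  */
int model_foo (const int8_t *x)
{
  model_int8x8x2_t result = model_vld2_s8 (x);
  return result.val[0].lane[1];
}

/* Usage example: with x[i] == i, foo returns 2.  */
void model_vld2_example (void)
{
  int8_t x[16];
  for (int i = 0; i < 16; i++)
    x[i] = (int8_t) i;
  assert (model_foo (x) == 2);
}
```

Nothing in this model needs the result to live in memory, which is the
point: the val[X] view only has to name one register of the set.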
Assuming it's hard for the register allocator to deal with
highly-constrained situations like requiring four consecutive
registers, one (ugly) possibility might be to run a pass before
register allocation, looking for "big" multi-register vectors and
pre-allocating them to hard registers. Even using a fixed allocation of
a single set of registers (e.g. make it so that all multi-reg
loads/stores larger than a Q register must use d0-d7, or whatever)
would probably give better code than what we produce at present, in
most cases.
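As a rough sketch of the search such a pre-allocation pass would need,
the function below looks for the lowest-numbered run of free D
registers with a given stride. It is a toy model, not GCC code:
find_reg_run and the free_mask bitmask (bit n set means dn is free) are
hypothetical, and a register file of d0-d31 is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Return the lowest base register B such that the COUNT registers
   B, B+STRIDE, ..., B+(COUNT-1)*STRIDE are all free in FREE_MASK,
   or -1 if there is no such run within d0-d31.  */
int find_reg_run (uint32_t free_mask, int count, int stride)
{
  if (count < 1 || stride < 1)
    return -1;

  int span = (count - 1) * stride;
  for (int base = 0; base + span < 32; base++)
    {
      int ok = 1;
      for (int k = 0; k < count; k++)
        if (!((free_mask >> (base + k * stride)) & 1u))
          {
            ok = 0;
            break;
          }
      if (ok)
        return base;
    }
  return -1;
}

/* Usage example: d3 is busy, so four consecutive registers
   must start at d4.  */
void find_reg_run_example (void)
{
  uint32_t free_mask = 0xF7u;   /* d0-d2 and d4-d7 free */
  assert (find_reg_run (free_mask, 4, 1) == 4);
}
```

The fixed-allocation variant suggested above corresponds to skipping
the search and always returning the same base register.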
Now, point 2. To start with, an aside: AIUI, there is currently an
assumption in the vectoriser code that increasing element numbers in
vector registers correspond to increasing addresses when those
registers are loaded from and stored to memory (as if the vector was a
short array, or alternatively as if a union of the vector register and
an array of element-types had the same numberings for lanes and array
indices corresponding to the same elements). Unfortunately that is only
true for NEON in little-endian mode: in big-endian mode, the story is
more complicated, for reasons I will try to explain.
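To illustrate the numbering assumption, here is a small portable model
(not NEON code). It assumes that lane 0 is the least-significant part
of a 64-bit D register, and that the register is filled by a single
64-bit load in the memory byte order, which may or may not be the load
actually used in a given case. The helper names are hypothetical. Under
those assumptions, for four 16-bit elements, lane i holds a[i] in
little-endian mode but a[3 - i] in big-endian mode.

```c
#include <assert.h>
#include <stdint.h>

enum { NELEMS = 4 };   /* four 16-bit elements fill one 64-bit register */

/* Write 16-bit elements to byte-addressed memory in the given byte order.  */
void store_elems (uint8_t mem[8], const uint16_t a[NELEMS], int big_endian)
{
  for (unsigned i = 0; i < NELEMS; i++)
    {
      uint8_t lo = (uint8_t) (a[i] & 0xffu);
      uint8_t hi = (uint8_t) (a[i] >> 8);
      mem[2 * i]     = big_endian ? hi : lo;
      mem[2 * i + 1] = big_endian ? lo : hi;
    }
}

/* Model one 64-bit load: combine 8 bytes in the given byte order.  */
uint64_t load64 (const uint8_t mem[8], int big_endian)
{
  uint64_t reg = 0;
  for (unsigned i = 0; i < 8; i++)
    {
      unsigned shift = big_endian ? 8 * (7 - i) : 8 * i;
      reg |= (uint64_t) mem[i] << shift;
    }
  return reg;
}

/* Lane N is taken to be bits [16N+15:16N] of the register (assumption).  */
uint16_t get_lane (uint64_t reg, unsigned lane)
{
  return (uint16_t) (reg >> (16 * lane));
}

/* Usage example: same array, both byte orders.  */
void lane_order_example (void)
{
  const uint16_t a[NELEMS] = { 10, 20, 30, 40 };
  uint8_t mem[8];

  store_elems (mem, a, 0);
  assert (get_lane (load64 (mem, 0), 0) == a[0]);   /* little-endian */

  store_elems (mem, a, 1);
  assert (get_lane (load64 (mem, 1), 0) == a[3]);   /* big-endian */
}
```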
To remain compliant with the soft-float variant of the ARM EABI, we
must pass vector arguments in ARM core registers (or on the stack),
not in vector registers. This means that we must be very careful with the
ordering of elements for values passed to functions. Consider the
trivial function:
int __attribute__((noinline)) qux (int16x8_t x)
{
  x = vaddq_s16 (x, x);
  return vgetq_lane_s16 (x, 1);
}
This is compiled by GCC to the following (slightly unimpressively):
vmov