On Tue, Mar 22, 2011 at 5:52 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> This is an RFC about adding gimple and optab support for things like
> ARM's load-lane and store-lane instructions.  It builds on an earlier
> discussion between Ira and Julian, with the aim of allowing these
> instructions to be used by the vectoriser.
>
> These instructions operate on N vector registers of M elements each and
> on a sequence of 1 or M N-element structures.  They come in three forms:
>
>   - full load/store:
>
>       0<=I<N, 0<=J<M: register[I][J] = memory[J*N+I]
>
>     E.g., for N=3, M=4:
>
>       Registers            Memory
>       ----------------     ---------------
>       RRRR GGGG BBBB <---> RGB RGB RGB RGB
>
>   - lane load/store:
>
>       given L, 0<=I<N: register[I][L] = memory[I]
>
>     E.g., for N=3, M=4, L=2:
>
>       Registers            Memory
>       ----------------     ---------------
>       ..R. ..G. ..B. <---> RGB
>
>   - load-and-duplicate:
>
>       0<=I<N, 0<=J<M: register[I][J] = memory[I]
>
>     E.g., for N=3 V4HIs:
>
>       Registers            Memory
>       ----------------     ----------------
>       RRRR GGGG BBBB <---- RGB
>
> Starting points:
>
>   1) Memory references should be MEM_REFs at the gimple level.
>      We shouldn't add new tree codes for memory references.
>
>   2) Because of the large data involved (at least in the "full" case),
>      the gimple statement that represents the lane interleaving should
>      also have the MEM_REF.  The two shouldn't be split between
>      statements.
>
>   3) The ARM doubleword instructions allow the N vectors to be in
>      consecutive registers (DM, DM+1, ...) or in every second register
>      (DM, DM+2, ...).  However, the latter case is only interesting
>      if we're dealing with halves of quadword vectors.  It's therefore
>      reasonable to view the N vectors as one big value.
>
> (3) significantly simplifies things at the rtl level for ARM, because it
> avoids having to find some way of saying that N separate pseudos must
> be allocated to N consecutive hard registers.  If other targets allow the
> N vectors to be stored in arbitrary (non-consecutive) registers, then
> they could split the register up into subregs at expand time.
> The lower-subreg pass should then optimise things nicely.
>
> The easiest way of dealing with (1) and (2) seems to be to model the
> operations as built-in functions.  And if we do treat the N vectors as
> a single value, the load functions can simply return that value.  So we
> could have something like:
>
>   - full load/store:
>
>       combined_vectors = __builtin_load_lanes (memory);
>       memory = __builtin_store_lanes (combined_vectors);
>
>   - lane load/store:
>
>       combined_vectors = __builtin_load_lane (memory, combined_vectors, lane);
>       memory = __builtin_store_lane (combined_vectors, lane);
>
>   - load-and-duplicate:
>
>       combined_vectors = __builtin_load_dup (memory);
>
> We could then use normal component references to set or get the individual
> vectors of combined_vectors.  Does that sound OK so far?
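For reference, a scalar C model of the "full load" case for N=3, M=4
(purely illustrative; the struct and function names are made up, and this
is plain C, not the proposed gimple):

  #include <stdint.h>

  /* Model of combined_vectors = __builtin_load_lanes (memory) for
     N=3 vectors of M=4 elements each, i.e. what vld3 does on ARM.  */
  typedef struct { uint8_t v[3][4]; } combined_vectors_t;

  static combined_vectors_t
  load_lanes_model (const uint8_t memory[12])
  {
    combined_vectors_t r;
    for (int j = 0; j < 4; j++)          /* lane index (structure J) */
      for (int i = 0; i < 3; i++)        /* vector index (component I) */
        r.v[i][j] = memory[j * 3 + i];   /* register[I][J] = memory[J*N+I] */
    return r;
  }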
> The question then is: what type should combined_vectors have?  (At this
> point I'm just talking about types, not modes.)  The main possibilities
> seemed to be:
>
>   1. an integer type
>
>      Pros
>        * Gimple registers can store integers.
>
>      Cons
>        * As Julian points out, GCC doesn't really support integer types
>          that are wider than 2 HOST_WIDE_INTs.  It would be good to
>          remove that restriction, but it might be a lot of work.
>
>        * We're not really using the type as an integer.
>
>        * The combination of the integer type and the __builtin_load_lanes
>          array argument wouldn't be enough to determine the correct
>          load operation.  __builtin_load_lanes would need something
>          like a vector count argument (N in the above description) as well.
>
>   2. a vector type
>
>      Pros
>        * Gimple registers can store vectors.
>
>      Cons
>        * For vld3, this would mean creating vector types with
>          non-power-of-two vectors.  GCC doesn't support those yet, and
>          you get ICEs as soon as you try to use them.  (Remember that
>          this is all about types, not modes.)
>
>          It _might_ be interesting to implement this support, but as
>          above, it would be a lot of work.  It also raises some tricky
>          semantic questions, such as: what is the alignment of the new
>          vectors?  Which leads to...
>
>        * The alignment of the type would be strange.  E.g. suppose
>          we're dealing with M=2, and use uint32xY_t to represent a
>          vector of Y uint32_ts.  The types and alignments would be:
>
>            N=2  uint32x4_t, alignment 16
>            N=3  uint32x6_t, alignment 8 (if we follow the convention for modes)
>            N=4  uint32x8_t, alignment 32
>
>          We don't need alignments greater than 8 in our intended use;
>          16 and 32 are overkill.
>
>        * We're not really using the type as a single vector,
>          but as a collection of vectors.
>
>        * The combination of the vector type and the __builtin_load_lanes
>          array argument wouldn't be enough to determine the correct
>          load operation.  __builtin_load_lanes would need something
>          like a vector count argument (N in the above description) as well.
>
>   3. an array-of-vectors type
>
>      Pros
>        * No support for new GCC features (large integers or
>          non-power-of-two vectors) is needed.
>
>        * The alignment of the type would be taken from the alignment of
>          the individual vectors, which is correct.
>
>        * It accurately reflects how the loaded value is going to be used.
>
>        * The type uniquely identifies the correct load operation,
>          without need for additional arguments.  (This is minor.)
>
>      Cons
>        * Gimple registers can't store array values.

Simple.  Just make them registers anyway (I did that in the past when
working on middle-end arrays).  You'd set DECL_GIMPLE_REG_P on the decl.

4. a vector-of-vectors type

   Cons
     * I don't think we want that ;)

Using an array type sounds like the only sensible option to me apart
from using a large non-power-of-two vector type (but then you'd have
the issue of what operations operate on, see below).
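A minimal sketch of the DECL_GIMPLE_REG_P route, assuming GCC-internal
context ("atype" stands for an already-built array-of-vectors type; this
is a fragment, not a standalone program):

  /* Create an array-of-vectors temporary and mark it so the gimple
     optimisers may treat it as a register despite its aggregate type.  */
  tree tmp = create_tmp_var (atype, "vect_array");
  DECL_GIMPLE_REG_P (tmp) = 1;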
> So I think the only disadvantage of using an array of vectors is that the
> result can never be a gimple register.  But that isn't much of a disadvantage
> really; the things we care about are the individual vectors, which can
> of course be treated as gimple registers.  I think our tracking of memory
> values is good enough for combined_vectors to be treated as such.
>
> These arrays of vectors would still need to have a non-BLK mode,
> so that they can be stored in _rtl_ registers.  But we need that anyway
> for ARM's arm_neon.h; the code that today's GCC produces for the intrinsic
> functions is very poor.
>
> So how about the following functions?  (Forgive the pascally syntax.)
>
>     __builtin_load_lanes (REF : array N*M of X)
>       returns array N of vector M of X
>       maps to vldN on ARM
>       in practice, the result would be used in assignments of the form:
>         vectorY = ARRAY_REF <result, Y>
>
>     __builtin_store_lanes (VECTORS : array N of vector M of X)
>       returns array N*M of X
>       maps to vstN on ARM
>       in practice, the argument would be populated by assignments of the form:
>         ARRAY_REF <VECTORS, Y> = vectorY
>
>     __builtin_load_lane (REF : array N of X,
>                          VECTORS : array N of vector M of X,
>                          LANE : integer)
>       returns array N of vector M of X
>       maps to vldN_lane on ARM
>
>     __builtin_store_lane (VECTORS : array N of vector M of X,
>                           LANE : integer)
>       returns array N of X
>       maps to vstN_lane on ARM
>
>     __builtin_load_dup (REF : array N of X)
>       returns array N of vector M of X
>       maps to vldN_dup on ARM
>
> I've hacked up a prototype of this and it seems to produce good code.
> What do you think?

How do you expect these to be used?  That is, would you ever expect
components of those large vectors/arrays to be used in operations like
add, or does the HW provide vector-lane variants for those as well?
Thus, will

  for (i=0; i<N; ++i)
    X[i] = Y[i] + Z[i];

result in a single add per vector lane load, or a single vector lane
load for M "unrolled" instances of (small) vector adds?  If the latter,
then we have to think about indexing the vector lanes as well as
allowing partial stores (or have a vector-lane construct operation).

Representing vector lanes as automatic memory (with array of vector
type) makes things easy, but eventually not very efficient.

I had new tree/stmt codes for array loads/stores for middle-end arrays.
Eventually the vector lane support can at least walk in the same
direction that middle-end arrays would ;)

Richard.

> Richard
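To make the two alternatives concrete: the second lowering (one lane
load feeding several small vector adds) is what ARM's existing NEON
intrinsics express by hand today.  A short example using the real
arm_neon.h API; the function itself is made up for illustration:

  #include <arm_neon.h>

  /* Add 8 interleaved RGB pixels from y and z, channel by channel:
     one vld3 per input, three ordinary vector adds, one vst3.  */
  void add_rgb8 (uint8_t *x, const uint8_t *y, const uint8_t *z)
  {
    uint8x8x3_t a = vld3_u8 (y);   /* de-interleave into R, G, B vectors */
    uint8x8x3_t b = vld3_u8 (z);
    uint8x8x3_t c;
    c.val[0] = vadd_u8 (a.val[0], b.val[0]);
    c.val[1] = vadd_u8 (a.val[1], b.val[1]);
    c.val[2] = vadd_u8 (a.val[2], b.val[2]);
    vst3_u8 (x, c);                /* re-interleave and store */
  }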