Richard Guenther <richard.guent...@gmail.com> writes:
> On Mon, Apr 18, 2011 at 1:24 PM, Richard Sandiford
> <richard.sandif...@linaro.org> wrote:
>> Richard Guenther <richard.guent...@gmail.com> writes:
>>> On Tue, Apr 12, 2011 at 3:59 PM, Richard Sandiford
>>> <richard.sandif...@linaro.org> wrote:
>>>> Index: gcc/doc/md.texi
>>>> ===================================================================
>>>> --- gcc/doc/md.texi 2011-04-12 12:16:46.000000000 +0100
>>>> +++ gcc/doc/md.texi 2011-04-12 14:48:28.000000000 +0100
>>>> @@ -3846,6 +3846,48 @@ into consecutive memory locations. Oper
>>>> consecutive memory locations, operand 1 is the first register, and
>>>> operand 2 is a constant: the number of consecutive registers.
>>>>
>>>> +@cindex @code{vec_load_lanes@var{m}@var{n}} instruction pattern
>>>> +@item @samp{vec_load_lanes@var{m}@var{n}}
>>>> +Perform an interleaved load of several vectors from memory operand 1
>>>> +into register operand 0. Both operands have mode @var{m}. The register
>>>> +operand is viewed as holding consecutive vectors of mode @var{n},
>>>> +while the memory operand is a flat array that contains the same number
>>>> +of elements. The operation is equivalent to:
>>>> +
>>>> +@smallexample
>>>> +int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
>>>> +for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
>>>> +  for (i = 0; i < c; i++)
>>>> +    operand0[i][j] = operand1[j * c + i];
>>>> +@end smallexample
>>>> +
>>>> +For example, @samp{vec_load_lanestiv4hi} loads 8 16-bit values
>>>> +from memory into a register of mode @samp{TI}@. The register
>>>> +contains two consecutive vectors of mode @samp{V4HI}@.
>>>
>>> So vec_load_lanestiv2qi would load ... ? c == 8 here. Intuitively
>>> such operation would have adjacent blocks of siv2qi memory. But
>>> maybe you want to constrain the mode size to GET_MODE_SIZE (@var{n})
>>> * GET_MODE_NUNITS (@var{n})? In which case the mode m is
>>> redundant? You could specify that we load NUNITS adjacent vectors into
>>> an integer mode of appropriate size.
>>
>> Like you say, vec_load_lanestiv2qi would load 16 QImode elements into
>> 8 consecutive V2QI registers. The first element from register vector I
>> would come from operand1[I] and the second element would come from
>> operand1[I + 8]. That's meant to be a valid combination.
>
> Ok, but the C loop from the example doesn't seem to match. Or I couldn't
> wrap my head around it despite looking for 5 minutes and already having
> coffee ;) I would have expected the vectors being in memory as
>
> v0[0], v1[0], v0[1], v1[1], v2[0], v3[0], v2[1], v3[1], ...
>
> not
>
> v0[0], v1[0], v2[0], ...
>
> as I would have thought the former is more useful (simple unrolling for
> stride 2).

The second one's right. All lane 0 elements, followed by all lane 1
elements, etc. I think that's what the C loop says.

> We'd need a separate set of optabs for such an interleaving
> scheme? In which case we might want to come up with a more
> specific name than load_lane?

Yeah, if someone has a single instruction that does your first example,
then it would need a new optab. The individual vector pairs could be
represented using the current optab though, if each pair needs a separate
instruction. E.g. with your v2qi example, vec_load_lanessiv2qi would load:

    v0[0], v1[0], v0[1], v1[1]

and you could repeat for the others. So load_lanes (as defined here)
could be treated as a primitive, and your first example could be something
like "repeat_load_lanes".

If you don't like the name "load_lanes" though, I'm happy to use
something else.

Richard
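
For reference, here is a minimal sketch of the element movement discussed
in this thread, written as standalone C rather than GCC code. The helper
names load_lanes and repeat_load_lanes, the unsigned char element type
(standing in for QImode) and the back-to-back layout of the register
vectors are all assumptions made for illustration. Neither function exists
in GCC, and repeat_load_lanes only models the hypothetical optab name
floated above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a GCC function: models the element movement
   described for vec_load_lanes<m><n>.
     c      = GET_MODE_SIZE (m) / GET_MODE_SIZE (n), the number of vectors
     nunits = GET_MODE_NUNITS (n), the number of lanes per vector
     mem    = flat array of c * nunits elements (operand 1)
     regs   = c vectors stored back to back (operand 0), so that
              regs[i * nunits + j] is lane j of vector i (assumed layout).  */
static void
load_lanes (unsigned char *regs, const unsigned char *mem,
            size_t c, size_t nunits)
{
  assert (c > 0 && nunits > 0);
  for (size_t j = 0; j < nunits; j++)
    for (size_t i = 0; i < c; i++)
      regs[i * nunits + j] = mem[j * c + i];
}

/* Hypothetical "repeat_load_lanes": apply load_lanes to `groups` adjacent
   blocks of memory, each block filling its own group of c vectors.  */
static void
repeat_load_lanes (unsigned char *regs, const unsigned char *mem,
                   size_t groups, size_t c, size_t nunits)
{
  for (size_t g = 0; g < groups; g++)
    load_lanes (regs + g * c * nunits, mem + g * c * nunits, c, nunits);
}
```

With c = 8 and nunits = 2 (the vec_load_lanestiv2qi case), lane 0 of
vector I is read from mem[I] and lane 1 from mem[I + 8]. Calling
repeat_load_lanes with c = 2 and nunits = 2 reads memory in the order
v0[0], v1[0], v0[1], v1[1], v2[0], v3[0], v2[1], v3[1], and so on, which
is the stride-2 layout from the first example.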