slp-gap-1.c FAILs

ams at gcc dot gnu.org via Gcc-bugs Mon, 03 Jun 2024 07:11:53 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304


--- Comment #11 from Andrew Stubbs <ams at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #10)
> On Mon, 3 Jun 2024, ams at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304
> > 
> > --- Comment #9 from Andrew Stubbs <ams at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #6)
> > > The best strategy for GCN would be to gather V4QImode aka SImode into the
> > > V64QImode (or V16SImode) vector.  For pix2 we have a gap of 28 elements,
> > > doing consecutive loads isn't a good strategy here.
> > 
> > I don't fully understand what you're trying to say here, so apologies if you
> > knew all this already and I missed the point.....
> > 
> > In general, on GCN V4QImode is not in any way equivalent to SImode (when the
> > values are in registers). The vector registers are not one single string of
> > re-interpretable bits.
> > 
> > For the same reason, you can't load a value as V64QImode and then try to
> > interpret it as V16SImode. GCN vector registers just don't work like
> > SSE/Neon/etc.
> > 
> > > When you load a V64QImode vector, each lane is extended to 32 bits, so what
> > > you actually get in hardware is a V64SImode vector.
> > 
> > > Likewise, when you load a V4QImode vector the hardware representation is
> > > actually V4SImode (which in itself is just V64SImode with undefined values
> > > in the unused lanes).
> 
> I see.  I wonder if there's not one or two latent wrong-code bugs because of
> this and the vectorizer's assumptions ;)  I suppose modes_tieable_p
> will tell us whether a VIEW_CONVERT_EXPR will do the right thing?
> Is GET_MODE_SIZE (V64QImode) == GET_MODE_SIZE (V64SImode) btw?
> And V64QImode really V64PSImode?

The mode size says how big it will be when written to memory, so no, they're not
the same. I believe this matches the scalar QImode behaviour.
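
To illustrate the difference between the in-memory size and the in-register
layout, here is a rough C model. It is only a sketch: the vreg_model type and
the function names are made up for this example and are not GCC or GCN
interfaces, and zero-extension is an assumption chosen for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 64

/* Hypothetical model of one vector register: 64 lanes of 32 bits each. */
typedef struct { uint32_t lane[LANES]; } vreg_model;

/* Model of a V64QImode load: 64 bytes in memory, one byte per lane,
   each widened to fill its 32-bit lane (zero-extension assumed here). */
static vreg_model load_v64qi(const uint8_t *mem)
{
    vreg_model r;
    assert(mem != NULL);
    for (size_t i = 0; i < LANES; i++)
        r.lane[i] = mem[i];
    return r;
}

/* Model of a V64QImode store: only the low byte of each lane reaches
   memory, so the stored size is 64 bytes, not 256. */
static void store_v64qi(uint8_t *mem, const vreg_model *r)
{
    assert(mem != NULL && r != NULL);
    for (size_t i = 0; i < LANES; i++)
        mem[i] = (uint8_t)r->lane[i];
}

/* Viewing four QImode lanes as one SImode value is real work in this
   model: the bytes live in four different lanes and must be packed. */
static uint32_t pack_first_four_lanes(const vreg_model *r)
{
    assert(r != NULL);
    return (r->lane[0] & 0xffu)
         | ((r->lane[1] & 0xffu) << 8)
         | ((r->lane[2] & 0xffu) << 16)
         | ((r->lane[3] & 0xffu) << 24);
}
```

In this model a V4QImode value occupies four lanes while an SImode value
occupies one, so going from one to the other needs an explicit lane-crossing
operation rather than a free reinterpretation of the register bits.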

We don't use any PSI modes. There are (some) machine instructions for V64QImode
(and V64HImode) so we don't want to lose that information.

There may well be some bugs, but we have handling for conversions in a number
of places. There are truncate and extend patterns that operate lane-wise, and
vec_extract can take a subset of a vector, IIRC.

> Still for a V64QImode load on { c[0], c[1], c[2], c[3], c[32], c[33], 
> c[34], c[35], ... } it's probably best to use a single V64QImode gather 
> with GCN then rather than four "consecutive" V64QImode loads and then
> element swizzling.

Fewer loads are always better, and permutations are expensive operations. (They
also don't work with 64-lane vectors on RDNA devices, because those are
actually two 32-lane vectors stuck together.) So it can certainly make sense to
use gather with a vector of permuted offsets, although it can be expensive to
generate that vector in the first place.
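
As a rough illustration of the "vector of permuted offsets" idea for the
{ c[0], c[1], c[2], c[3], c[32], ... } access pattern, the sketch below builds
the per-lane offsets and emulates the gather with a scalar loop. The names
make_gap_offsets and gather_bytes are hypothetical, and the group size of 4 and
stride of 32 are taken from the access pattern quoted above; this is not how
the vectorizer or the GCN backend actually emits it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES  64
#define GROUP  4   /* consecutive elements used per group */
#define STRIDE 32  /* distance between group starts: 4 used + 28 gap */

/* Build the offset for each lane: lane i reads element
   (i / GROUP) * STRIDE + (i % GROUP), i.e. 0,1,2,3,32,33,34,35,64,... */
static void make_gap_offsets(uint32_t offsets[LANES])
{
    for (size_t i = 0; i < LANES; i++)
        offsets[i] = (uint32_t)((i / GROUP) * STRIDE + (i % GROUP));
}

/* Scalar emulation of a byte gather: every lane loads from its own
   offset and the byte is widened into a 32-bit lane. */
static void gather_bytes(uint32_t dst[LANES], const uint8_t *base, size_t n,
                         const uint32_t offsets[LANES])
{
    for (size_t i = 0; i < LANES; i++) {
        assert(offsets[i] < n);  /* stay inside the source array */
        dst[i] = base[offsets[i]];
    }
}
```

With these constants the last lane reads element 483, so a single gather spans
the same range that would otherwise need several consecutive loads plus
permutes. The offset vector itself is loop-invariant in this sketch, which is
where the generation cost could be amortised.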

[Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs