https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #7)
> FWIW, Peter Cordes provides an overview of available approaches for
> extraction depending on vector length and ISA extensions (up to AVX2, not
> including AVX-512) in this StackOverflow answer:
> https://stackoverflow.com/a/51414330/4755075
> 
> TL;DR: generally through store+load; possible alternatives:
>  128b:
>   SSSE3: pshufb          (1-byte elements)
>   SSSE3: imul+add+pshufb (any element size)
>   AVX: vpermilp[sd] (4 or 8-byte elements)
>  256b:
>   AVX2: vpermps (4-byte elements)
> 
> In all cases a (v)movd is needed to move the index to a vector register, and
> potentially another (v)movd if the result is needed in a general register.
> 
> The basic store+load tactic may look worse latency-wise, but can be better
> throughput-wise (especially with multiple extractions from the same vector,
> as then the store needs to be done just once, as Peter mentioned).
> 
> Why in RTL it is important to do this without referencing the stack?

For extraction it isn't absolutely required to do this w/o the stack
since the spill would cover the whole vector and the reads can
usually be handled with store-forwarding in the CPUs.  So here the
choice can be based fully on cost.
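
As a rough source-level sketch of that store+load pattern (the helper
name extract_lane and the v4sf typedef are only illustrative, and the
code a compiler actually emits for it may differ):

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang vector extension: four floats in a 16-byte vector. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: extract lane idx (0..3) through memory. */
static float extract_lane(v4sf v, unsigned idx)
{
    float tmp[4];

    assert(idx < 4);
    memcpy(tmp, &v, sizeof tmp);  /* one whole-vector store (the "spill") */
    return tmp[idx];              /* narrow load from inside the stored bytes */
}
```

The narrow load reads bytes that all come from the single preceding
wide store, which is the situation store-forwarding usually handles.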

The insert case is instead very bad here, with a whole-vector store
followed by an element store and then a whole-vector load.  This
sequence will usually cause at least additional latency or, worse,
a recovery from a failed store-forward.
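
A source-level sketch of that insert sequence, for illustration only
(insert_lane and v4sf are hypothetical names, and a compiler is free
to generate something else for this code):

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang vector extension: four floats in a 16-byte vector. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: replace lane idx (0..3) of v with x via memory. */
static v4sf insert_lane(v4sf v, unsigned idx, float x)
{
    float tmp[4];

    assert(idx < 4);
    memcpy(tmp, &v, sizeof tmp);  /* whole-vector store */
    tmp[idx] = x;                 /* element store into the same slot */
    memcpy(&v, tmp, sizeof v);    /* whole-vector load */
    return v;
}
```

The final wide load needs bytes from two different earlier stores, so
on many CPUs it cannot be satisfied from a single store-buffer entry.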

Note that currently RTL expansion forces a local vector typed variable
to the stack (instead of allocating a pseudo) when there are
variable-index accesses to it.  That might be a reason to also handle
slightly "expensive" extract cases.  But I guess later falling back
to a stack slot via a splitter or LRA will lead to worse code.
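
For reference, the kind of source in question looks roughly like the
sketch below (pick_sum and v4si are made-up names; it uses the GNU
vector extension's subscript operator with run-time indices):

```c
/* GCC/Clang vector extension: four ints in a 16-byte vector. */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical example; i and j must be in the range 0..3. */
static int pick_sum(v4si a, v4si b, unsigned i, unsigned j)
{
    v4si t = a + b;      /* local vector-typed variable */
    return t[i] + t[j];  /* two variable-index reads of the same vector */
}
```

Both reads index the same local vector, so a single spill of t could
serve both of them.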
