https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #7)
> FWIW, Peter Cordes provides an overview of available approaches for
> extraction depending on vector length and ISA extensions (up to AVX2, not
> including AVX-512) in this StackOverflow answer:
> https://stackoverflow.com/a/51414330/4755075
>
> TL;DR: generally through store+load; possible alternatives:
> 128b:
>   SSSE3: pshufb (1-byte elements)
>   SSSE3: imul+add+pshufb (any element size)
>   AVX: vpermilp[sd] (4 or 8-byte elements)
> 256b:
>   AVX2: vpermps (4-byte elements)
>
> In all cases a (v)movd is needed to move the index to a vector register, and
> potentially another (v)movd if the result is needed in a general register.
>
> The basic store+load tactic may look worse latency-wise, but can be better
> throughput-wise (especially with multiple extractions from the same vector,
> as then the store needs to be done just once, as Peter mentioned).
>
> Why in RTL it is important to do this without referencing the stack?

For extraction it isn't absolutely required to do this w/o the stack, since
the spill covers the whole vector and the reads can usually be handled by
store-forwarding in the CPU. So here this can be fully based on cost.

The insert case is instead very bad here: a whole-vector store followed by
an element store and then a whole-vector load. This sequence will usually
cause at least additional latency or, worse, recovery from a failed
store-forward.

Note that currently RTL expansion forces a local vector-typed variable to
the stack (instead of allocating a pseudo) when there are variable-index
accesses to it. That might be a reason to also handle slightly "expensive"
extract cases. But I guess later falling back to a stack slot via a
splitter or LRA will lead to worse code.
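To make the two memory-access patterns concrete, here is a minimal C sketch of the store+load extract and the store, element store, load insert described above. It assumes the GCC/Clang `vector_size` extension; the names `v4si`, `extract_via_stack` and `insert_via_stack` are hypothetical and chosen only for illustration. This is not the code GCC's expander emits, and an optimizing compiler is free to replace the `memcpy` calls with something else entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang generic vector type: four 32-bit lanes in 16 bytes. */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Extraction: one whole-vector store, then one element-sized load.
   The load lies entirely inside the bytes just stored, which is the
   case store-forwarding can usually handle. */
static int32_t extract_via_stack(v4si v, unsigned idx)
{
    int32_t slot[4];
    assert(idx < 4);
    memcpy(slot, &v, sizeof slot);   /* whole-vector store */
    return slot[idx];                /* narrow load */
}

/* Insertion: whole-vector store, element store, whole-vector load.
   The final wide load needs bytes from two different stores, which is
   the pattern the comment calls out as bad for store-forwarding. */
static v4si insert_via_stack(v4si v, unsigned idx, int32_t x)
{
    int32_t slot[4];
    assert(idx < 4);
    memcpy(slot, &v, sizeof slot);   /* whole-vector store */
    slot[idx] = x;                   /* element store */
    memcpy(&v, slot, sizeof v);      /* whole-vector load */
    return v;
}
```

Several extractions from the same vector could share the one wide store, which is the throughput argument made in the quoted comment; the insert has no such amortization because each insert ends in a wide reload.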