https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103
--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
We should not put any stock in what ICC does for GNU C native vector indexing.  I think it doesn't know how to optimize that, because it *always* spills/reloads, even for `vec[0]` which could be a no-op.  And it's always a full-width spill (ZMM), not just the low XMM/YMM part that contains the desired element.  I mainly mentioned ICC in my initial post to suggest the store/reload strategy in general as an *option*.

ICC also doesn't optimize intrinsics: it pretty much always faithfully transliterates them to asm.  e.g. doing v = _mm_add_epi32(v, _mm_set1_epi32(1)); twice compiles to two separate paddd instructions, instead of one paddd with a set1(2) constant.

If we want to see ICC's strided-store strategy, we'd need to write some pure C that auto-vectorizes.

----

That said, store/reload is certainly a valid option when we want all the elements, and it gets *more* attractive with wider vectors, where the one extra store amortizes over more elements.

Strided stores will typically bottleneck on cache/memory bandwidth unless the destination lines are already hot in L1d.  But if there's other work in the loop, we care about OoO exec of that work with the stores, so uop throughput could be a factor.

If we're tuning for Intel Haswell/Skylake, with 1 per clock shuffles but 2 loads + 1 store per clock throughput (if we avoid indexed addressing modes for stores), then store/reload is very attractive and unlikely to be a bottleneck.  There are typically spare load execution-unit cycles in a loop that's also doing stores + other work.  You'd need every other uop to be (or include) a load to bottleneck on that at 4 uops per clock, unless you have indexed stores (which can't run on the simple store-AGU on port 7 and need to run on port 2/3, taking a cycle from a load).  Cache-split loads do get replayed to grab the 2nd half, so they cost extra execution-unit pressure as well as extra cache-read cycles.

Intel says Ice Lake will have 2 load + 2 store pipes, and a 2nd shuffle unit.  A mixed strategy there might be interesting: extract the high 256 bits to memory with vextractf32x8 and reload it, but shuffle the low 128/256 bits.

That strategy might be good on earlier CPUs, too, at least with movss + extractps stores from the low XMM, where we can do that directly.

AMD before Zen 2 has only 2 AGUs, so only 2 memory ops per clock, up to one of which can be a store.  There it's definitely worth considering extracting the high 128-bit half of a YMM and then using movss plus shuffles like vextractps (2 uops on Ryzen and other AMD CPUs) to store the elements.

-----

If the stride is small enough (so more than 1 element fits in a vector), we should consider shuffle + vmaskmovps masked stores, or with AVX512, AVX512 masked stores.

But for larger strides, AVX512 scatter may get better in the future.  It's currently (SKX) 43 uops for VSCATTERDPS or ...DD ZMM, so not very friendly to surrounding code.  It sustains one per 17 clocks, slightly worse than 1 element stored per clock cycle.  Same throughput on KNL, but only 4 uops, so it can overlap much better with surrounding code.

----

For qword elements, we have efficient stores of the high or low half of an XMM: a MOVHPS store doesn't need a shuffle uop on most Intel CPUs.  So we only need 1 (YMM) or 3 (ZMM) shuffles to get each of the high 128-bit lanes down to an XMM register.

Unfortunately on Ryzen, MOVHPS [mem], xmm costs a shuffle + store.  But Ryzen has shuffle EUs on multiple ports.
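
----

To make the intrinsics point above concrete, here's a minimal sketch of the pattern ICC transliterates (function name is just for illustration):

#include <immintrin.h>

// A compiler that only transliterates intrinsics emits two separate paddd
// instructions here; folding the two adds gives one paddd with a set1(2)
// constant.
__m128i add_two_ones(__m128i v)
{
    v = _mm_add_epi32(v, _mm_set1_epi32(1));
    v = _mm_add_epi32(v, _mm_set1_epi32(1));
    return v;   // ideally a single paddd with {2,2,2,2}
}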
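For the "pure C that auto-vectorizes" idea, something like the loop below is the kind of code one could feed to ICC to see its strided-store strategy (names and the stride of 4 are arbitrary choices for illustration):

// Hypothetical strided-store loop: contiguous loads, stride-4 stores.
void strided_store(float *restrict dst, const float *restrict src, int n)
{
    for (int i = 0; i < n; i++)
        dst[4*i] = src[i] + 1.0f;
}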
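A rough, untuned sketch of the store/reload option for a strided store from a YMM (names are mine):

#include <immintrin.h>

// One full-width vector store, then scalar reloads + strided scalar stores.
// The single extra vector store amortizes over more elements as vectors
// get wider.
static inline void scatter_ymm_via_spill(float *dst, long stride, __m256 v)
{
    float tmp[8];
    _mm256_storeu_ps(tmp, v);        // the one vector store
    for (int i = 0; i < 8; i++)
        dst[i * stride] = tmp[i];    // 8 scalar loads + 8 scalar stores
}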
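The mixed strategy could look something like the following; shown for a YMM for brevity (for a ZMM the high 256 bits would be spilled with vextractf32x8 instead), and the shuffle choices for the low XMM are just one possibility:

#include <immintrin.h>

// Spill + reload only the high half; get the low XMM's elements out with
// movss plus shuffles.
static inline void mixed_scatter_ymm(float *dst, long stride, __m256 v)
{
    float hi[4];
    _mm_storeu_ps(hi, _mm256_extractf128_ps(v, 1));   // spill high 128 bits
    for (int i = 0; i < 4; i++)
        dst[(4 + i) * stride] = hi[i];                // scalar reloads

    __m128 lo = _mm256_castps256_ps128(v);            // low 128, no instruction
    _mm_store_ss(dst + 0*stride, lo);                             // movss
    _mm_store_ss(dst + 1*stride, _mm_movehdup_ps(lo));            // element 1
    _mm_store_ss(dst + 2*stride, _mm_movehl_ps(lo, lo));          // element 2
    _mm_store_ss(dst + 3*stride, _mm_shuffle_ps(lo, lo, _MM_SHUFFLE(3,3,3,3)));
}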
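For the small-stride case, the shuffle + vmaskmovps idea might look like this for a stride of 2 floats (mask and shuffle control are illustrative, not tuned):

#include <immintrin.h>

// Spread 4 source elements into the even lanes of a YMM, then store
// through a mask so the odd destination elements are left untouched.
static inline void store4_stride2(float *dst, __m256 v)
{
    __m256i idx    = _mm256_setr_epi32(0, 0, 1, 0, 2, 0, 3, 0);   // even lanes get v[0..3]
    __m256  spread = _mm256_permutevar8x32_ps(v, idx);            // AVX2 vpermps
    __m256i mask   = _mm256_setr_epi32(-1, 0, -1, 0, -1, 0, -1, 0);
    _mm256_maskstore_ps(dst, mask, spread);                       // vmaskmovps
}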
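For comparison, the AVX512 scatter version is trivially short in source even though it's many uops on SKX (the index/scale choice below is just one way to write it):

#include <immintrin.h>

// vscatterdps: 16 float stores from one instruction.  Indices are element
// offsets, scaled by 4 bytes.
static inline void scatter_zmm(float *dst, int stride, __m512 v)
{
    __m512i lane = _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m512i idx  = _mm512_mullo_epi32(lane, _mm512_set1_epi32(stride));
    _mm512_i32scatter_ps(dst, idx, v, 4);
}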
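And for the qword-element case, a sketch of what the MOVHPS point buys us for a YMM of doubles (one shuffle uop total on most Intel CPUs; function name is illustrative):

#include <immintrin.h>

// Store each 64-bit half of each 128-bit lane directly; only the
// vextractf128 needs a shuffle uop on most Intel CPUs.
static inline void scatter_ymm_pd(double *dst, long stride, __m256d v)
{
    __m128d lo = _mm256_castpd256_pd128(v);      // no instruction needed
    __m128d hi = _mm256_extractf128_pd(v, 1);    // the one shuffle
    _mm_storel_pd(dst + 0*stride, lo);           // movlpd / movsd store
    _mm_storeh_pd(dst + 1*stride, lo);           // movhpd store
    _mm_storel_pd(dst + 2*stride, hi);
    _mm_storeh_pd(dst + 3*stride, hi);
}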