https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
We should not put any stock in what ICC does for GNU C native vector indexing. 
I think it doesn't know how to optimize that because it *always* spills/reloads
even for `vec[0]` which could be a no-op.  And it's always a full-width spill
(ZMM), not just the low XMM/YMM part that contains the desired element.  I
mainly mentioned ICC in my initial post to suggest the store/reload strategy in
general as an *option*.

ICC also doesn't optimize intrinsics: it pretty much always faithfully
transliterates them to asm.  e.g. writing  v = _mm_add_epi32(v, _mm_set1_epi32(1));
twice compiles to two separate paddd instructions, instead of one paddd with a
set1(2) constant.
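
i.e. this pattern:

    #include <immintrin.h>

    __m128i add2(__m128i v)
    {
        v = _mm_add_epi32(v, _mm_set1_epi32(1));
        v = _mm_add_epi32(v, _mm_set1_epi32(1));
        return v;   /* a compiler that optimizes intrinsics folds this into one
                       paddd with a set1(2) constant; ICC keeps two paddd. */
    }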

If we want to see ICC's strided-store strategy, we'd need to write some pure C
that auto-vectorizes.
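
e.g. something like this (made-up example), and then look at what it does with
the store side:

    /* contiguous loads, strided stores: a candidate for auto-vectorization */
    void scale_to_strided(float *dst, const float *src, long n, long stride)
    {
        for (long i = 0; i < n; i++)
            dst[i * stride] = src[i] * 2.0f;
    }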

----

That said, store/reload is certainly a valid option when we want all the
elements, and gets *more* attractive with wider vectors, where the one extra
store amortizes over more elements.
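
In intrinsics terms, the store/reload strategy would be roughly this for the
8-float YMM case (untested sketch; the helper name is made up):

    #include <immintrin.h>

    /* Store one YMM of 8 floats to dst, dst+stride, ..., dst+7*stride
       via a full-width spill and scalar reloads. */
    static void store_strided_8(float *dst, __m256 v, long stride)
    {
        float tmp[8] __attribute__((aligned(32)));
        _mm256_store_ps(tmp, v);            /* the one extra 32-byte store */
        for (int i = 0; i < 8; i++)
            dst[i * stride] = tmp[i];       /* 8 scalar loads + 8 scalar stores */
    }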

Strided stores will typically bottleneck on cache/memory bandwidth unless the
destination lines are already hot in L1d.  But if there's other work in the
loop, we care about OoO exec of that work with the stores, so uop throughput
could be a factor.


If we're tuning for Intel Haswell/Skylake, with 1 shuffle per clock but 2 loads
+ 1 store per clock throughput (if we avoid indexed addressing modes for
stores), then store/reload is very attractive and unlikely to be a bottleneck.

There are typically spare load execution-unit cycles in a loop that's also doing
stores + other work.  You need every other uop to be (or include) a load to
bottleneck on that at 4 uops per clock, unless you have indexed stores (which
can't run on the simple store-AGU on port 7 and have to run on port 2/3, taking
a cycle from a load).   Cache-split loads do get replayed to grab the 2nd half,
so they cost extra execution-unit pressure as well as extra cache-read cycles.

Intel says Ice Lake will have 2 load + 2 store pipes, and a 2nd shuffle unit.  A
mixed strategy there might be interesting: extract the high 256 bits to memory
with vextractf32x8 and reload it, but shuffle the low 128/256 bits.  That
strategy might be good on earlier CPUs, too, at least with movss + extractps
stores from the low XMM, where we can do that directly.
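
A rough sketch of what I mean for the YMM float case (untested; the helper name
is made up):

    #include <immintrin.h>
    #include <string.h>

    /* Mixed strategy: shuffle/extract the low 128 bits straight to the strided
       destinations, spill only the high 128 bits and reload those as scalars. */
    static void store_strided_8_mixed(float *dst, __m256 v, long stride)
    {
        __m128 lo = _mm256_castps256_ps128(v);

        _mm_store_ss(dst, lo);                /* movss store: element 0 */
        int b1 = _mm_extract_ps(lo, 1);
        int b2 = _mm_extract_ps(lo, 2);
        int b3 = _mm_extract_ps(lo, 3);
        memcpy(dst + 1*stride, &b1, 4);       /* compilers can fold these into */
        memcpy(dst + 2*stride, &b2, 4);       /* extractps [mem], xmm, imm     */
        memcpy(dst + 3*stride, &b3, 4);

        float hi[4];
        _mm_storeu_ps(hi, _mm256_extractf128_ps(v, 1));   /* vextractf128 store */
        for (int i = 0; i < 4; i++)
            dst[(4 + i) * stride] = hi[i];                /* scalar reloads */
    }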

AMD before Ryzen 2 has only 2 AGUs, so only 2 memory ops per clock, up to one
of which can be a store.  It's definitely worth considering extracting the high
128-bit half of a YMM and then using movss plus shuffles, or extractps stores
(2 uops each on Ryzen and earlier AMD).


-----

If the stride is small enough (so more than 1 destination element falls within
one vector width), we should consider  shuffle + vmaskmovps  masked stores, or
AVX512 masked stores where AVX512 is available.
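
e.g. for float elements with stride 2: spread the 4 source elements to the even
lanes with one vpermps, then store through a mask (untested sketch, needs AVX2):

    #include <immintrin.h>

    /* Store 4 floats from src to dst[0], dst[2], dst[4], dst[6]. */
    static void store4_stride2(float *dst, __m128 src)
    {
        /* one shuffle: a _ b _ c _ d _ across a YMM */
        __m256i spread_idx = _mm256_setr_epi32(0, 0, 1, 0, 2, 0, 3, 0);
        __m256  spread     = _mm256_permutevar8x32_ps(
                                 _mm256_castps128_ps256(src), spread_idx);

        /* vmaskmovps: write only the even lanes, leave the odd lanes untouched */
        __m256i mask = _mm256_setr_epi32(-1, 0, -1, 0, -1, 0, -1, 0);
        _mm256_maskstore_ps(dst, mask, spread);
    }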

But for larger strides, AVX512 scatter may get better in the future.  It's
currently (SKX) 43 uops for VSCATTERDPS or ...DD ZMM, so not very friendly to
surrounding code.  It sustains one per 17 clock throughput, slightly worse than
1 element stored per clock cycle.  Same throughput on KNL, but only 4 uops so
it can overlap much better with surrounding code.
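
For reference, the scatter version of a strided store with a runtime stride is
just this (untested sketch, AVX512F):

    #include <immintrin.h>

    /* Store 16 floats from v to dst[0], dst[stride], ..., dst[15*stride]
       with one vscatterdps. */
    static void store_strided_16_scatter(float *dst, __m512 v, int stride)
    {
        __m512i lane = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                         8, 9, 10, 11, 12, 13, 14, 15);
        __m512i idx  = _mm512_mullo_epi32(lane, _mm512_set1_epi32(stride));
        _mm512_i32scatter_ps(dst, idx, v, 4);   /* scale = sizeof(float) */
    }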


----

For qword elements, we have efficient stores of the high or low half of an XMM.
 A MOVHPS store doesn't need a shuffle uop on most Intel CPUs.  So we only need
1 (YMM) or 3 (ZMM) shuffles to get each of the high 128-bit lanes down to an
XMM register.
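
For the 4-double YMM case that looks something like this (untested sketch):

    #include <immintrin.h>

    /* Store 4 doubles to dst[0], dst[stride], dst[2*stride], dst[3*stride]
       with a single shuffle uop (vextractf128); the movlps/movhps-style
       stores don't need a shuffle uop on most Intel CPUs. */
    static void store_strided_4_pd(double *dst, __m256d v, long stride)
    {
        __m128d lo = _mm256_castpd256_pd128(v);
        __m128d hi = _mm256_extractf128_pd(v, 1);   /* the one shuffle */

        _mm_storel_pd(dst + 0 * stride, lo);        /* low-half store */
        _mm_storeh_pd(dst + 1 * stride, lo);        /* movhps-style store */
        _mm_storel_pd(dst + 2 * stride, hi);
        _mm_storeh_pd(dst + 3 * stride, hi);
    }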

Unfortunately on Ryzen, MOVHPS [mem], xmm costs a shuffle+store.  But Ryzen has
shuffle EUs on multiple ports.
