https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
--- Comment #34 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can confirm this observation on Zen2. Note perf still records STLF failures
for these cases; it just seems that the penalties are well hidden by the high
store load on the caller side for small NUM? I'm not sure how well CPUs handle
OOO execution across calls here, but I'm guessing that for cray there are only
dependent instructions on the STLF-failing loads, while for your test the
result is always stored to memory. Not sure if there's a good way to add in a
"serializing" instruction instead of a store - if we'd write vector code
directly we'd return an accumulation vector and pass that as input to further
iterations, but that makes it difficult to compare to the scalar variant.

If we look at double NUM2

  scalar: 2.67811
  vec:    2.46635: no!!
  vecn:   2.22982

then we do see a slight penalty for the case with successful STLF, but I
suspect the main load of the test is the 9 vector stores in the caller. What's
odd though is NUM4

  scalar: 3.19169
  vec:    8.2489: penalty
  vecn:   2.25086

where we still have the "same" assembly in foo, just using %ymm instead of
%xmm.

I'll also note that foo2n vs. foo2 access stores of different distance:

  void __attribute__ ((noipa))
  foo (TYPE* x, TYPE* y, TYPE* __restrict p)
  {
    p[0] = x[0] + y[0];
    p[1] = x[1] + y[1];
  }

vs.

  void __attribute__ ((noipa))
  foo (TYPE* x, TYPE* y, TYPE* __restrict p)
  {
    p[0] = x[15] + y[15];
    p[1] = x[16] + y[16];
  }

Shouldn't the former access x[14] and x[15]?

Also on Zen2, using 512-bit vector stores in main() causes them to be
decomposed to 128-bit vector stores - not in generic vector lowering, which
should choose 256-bit vector stores, but during RTL expansion. So we have to
avoid this, otherwise the vecn cases with larger vector sizes will fail to
STLF as well.
With the two possible issues resolved I get

  char
    NUM2   scalar: 2.61746  vec: 6.99399  vecn: 2.17881
    NUM4   scalar: 3.04455  vec: 5.6571   vecn: 2.17512
    NUM8   scalar: 3.99576  vec: 5.64829  vecn: 2.18647
    NUM16  scalar: 5.71159  vec: 5.70879  vecn: 2.222
  short
    NUM2   scalar: 2.63836  vec: 5.92917  vecn: 2.22295
    NUM4   scalar: 3.07966  vec: 5.93041  vecn: 2.22694
    NUM8   scalar: 4.14134  vec: 6.16279  vecn: 2.29287
    NUM16  scalar: 5.96713  vec: 5.91371  vecn: 2.29854
  int
    NUM2   scalar: 2.74058  vec: 2.51288  vecn: 2.28018
    NUM4   scalar: 3.22811  vec: 2.53454  vecn: 2.30637
    NUM8   scalar: 4.14464  vec: 6.84145  vecn: 2.30211
    NUM16  scalar: 5.97653  vec: 7.28825  vecn: 2.52693
  int64_t
    NUM2   scalar: 2.75497  vec: 2.51353  vecn: 2.29852
    NUM4   scalar: 3.20552  vec: 8.02914  vecn: 2.28612
    NUM8   scalar: 4.1486   vec: 8.40673  vecn: 2.54104
    NUM16  scalar: 5.96569  vec: 8.03334  vecn: 2.98774
  float
    NUM2   scalar: 2.74666  vec: 2.53057  vecn: 2.29079
    NUM4   scalar: 3.22499  vec: 2.52525  vecn: 2.29374
    NUM8   scalar: 4.12471  vec: 7.33367  vecn: 2.30114
    NUM16  scalar: 6.27016  vec: 7.78154  vecn: 2.53966
  double
    NUM2   scalar: 2.76049  vec: 2.52339  vecn: 2.31286
    NUM4   scalar: 3.25052  vec: 8.09372  vecn: 2.31465
    NUM8   scalar: 4.19226  vec: 8.90108  vecn: 2.56059
    NUM16  scalar: 6.32366  vec: 8.22693  vecn: 3.00417

Note Zen2 has comparatively few entries in the store queue: 22 when SMT is
enabled (the 44 are statically partitioned). What I take away from this is
that modern OOO archs do not benefit much from short sequences of low-lane
vectorized code (here in particular NUM2), since there's a good chance there
are enough resources to carry out the scalar variant in parallel.