https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #34 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can confirm this observation on Zen2.  Note that perf still records STLF
failures for these cases; it just seems that the penalties are well hidden
by the high store load on the caller side for small NUM.

I'm not sure how well CPUs handle OOO execution across calls here, but I'm
guessing that for cray only dependent instructions consume the STLF-failing
loads, while for your test the result is always stored to memory.
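A minimal sketch of the failing pattern (the function name is made up; the bug's actual harness differs): narrow scalar stores immediately followed by a wider vector load of the same bytes, which no single store-queue entry can forward.

```c
#include <string.h>

typedef double v2df __attribute__ ((vector_size (16)));

/* Two 8-byte scalar stores followed by a 16-byte load of the same
   bytes: the load overlaps two store-queue entries, so store-to-load
   forwarding fails and the load waits for both stores to retire.  */
__attribute__ ((noipa)) double
add_pair (const double *x, const double *y, double *restrict p)
{
  p[0] = x[0] + y[0];         /* 8-byte store */
  p[1] = x[1] + y[1];         /* 8-byte store */
  v2df v;
  memcpy (&v, p, sizeof v);   /* 16-byte vector load -> STLF failure */
  return v[0] + v[1];
}
```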

I'm not sure there's a good way to add a "serializing" instruction
instead of a store.  If we wrote vector code directly we'd return
an accumulation vector and pass that as input to further iterations,
but that makes it difficult to compare to the scalar variant.
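Such an accumulator-passing variant might look like the following sketch (foo_acc is a hypothetical name, not from the testcase):

```c
typedef double v2df __attribute__ ((vector_size (16)));

/* Keep the running sum in a vector register instead of storing it:
   the result is returned in a register and fed back in as the next
   iteration's input, so no store/reload round-trip occurs.  */
__attribute__ ((noipa)) v2df
foo_acc (const double *x, const double *y, v2df acc)
{
  v2df vx = { x[0], x[1] };
  v2df vy = { y[0], y[1] };
  return acc + vx + vy;
}
```

The drawback mentioned above: the scalar variant has no natural equivalent of the live accumulation vector, so the two are no longer directly comparable.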

If we look at

double
NUM2
scalar: 2.67811
   vec: 2.46635: no!!
  vecn: 2.22982

then we do see a slight penalty relative to the case with successful STLF,
but I suspect the main load of the test is the 9 vector stores in the caller.

What's odd though is

NUM4
scalar: 3.19169
   vec: 8.2489: penalty
  vecn: 2.25086

we still have the "same" assembly in foo, just using %ymm instead of %xmm.


I'll also note that foo2n and foo2 access stores at different distances:

void
__attribute__ ((noipa))
foo (TYPE* x, TYPE* y, TYPE* __restrict p)
{
  p[0] = x[0] + y[0];
  p[1] = x[1] + y[1];
}

vs.

void
__attribute__ ((noipa))
foo (TYPE* x, TYPE* y, TYPE* __restrict p)
{
  p[0] = x[15] + y[15];
  p[1] = x[16] + y[16];
}      

Shouldn't the former access x[14] and x[15]?  Also, on Zen2, using
512-bit vector stores in main() causes them to be decomposed into
128-bit vector stores; this happens not in generic vector lowering,
which should choose 256-bit vector stores, but during RTL expansion.
We have to avoid this, otherwise the vecn cases with larger vector
sizes will fail to STLF as well.
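For STLF to have a chance at the larger widths, the caller's store and the callee's load must match in size and address.  A sketch of the matched 256-bit case (illustrative names; whether a single 32-byte store is actually emitted depends on -march/-mtune, which is exactly the RTL-expansion issue described above):

```c
#include <string.h>

typedef double v4df __attribute__ ((vector_size (32)));

/* Ideally one 32-byte store; if RTL expansion splits it into two
   16-byte stores, the matching 32-byte load below fails to forward.  */
__attribute__ ((noipa)) void
store256 (double *p, v4df v)
{
  memcpy (p, &v, sizeof v);
}

/* One 32-byte load of the just-stored bytes, same address and size.  */
__attribute__ ((noipa)) double
load256 (const double *p)
{
  v4df v;
  memcpy (&v, p, sizeof v);
  return v[0] + v[1] + v[2] + v[3];
}
```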

With the two possible issues resolved I get

char
NUM2
scalar: 2.61746
   vec: 6.99399
  vecn: 2.17881
NUM4
scalar: 3.04455
   vec: 5.6571
  vecn: 2.17512
NUM8
scalar: 3.99576
   vec: 5.64829
  vecn: 2.18647
NUM16
scalar: 5.71159
   vec: 5.70879
  vecn: 2.222
short
NUM2
scalar: 2.63836
   vec: 5.92917
  vecn: 2.22295
NUM4
scalar: 3.07966
   vec: 5.93041
  vecn: 2.22694
NUM8
scalar: 4.14134
   vec: 6.16279
  vecn: 2.29287
NUM16
scalar: 5.96713
   vec: 5.91371
  vecn: 2.29854
int
NUM2
scalar: 2.74058
   vec: 2.51288
  vecn: 2.28018
NUM4
scalar: 3.22811
   vec: 2.53454
  vecn: 2.30637
NUM8
scalar: 4.14464
   vec: 6.84145
  vecn: 2.30211
NUM16
scalar: 5.97653
   vec: 7.28825
  vecn: 2.52693
int64_t
NUM2
scalar: 2.75497
   vec: 2.51353
  vecn: 2.29852
NUM4
scalar: 3.20552
   vec: 8.02914
  vecn: 2.28612
NUM8
scalar: 4.1486
   vec: 8.40673
  vecn: 2.54104
NUM16
scalar: 5.96569
   vec: 8.03334
  vecn: 2.98774
float
NUM2
scalar: 2.74666
   vec: 2.53057
  vecn: 2.29079
NUM4
scalar: 3.22499
   vec: 2.52525
  vecn: 2.29374
NUM8
scalar: 4.12471
   vec: 7.33367
  vecn: 2.30114
NUM16
scalar: 6.27016
   vec: 7.78154
  vecn: 2.53966
double
NUM2
scalar: 2.76049
   vec: 2.52339
  vecn: 2.31286
NUM4
scalar: 3.25052
   vec: 8.09372
  vecn: 2.31465
NUM8
scalar: 4.19226
   vec: 8.90108
  vecn: 2.56059
NUM16
scalar: 6.32366
   vec: 8.22693
  vecn: 3.00417

Note Zen2 has comparatively few entries in the store queue: 22 per thread
when SMT is enabled (the 44 entries are statically partitioned).

What I take away from this is that modern OOO archs do not benefit much
from short sequences of low-lane vectorized code (here in particular
NUM2) since there's a good chance there are enough resources to carry
out the scalar variant in parallel.