https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65660
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Looks like we now vectorize using loop vectorization instead of basic-block
vectorization. The overhead might be noticeable. For example
./ggSpectrum.h:48:4: note: loop vectorized
-./ggSpectrum.h:49:18: note: basic block vectorized
-./ggSpectrum.h:49:18: note: basic block vectorized
-ggPathDielectricMaterial.cc:36:60: note: basic block vectorized
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
is all from ggPathDielectricMaterial.cc.
Not sure why we peel for alignment at all as bdver2 has
vec_align_load_cost == vec_unalign_load_cost == vec_store_cost (there isn't
any unaligned store cost, but IIRC an unaligned store consumes two store buffers,
thus aligning the stores might be profitable).
Btw, the loop in question is:
void Set(float d) {
  for (int i = 0; i < nComponents(); i++)
    data[i] = d;
}
where I can very well imagine that nComponents() is _not_ large enough to
warrant loop vectorization (data is an array of 8 floats). nComponents()
returns constant 8.
With bdver2 we now have
t.c:4:20: note: vectorization_factor = 4, niters = 8
t.c:4:20: note: === vect_update_slp_costs_according_to_vf ===
cost model: prologue peel iters set to vf/2.
cost model: epilogue peel iters set to vf/2 because peeling for alignment is
unknown.
t.c:4:20: note: Cost model analysis:
Vector inside of loop cost: 4
Vector prologue cost: 8
Vector epilogue cost: 0
Scalar iteration cost: 4
Scalar outside cost: 0
Vector outside cost: 8
prologue iterations: 2
epilogue iterations: 2
Calculated minimum iters for profitability: 2
t.c:4:20: note: Runtime profitability threshold = 3
t.c:4:20: note: Static estimate profitability threshold = 3
t.c:4:20: note: epilog loop required
while generic has
t.c:4:20: note: vectorization_factor = 4, niters = 8
t.c:4:20: note: === vect_update_slp_costs_according_to_vf ===
cost model: prologue peel iters set to vf/2.
cost model: epilogue peel iters set to vf/2 because peeling for alignment is
unknown.
t.c:4:20: note: Cost model analysis:
Vector inside of loop cost: 1
Vector prologue cost: 11
Vector epilogue cost: 2
Scalar iteration cost: 1
Scalar outside cost: 0
Vector outside cost: 13
prologue iterations: 2
epilogue iterations: 2
Calculated minimum iters for profitability: 17
t.c:4:20: note: Runtime profitability threshold = 16
t.c:4:20: note: Static estimate profitability threshold = 16
t.c:4:20: note: not vectorized: vectorization not profitable.
somehow the prologue cost looks off for bdver2.
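To make the two dumps easier to compare, here is a minimal sketch of the break-even arithmetic. The formula is an assumption, chosen because it reproduces the "Calculated minimum iters" and "Runtime profitability threshold" lines of both dumps; it is not quoted from the GCC sources. The names VectCosts, min_profitable_iters and runtime_threshold are hypothetical.

```cpp
#include <algorithm>

// Inputs as printed in the "Cost model analysis" dumps above.
struct VectCosts {
  int vf;                   // vectorization_factor
  int vec_inside_cost;      // Vector inside of loop cost
  int vec_outside_cost;     // Vector outside cost
  int scalar_iter_cost;     // Scalar iteration cost
  int scalar_outside_cost;  // Scalar outside cost
  int peel_prologue;        // prologue iterations
  int peel_epilogue;        // epilogue iterations
};

// Assumed reconstruction of the break-even computation (not GCC's code).
// Returns -1 if one vector iteration is not cheaper than vf scalar ones.
int min_profitable_iters(const VectCosts &c) {
  int saved_per_vec_iter = c.scalar_iter_cost * c.vf - c.vec_inside_cost;
  if (saved_per_vec_iter <= 0)
    return -1;
  if (c.vec_outside_cost <= 0)
    return 1;
  int extra_outside = (c.vec_outside_cost - c.scalar_outside_cost) * c.vf;
  // Integer division: the quotient is truncated.
  int iters = (extra_outside
               - c.vec_inside_cost * c.peel_prologue
               - c.vec_inside_cost * c.peel_epilogue) / saved_per_vec_iter;
  // Bump by one if the vector version is not yet strictly cheaper.
  if (c.scalar_iter_cost * c.vf * iters
      <= c.vec_inside_cost * iters + extra_outside)
    ++iters;
  return iters;
}

// Assumed: the threshold is at least vf, minus one because the guard
// is of the form "niters <= threshold -> use the scalar loop".
int runtime_threshold(const VectCosts &c) {
  int iters = min_profitable_iters(c);
  if (iters < 0)
    return -1;
  return std::max(iters, c.vf) - 1;
}
```

With the bdver2 numbers {4, 4, 8, 4, 0, 2, 2} this gives 2 and 3; with the generic numbers {4, 1, 13, 1, 0, 2, 2} it gives 17 and 16. niters = 8 is above the first threshold and below the second, which matches the differing decisions in the dumps. Note that in this sketch the bdver2 quotient is 16 / 12, truncated to 1 before the final adjustment.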
Testcase:
struct ggSpectrum {
  void Set (float d)
  {
    for (int i = 0; i < 8; i++)
      data[i] = d;
  }
  float data[8];
};
void foo (ggSpectrum *s, float d)
{
  s->Set(d);
}
Now the best course of action is of course to not even consider peeling
this loop for alignment ... (if it can otherwise vectorize).
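As a rough illustration of why peeling is unattractive here, the sketch below splits the 8 stores into prologue, vector and epilogue iterations for a given start address. It is a simplified model, not what the vectorizer emits: split_for_alignment and PeelSplit are hypothetical names, and the model assumes the address is a multiple of sizeof(float) and that the vector must be aligned to vf * sizeof(float) bytes.

```cpp
#include <cstdint>

struct PeelSplit {
  int prologue;      // scalar iterations before the first aligned store
  int vector_iters;  // full vector iterations after peeling
  int epilogue;      // scalar iterations left over at the end
};

// Hypothetical helper: split niters float stores starting at byte
// address addr, for vectors of vf floats.
PeelSplit split_for_alignment(std::uintptr_t addr, int niters, int vf) {
  // Position of the first element within a vf-element aligned group.
  int elem_misalign = static_cast<int>((addr / sizeof(float)) % vf);
  // Scalar iterations needed to reach the next aligned address.
  int prologue = (vf - elem_misalign) % vf;
  if (prologue > niters)
    prologue = niters;
  int remaining = niters - prologue;
  return {prologue, remaining / vf, remaining % vf};
}
```

In this model, with niters = 8 and vf = 4, an aligned start gives two vector iterations and nothing else, while any misaligned start leaves a single vector iteration plus four scalar iterations split between prologue and epilogue (for example 2 + 2 when the start is off by 8 bytes, the vf/2 case the cost model assumes above). Without peeling, the same loop would be two unaligned vector stores.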
I think we run into round-off errors with my fix on bdver2; I have a crude
fix for that.