Hi,

I am resuming the investigation into disabling peeling for alignment
(see the thread at http://gcc.gnu.org/ml/gcc/2012-12/msg00036.html).
As a reminder, I have a simple patch which disables peeling
unconditionally and gives some improvement in benchmarks. However,
I've noticed a regression, for which a reduced test case is:

#define SIZE 8
void func(float *data, float d)
{
  int i;
  for (i = 0; i < SIZE; i++)
    data[i] = d;
}

With peeling enabled, the compiler generates:

        fsts    s0, [r0]
        fsts    s0, [r0, #4]
        fsts    s0, [r0, #8]
        fsts    s0, [r0, #12]
        fsts    s0, [r0, #16]
        fsts    s0, [r0, #20]
        fsts    s0, [r0, #24]
        fsts    s0, [r0, #28]

With my patch, the compiler generates:

        vdup.32 q0, d0[0]
        vst1.32 {q0}, [r0]!
        vst1.32 {q0}, [r0]
        bx      lr

The performance regression is mostly caused by the dependency between
vdup and vst1 (removing the dependency on the r0 post-increment did
not show any performance improvement).

I have tried to modify the vectorizer cost model so that
scalar->vector statements have a higher cost than they currently do,
in the hope that the loop prologue would become too expensive; but
the cost has to be increased quite a lot before that happens, so this
approach does not seem right. (A rough illustration of this
experiment is sketched at the end of this mail.)

The vectorizer estimates the cost of the prologue/epilogue/loop body
with and without vectorization, and computes the number of iterations
needed for profitability. In the present case, with reasonable costs,
this number is very low (typically 2 or 3), while the compiler knows
for sure that we have 8 iterations.

I think we need some way to describe the dependency between vdup and
vst1; otherwise, from the vectorizer's point of view, this looks like
an ideal loop.

Do you have suggestions on how to tackle this? (I've just had a look
at the recent vectorizer cost model modification, which doesn't seem
to handle this case.)

Thanks,

Christophe.

On 13 December 2012 10:42, Richard Biener <richard.guent...@gmail.com> wrote:
> On Wed, Dec 12, 2012 at 6:50 PM, Andi Kleen <a...@firstfloor.org> wrote:
>> "H.J. Lu" <hjl.to...@gmail.com> writes:
>>>
>>> i386.c has
>>>
>>>   {
>>>     /* When not optimize for size, enable vzeroupper optimization for
>>>        TARGET_AVX with -fexpensive-optimizations and split 32-byte
>>>        AVX unaligned load/store.  */
>>
>> This is only for the load, not for deciding whether peeling is
>> worthwhile or not.
>>
>> I believe it's unimplemented for x86 at this point. There isn't even a
>> hook for it. Any hook that is added should ideally work for both ARM64
>> and x86. This would imply it would need to handle different vector
>> sizes.
>
> There is
>
> /* Implement targetm.vectorize.builtin_vectorization_cost.  */
> static int
> ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>                                  tree vectype,
>                                  int misalign ATTRIBUTE_UNUSED)
> {
> ...
>       case unaligned_load:
>       case unaligned_store:
>         return ix86_cost->vec_unalign_load_cost;
>
> which indeed doesn't distinguish between unaligned load and unaligned
> store costs. Still, it does distinguish between aligned and unaligned
> load/store costs.
>
> Now look at the cost tables and see different unaligned vs. aligned
> costs depending on the target CPU.
>
> generic32 and generic64 have:
>
>   1,                                    /* vec_align_load_cost.  */
>   2,                                    /* vec_unalign_load_cost.  */
>   1,                                    /* vec_store_cost.  */
>
> The missing piece in the vectorizer is that peeling for alignment
> should have the option to turn itself off based on those costs (and
> analysis).
>
> Richard.
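
P.S. For reference, the cost model experiment I mentioned above looked
roughly like this. This is an illustrative sketch, not the actual
patch: the hook shape follows the ix86_builtin_vectorization_cost
quoted above, and the value returned for scalar_to_vec is made up to
show the order of magnitude needed before the loop is rejected:

  /* Sketch only -- not real backend code.  Raising the cost of
     scalar_to_vec (the vdup in the prologue) is what I experimented
     with; it has to go far beyond its real cost before vectorizing
     the 8-iteration loop above becomes unprofitable.  */
  static int
  arm_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                  tree vectype, int misalign)
  {
    switch (type_of_cost)
      {
      case scalar_to_vec:
        /* Made-up value, roughly what it takes to defeat the loop.  */
        return 8;
      default:
        return default_builtin_vectorization_cost (type_of_cost,
                                                   vectype, misalign);
      }
  }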
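
And on Richard's last point: a minimal sketch of what "turning peeling
off based on those costs" could look like, assuming a hypothetical
helper inside the vectorizer (peeling_profitable_p is my invention,
not existing code; it only queries the existing cost hook):

  /* Hypothetical helper -- no such function exists today.  Query the
     target's cost hook for aligned vs. unaligned accesses and skip
     peeling when the unaligned access is no more expensive.  */
  static bool
  peeling_profitable_p (tree vectype, int misalign)
  {
    int unaligned_cost
      = targetm.vectorize.builtin_vectorization_cost (unaligned_load,
                                                      vectype, misalign);
    int aligned_cost
      = targetm.vectorize.builtin_vectorization_cost (vector_load,
                                                      vectype, 0);

    /* On generic32/64 above (2 vs. 1) peeling can pay off; when the
       costs are equal, peeling only adds prologue/epilogue overhead
       and should turn itself off.  */
    return unaligned_cost > aligned_cost;
  }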