On Wed, 31 Jan 2018, Christophe Lyon wrote:

> On 30 January 2018 at 11:47, Jakub Jelinek <ja...@redhat.com> wrote:
> > On Tue, Jan 30, 2018 at 11:07:50AM +0100, Richard Biener wrote:
> >>
> >> I have been asked to push this change, fixing (somewhat) the impreciseness
> >> of costing constant/invariant vector uses in SLP stmts.  The previous
> >> code always just considered a single constant to be generated in the
> >> prologue irrespective of how many we'd need.  With this patch we
> >> properly handle this count and optimize for the case when we can use
> >> a vector splat.  It doesn't yet handle CSE (or CSE among stmts) which
> >> means it could in theory regress cases it overall costed correctly
> >> before "optimistically" (aka by accident).  But at least the costing
> >> now matches code generation.
> >>
> >> Bootstrapped and tested on x86_64-unknown-linux-gnu.  On x86_64
> >> Haswell with AVX2, SPEC 2k6 shows no off-noise changes.
> >>
> >> The patch is said to help the case in the PR when additional backend
> >> costing changes are done (for AVX512).
> >>
> >> Ok for trunk at this stage?
> >
> > LGTM.
> >
> >> 2018-01-30  Richard Biener  <rguent...@suse.de>
> >>
> >> 	PR tree-optimization/83008
> >> 	* tree-vect-slp.c (vect_analyze_slp_cost_1): Properly cost
> >> 	invariant and constant vector uses in stmts when they need
> >> 	more than one stmt.
> >
> > 	Jakub
>
> Hi Richard,
>
> This patch caused a regression on aarch64*:
> FAIL: gcc.dg/cse_recip.c scan-tree-dump-times optimized "rdiv_expr" 1
> (found 2 times)
> we used to have:
> PASS: gcc.dg/cse_recip.c scan-tree-dump-times optimized "rdiv_expr" 1
We now vectorize this on aarch64 -- looks like there's a V2SFmode
available.  This means we get 1/x computed and divide by {x, x}.  The
former is non-optimal because we leave dead code around after SLP
vectorization which the multi-use check of the recip pass trips on to
make this transform profitable.  That's worth a bugreport I think.

For the testcase I'd simply adjust it to pass -fno-tree-slp-vectorize
-- or make sure to run the recip pass before vectorization.  Not sure
why it doesn't run before loop optimizations?

Index: gcc/passes.def
===================================================================
--- gcc/passes.def	(revision 257233)
+++ gcc/passes.def	(working copy)
@@ -263,6 +263,7 @@ along with GCC; see the file COPYING3.
 	  NEXT_PASS (pass_asan);
 	  NEXT_PASS (pass_tsan);
 	  NEXT_PASS (pass_dce);
+	  NEXT_PASS (pass_cse_reciprocals);
 	  /* Pass group that runs when 1) enabled, 2) there are loops
 	     in the function.  Make sure to run pass_fix_loops before
 	     to discover/remove loops before running the gate function
@@ -317,7 +318,6 @@ along with GCC; see the file COPYING3.
       POP_INSERT_PASSES ()
       NEXT_PASS (pass_simduid_cleanup);
       NEXT_PASS (pass_lower_vector_ssa);
-      NEXT_PASS (pass_cse_reciprocals);
       NEXT_PASS (pass_sprintf_length, true);
       NEXT_PASS (pass_reassoc, false /* insert_powi_p */);
       NEXT_PASS (pass_strength_reduction);

puts it right before loop opts and after a DCE pass.  This results in
us no longer vectorizing the code:

Vector inside of basic block cost: 4
Vector prologue cost: 4
Vector epilogue cost: 0
Scalar cost of basic block: 6
/space/rguenther/src/svn/early-lto-debug/gcc/testsuite/gcc.dg/cse_recip.c:10:1: note: not vectorized: vectorization is not profitable.

Not sure if we want to shuffle passes at this stage though.

Richard.