On Wed, 16 Jun 2021, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > This adds a simple reduction vectorization capability to the
> > non-loop vectorizer.  Simple meaning it lacks any of the fancy
> > ways to generate the reduction epilogue and only supports
> > those we can handle via a direct internal function reducing
> > a vector to a scalar.  One of the main reasons is to avoid
> > massive refactoring at this point, but also more complex
> > epilogue operations are hardly profitable.
> >
> > Mixed-sign reductions are fended off for now, and I haven't
> > finally settled on whether we want an explicit SLP node for the
> > reduction epilogue operation.  Handling mixed signs could be
> > done by multiplying with a { 1, -1, .. } vector.  Also fended
> > off are reductions with non-internal operands (constants
> > or register parameters, for example).
> >
> > Costing is done by accounting the original scalar participating
> > stmts for the scalar cost and log2 permutes and operations for
> > the vectorized epilogue.
>
> It would be good if we had a standard way of asking for this
> cost for both loops and SLP, perhaps based on the internal function.
> E.g. for aarch64 we have a cost table that gives a more precise cost
> (and log2 of the scalar op isn't always it :-)).
>
> I don't have any specific suggestion how though.  And I guess it
> can be a follow-on patch anyway.

Yeah, the only idea I came up with that would work "now" would be to
build a fake gimple stmt and feed that to the add_stmt_cost hook ...

In the end the cost hook will be passed the SLP nodes, but at the
moment the reduction op doesn't have any; it's implicit in the SLP
instance kind "root" handling.  We would need to add a lane-reducing
SLP node kind; I'd rather not abuse VEC_PERM_EXPR nodes directly here.
There's some PR asking for straight-line vectorization use of SAD,
which usually ends up doing a 16xchar -> 4xint "reduction" - we'd need
some way to lay out input/output lanes for such an operation (and of
course specify the reduction operation).  The root SLP node for a
reduction-to-scalar operation would then be .IFN_REDUC_PLUS_SCAL and
it would have a single output lane.  But as said, I would want it to
handle "partial" reductions as well, like for SAD pattern detection.
Maybe treating it as a black box is good enough though - who knows ;)
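To make that lane layout concrete, a rough plain-C sketch of the
16xchar -> 4xint shape (the function name and the fixed group size of
four are made up for the example; the straight-line case the PR asks
about is the fully unrolled form of this):

  /* Each of the four output lanes reduces the absolute differences
     of four char input lanes - illustrative only.  */
  void
  sad_16to4 (const unsigned char *a, const unsigned char *b, int *res)
  {
    for (int j = 0; j < 4; j++)
      {
        int sum = 0;
        for (int i = 0; i < 4; i++)
          {
            int d = (int) a[j * 4 + i] - (int) b[j * 4 + i];
            sum += d < 0 ? -d : d;
          }
        res[j] = sum;
      }
  }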
Thus for now I went with a conservative estimate - I hope it _is_
conservative in all cases, is it?  (not exactly, as can be seen below)

Well, if all else fails, adding some entry to vect_cost_for_stmt would
work as well, I guess.  We do pass the last stmt of the reduction, and
with say vector_reduc_to_scalar the backend could second-guess the
actual operation carried out - but then it would have to, since the
default hook implementations are not selective on new cost kinds.

> > SPEC CPU 2017 FP with rate workload measurements show (picked
> > fastest runs of three) regressions for 507.cactuBSSN_r (1.5%),
> > 508.namd_r (2.5%), 511.povray_r (2.5%), 526.blender_r (0.5%) and
> > 527.cam4_r (2.5%) and improvements for 510.parest_r (5%) and
> > 538.imagick_r (1.5%).  This is with -Ofast -march=znver2 on a Zen2.
> >
> > Statistics on CPU 2017 show that the overwhelming number of seeds
> > we find are reductions of two lanes (well - that's basically every
> > associative operation).  That means we put quite high pressure
> > on the SLP discovery process this way.
> >
> > In total we find 583218 seeds we put to SLP discovery, out of which
> > 66205 pass that and only 6185 of those make it through code
> > generation checks.  796 of those are discarded because the
> > reduction is part of a larger SLP instance.  4195 of the remaining
> > are deemed not profitable to vectorize and 1194 are finally
> > vectorized.  That's a poor 0.2% rate.
>
> Oof.

Yeah, that's of course because every single 'plus' is a reduction of
two lanes ...

> > Of the 583218 seeds, 486826 (83%) have two lanes, 60912 have three
> > (10%), 28181 four (5%), 4808 five, 909 six, and there are instances
> > up to 120 lanes.
> >
> > There's a set of 54086 candidate seeds we reject because
> > they contain a constant or invariant (not implemented yet) but
> > still have two or more lanes that could be put to SLP discovery.
>
> It looks like the patch doesn't explicitly forbid 2-element reductions
> and instead relies on the cost model.  Is that right?

Yes, because such a two-lane seed is the only place we'd start from
when trying to vectorize something like
(x[0] * (b[0] + d[0])) + (x[1] * (b[1] + d[1])).  But yes, I do hope
that for plain x[0] + x[1] we either say costing isn't worthwhile or
we generate reasonable code.  On x86_64 it produces (with generic
costing):

        movupd  (%rdi), %xmm1
        movapd  %xmm1, %xmm0
        unpckhpd        %xmm1, %xmm0
        addpd   %xmm1, %xmm0

compared to

        movsd   (%rdi), %xmm0
        addsd   8(%rdi), %xmm0

:/  But this also feels like some missed optimization on RTL.
The x86 backend expands .REDUC_PLUS to full vector operations here
while the last op could have used an addps (pd is chosen for code
size reasons).
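For reference, the epilogue shape that the log2 costing models - and
that the expansion above instantiates for two lanes - can be written
down in GNU C vector extensions.  A sketch with four float lanes; the
typedefs and function name are made up for the example:

  typedef float v4sf __attribute__ ((vector_size (16)));
  typedef int v4si __attribute__ ((vector_size (16)));

  /* Reduce four lanes to a scalar in log2(4) = 2 permute+add steps.  */
  float
  reduc_plus_v4sf_sketch (v4sf v)
  {
    /* Step 1: add the upper half onto the lower half.  */
    v += __builtin_shuffle (v, (v4si) { 2, 3, 2, 3 });
    /* Step 2: add the two remaining live lanes.  */
    v += __builtin_shuffle (v, (v4si) { 1, 1, 1, 1 });
    return v[0];
  }

The two-lane double case above is the single-step instance of the same
pattern (one permute, one add).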
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> > built and tested SPEC CPU 2017 with -Ofast -march=znver2
> > successfully.
> >
> > I do think this is good enough(TM) at this point, please speak up
> > if you disagree and/or would like to see changes.
>
> No objection from me FWIW.  Looks like a nice feature :-)

It was the last obvious seed candidate for looking for SLP
opportunities before starting to match up lanes from random
equal-looking operations (with the caveat that we need to extract all
lanes as scalars from the result).

Richard.

> Thanks,
> Richard
>
> > Thanks,
> > Richard.
> >
> > 2021-06-16  Richard Biener  <rguent...@suse.de>
> >
> > 	PR tree-optimization/54400
> > 	* tree-vectorizer.h (enum slp_instance_kind): Add
> > 	slp_inst_kind_bb_reduc.
> > 	(reduction_fn_for_scalar_code): Declare.
> > 	* tree-vect-data-refs.c (vect_slp_analyze_instance_dependence):
> > 	Check SLP_INSTANCE_KIND instead of looking at the
> > 	representative.
> > 	(vect_slp_analyze_instance_alignment): Likewise.
> > 	* tree-vect-loop.c (reduction_fn_for_scalar_code): Export.
> > 	* tree-vect-slp.c (vect_slp_linearize_chain): Split out
> > 	chain linearization from vect_build_slp_tree_2 and generalize
> > 	for the use of BB reduction vectorization.
> > 	(vect_build_slp_tree_2): Adjust accordingly.
> > 	(vect_optimize_slp): Elide permutes at the root of BB reduction
> > 	instances.
> > 	(vectorizable_bb_reduc_epilogue): New function.
> > 	(vect_slp_prune_covered_roots): Likewise.
> > 	(vect_slp_analyze_operations): Use them.
> > 	(vect_slp_check_for_constructors): Recognize associatable
> > 	chains for BB reduction vectorization.
> > 	(vectorize_slp_instance_root_stmt): Generate code for the
> > 	BB reduction epilogue.
> >
> > 	* gcc.dg/vect/bb-slp-pr54400.c: New testcase.
> > ---
> >  gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c |  43 +++
> >  gcc/tree-vect-data-refs.c                  |   9 +-
> >  gcc/tree-vect-loop.c                       |   2 +-
> >  gcc/tree-vect-slp.c                        | 383 +++++++++++++++++----
> >  gcc/tree-vectorizer.h                      |   2 +
> >  5 files changed, 367 insertions(+), 72 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)