Richard Biener <rguent...@suse.de> writes:
> On Tue, 4 Jun 2024, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> > The following emulates classical interleaving for SLP load permutes
>> > that we are unlikely handling natively.  This is to handle cases
>> > where interleaving (or load/store-lanes) is the optimal choice for
>> > vectorizing even when we are doing that within SLP.  An example
>> > would be
>> >
>> > void foo (int * __restrict a, int * b)
>> > {
>> >   for (int i = 0; i < 16; ++i)
>> >     {
>> >       a[4*i + 0] = b[4*i + 0] * 3;
>> >       a[4*i + 1] = b[4*i + 1] + 3;
>> >       a[4*i + 2] = (b[4*i + 2] * 3 + 3);
>> >       a[4*i + 3] = b[4*i + 3] * 3;
>> >     }
>> > }
>> >
>> > where currently the SLP store is merging four single-lane SLP
>> > sub-graphs but none of the loads in it can be code-generated
>> > with V4SImode vectors and a VF of four as the permutes would need
>> > three vectors.
>>
>> Nice!
>>
>> > The patch introduces a lowering phase after SLP discovery but
>> > before SLP pattern recognition or permute optimization that
>> > analyzes all loads from the same dataref group and creates an
>> > interleaving scheme starting from an unpermuted load.
>> >
>> > What can be handled is quite restrictive, matching only a subset
>> > of the non-SLP interleaving cases (the power-of-two group size
>> > ones, in addition only cases without gaps).  The interleaving
>> > vectorization in addition can handle size 3 and 5 - but I am not
>> > sure if it's possible to do that in a VL agnostic way.  It
>> > should be still possible to set up the SLP graph in a way that
>> > a load-lane could be matched from SLP pattern recognition.
>>
>> Yeah, I don't think it would be possible to decompose a 3- or
>> 5-lane grouped load into a series of VLA 2-input permutes.
>> But (as I think you're saying) it seems like a load-3-lanes would just
>> be a load with a LANE_PERMUTATION of N, N+3, N+6, N+9, ... for lane N.
>> Is that right?
>
> Yes, that's how it looks without this patch.  I think we'd need
> a load node loading N, N+1, N+2, ... and then permute nodes
> with N, N+3, ... and N+1, N+4, ... and N+2, N+5 ... so we generate
> one .LOAD_LANES from the load node and the permutes pick up the
> correct vector defs?  I'm not sure yet how classification and
> code generation would work for this.
>
> The store side is already on trunk with the single SLP store node
> getting lanes via permutes.
>
> It might be we want a load/store node with N inputs/outputs as the
> best representation and use lane_permutation to indicate the
> input (for stores) and output (for loads) "permute".
>
>> > As said gaps are currently not handled - for SLP we have a
>> > representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
>> > would need to be filled in some way (even if we just push NULL).
>> >
>> > The patch misses multi-level even/odd handling as well as CSEing
>> > intermediate generated permutes.  Both is quite straight-forward
>> > to add, but eventually there's a better or more general strategy
>> > for lowering?  The main goal of the patch is to avoid falling
>> > back to non-SLP for cases the interleaving code handles.
>>
>> Does the multi-level thing include examples like:
>>
>>   int a[2 * 16];
>>   int b[8 * 16];
>>   void f()
>>   {
>>     for (int i = 0; i < 16; ++i)
>>       {
>>         a[i * 2 + 0] += b[i * 8 + 0] + b[i * 8 + 1] + b[i * 8 + 2] + b[i * 8 + 3];
>>         a[i * 2 + 1] += b[i * 8 + 4] + b[i * 8 + 5] + b[i * 8 + 6] + b[i * 8 + 7];
>>       }
>>   }
>>
>> ?
>> For that we generate:
>>
>>   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
>>   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
>>   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
>>   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
>>   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
>>   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
>>   _53 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 1, 4, 5 }>;
>>   _52 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 1, 4, 5 }>;
>>   _51 = VEC_PERM_EXPR <_53, _52, { 1, 3, 5, 7 }>;
>>   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
>>
>> (two even level 1, one even level 2, one odd level 1), whereas
>> preferring 2xeven + 2xodd would avoid the third set of first-level
>> permutes:
>>
>>   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
>>   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
>>   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
>>   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
>>   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
>>   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
>>   _51 = VEC_PERM_EXPR <_45, _44, { 0, 2, 4, 6 }>;
>>   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
>
> The multi-level issue is more when a single reduction to N/2
> inputs still doesn't get you to the point where you can do
> the permute with two inputs.  I think the above is more because
> each load is handled individually, not taking into account
> redundancies across loads when you have freedom in the
> even/odd, level combinations (which you usually have).
>
> I suppose it should be possible to handle this to some extent,
> not sure what the best strategy is when trying to avoid brute-force
> searching for an optimal set (esp. when multi-level interleaving
> will be involved).
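For reference, the permute structure in the first dump can be written
as a small stand-alone GNU C sketch; the helper name and the assumption
that v0..v3 are the four V4SI loads covering b[0..15] in order are
invented for illustration, this is not vectorizer output:

  typedef int v4si __attribute__ ((vector_size (16)));

  /* Pick the even ({ 0, 2, 4, 6 }) or odd ({ 1, 3, 5, 7 }) elements of
     the concatenation of A and B - the building block used at every
     level of the reduction.  */
  static v4si
  even_odd_pick (v4si a, v4si b, int odd)
  {
    v4si m = odd ? (v4si) { 1, 3, 5, 7 } : (v4si) { 0, 2, 4, 6 };
    return __builtin_shuffle (a, b, m);
  }

  /* If v0..v3 cover b[0..15] in order, then
       t0 = even_odd_pick (v0, v1, 1);   // _45 = { b1, b3, b5, b7 }
       t1 = even_odd_pick (v2, v3, 1);   // _44 = { b9, b11, b13, b15 }
       r  = even_odd_pick (t0, t1, 1);   // _43 = { b3, b7, b11, b15 }
     and reusing t0/t1 (and their even counterparts) for both the even
     and the odd second-level picks is what drops the third set of
     first-level permutes in the second sequence above.  */
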
Was wondering about picking the smaller of ctz_hwi (even) and
ctz_hwi (odd) (when both are nonzero).  I.e. something like:

  // HOST_BITS_PER_WIDE_INT for zero.
  auto even_bits = ctz_hwi (even);
  auto odd_bits = ctz_hwi (odd);
  if (even_bits <= odd_bits)
    {
      unsigned level = 1 << even_bits;
      gcc_assert (group_lanes % (2 * level) == 0);
      ...
    }
  else if (odd)
    {
      unsigned level = 1 << odd_bits;
      gcc_assert (group_lanes % (2 * level) == 0);
      ...
    }
  else
    gcc_unreachable ();

Picking the smaller level should guarantee that it fits.

That handles this case (and is how I generated the above), but maybe
it's worse for something else.

(And yeah, like you said in the follow-up, it's definitely better than
what we generated for GCC 14. :) )

>> > Comments and suggestions welcome, esp. what representation
>> > you'd think is suitable for SLP pattern matching to
>> > load/store-lane and how to represent that?  Maybe this lowering
>> > should happen directly in vect_lower_load_permutations?
>>
>> If the load-lanes representation is as simple as above, it sounds like
>> it could be deferred to pattern matching.  Not sure what the result
>> would look like though.  It would be nice if (at least for costing
>> purposes) we could have a single node for all lanes of the load-lanes,
>> rather than create a separate node for each lane and rely on later CSE.
>> (Or do we already have a good representation for this?  It's been too
>> long, sorry.)
>
> Yeah, as said above having a load-lane node with multiple outputs
> would be the best match, similar for store-lane.  It's probably
> easiest to generate those right from this lowering until we
> re-implement SLP discovery from scratch.

OK.

>> Bit of trivia below:
>>
>> > Thanks,
>> > Richard.
>> >
>> >     * tree-vect-slp.cc (vllp_cmp): New function.
>> >     (vect_lower_load_permutations): Likewise.
>> >     (vect_analyze_slp): Call it.
>> > ---
>> >  gcc/tree-vect-slp.cc | 279 +++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 279 insertions(+)
>> >
>> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> > index 7e3d0107b4e..766b773452f 100644
>> > --- a/gcc/tree-vect-slp.cc
>> > +++ b/gcc/tree-vect-slp.cc
>> > @@ -3839,6 +3839,279 @@ vect_analyze_slp_instance (vec_info *vinfo,
>> >    return res;
>> >  }
>> >
>> > +/* qsort comparator ordering SLP load nodes.  */
>> > +
>> > +static int
>> > +vllp_cmp (const void *a_, const void *b_)
>> > +{
>> > +  const slp_tree a = *(const slp_tree *)a_;
>> > +  const slp_tree b = *(const slp_tree *)b_;
>> > +  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
>> > +  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
>> > +  if (STMT_VINFO_GROUPED_ACCESS (a0)
>> > +      && STMT_VINFO_GROUPED_ACCESS (b0)
>> > +      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
>> > +    {
>> > +      /* Same group, order after lanes used.  */
>> > +      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
>> > +        return 1;
>> > +      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
>> > +        return -1;
>> > +      else
>> > +        {
>> > +          /* Try to order loads using the same lanes together, breaking
>> > +             the tie with the lane number that first differs.  */
>> > +          if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
>> > +              && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
>> > +            return 0;
>>
>> Does the comparison need to be "stable", with a further tie-breaker
>> when a != b?  Or does our qsort not rely on that?
>
> It doesn't, there's stable_sort if we'd rely on a specific order
> on the consumer side.
>> > +          else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
>> > +                   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
>> > +            return 1;
>> > +          else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
>> > +                   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
>> > +            return -1;
>> > +          else
>> > +            {
>> > +              for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
>> > +                if (SLP_TREE_LOAD_PERMUTATION (a)[i]
>> > +                    != SLP_TREE_LOAD_PERMUTATION (b)[i])
>> > +                  {
>> > +                    /* In-order lane first, that's what the above case for
>> > +                       no permutation does.  */
>> > +                    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
>> > +                      return -1;
>> > +                    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
>> > +                      return 1;
>> > +                    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
>> > +                             < SLP_TREE_LOAD_PERMUTATION (b)[i])
>> > +                      return -1;
>> > +                    else
>> > +                      return 1;
>> > +                  }
>> > +              return 0;
>> > +            }
>> > +        }
>> > +    }
>> > +  else /* Different groups or non-groups.  */
>> > +    {
>> > +      /* Order groups as their first element to keep them together.  */
>> > +      if (STMT_VINFO_GROUPED_ACCESS (a0))
>> > +        a0 = DR_GROUP_FIRST_ELEMENT (a0);
>> > +      if (STMT_VINFO_GROUPED_ACCESS (b0))
>> > +        b0 = DR_GROUP_FIRST_ELEMENT (b0);
>> > +      if (a0 == b0)
>> > +        return 0;
>> > +      /* Tie using UID.  */
>> > +      else if (gimple_uid (STMT_VINFO_STMT (a0))
>> > +               < gimple_uid (STMT_VINFO_STMT (b0)))
>> > +        return -1;
>> > +      else
>> > +        {
>> > +          gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
>> > +                      != gimple_uid (STMT_VINFO_STMT (b0)));
>> > +          return 1;
>> > +        }
>> > +    }
>> > +}
>> > +
>> > +/* Process the set of LOADS that are all from the same dataref group.  */
>> > +
>> > +static void
>> > +vect_lower_load_permutations (loop_vec_info loop_vinfo,
>> > +                              scalar_stmts_to_slp_tree_map_t *bst_map,
>> > +                              const array_slice<slp_tree> &loads)
>> > +{
>> > +  /* We at this point want to lower without a fixed VF or vector
>> > +     size in mind which means we cannot actually compute whether we
>> > +     need three or more vectors for a load permutation yet.  So always
>> > +     lower.  */
>> > +  stmt_vec_info first
>> > +    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
>> > +
>> > +  /* ??? In principle we have to consider a gap up to the next full
>> > +     vector, but we have to actually represent a scalar stmt for the
>> > +     gaps value so delay handling this.  The same is true for
>> > +     inbetween gaps which the load places in the load-permutation
>> > +     represent.  It's probably not worth trying an intermediate packing
>> > +     to vectors without gap even if that might handle some more cases.
>> > +     Instead get the gap case correct in some way.  */
>> > +  unsigned group_lanes = 0;
>> > +  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
>> > +    {
>> > +      if ((s == first && DR_GROUP_GAP (s) != 0)
>> > +          || (s != first && DR_GROUP_GAP (s) != 1))
>> > +        return;
>> > +      group_lanes++;
>> > +    }
>> > +  /* Only a power-of-two number of lanes matches interleaving.  */
>> > +  if (exact_log2 (group_lanes) == -1)
>> > +    return;
>> > +
>> > +  for (slp_tree load : loads)
>> > +    {
>> > +      /* Leave masked or gather loads alone for now.  */
>> > +      if (!SLP_TREE_CHILDREN (load).is_empty ())
>> > +        continue;
>> > +
>> > +      /* We need to lower only loads of less than half of the groups
>> > +         lanes, including duplicate lanes.  */
>> > +      if (SLP_TREE_LANES (load) >= group_lanes / 2)
>> > +        continue;
>> > +
>> > +      /* Lower by reducing the group to half its size using an
>> > +         interleaving scheme.  For this try to compute whether all
>> > +         elements needed for this loads are in even or odd elements of
>>
>> this load (or these loads)
>>
>> > +         a even/odd decomposition with N consecutive elements.
>>
>> an even/odd
>>
>> > +         Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
>> > +         with N == 2.  */
>> > +      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
>>
>> Is this DR_GROUP_SIZE (first) conceptually different from group_lanes
>> above?  If not, I think it'd be a bit easier to follow if this line reused
>> the exact_log2 result above.
>
> Once we look at groups with gaps it's different - the load permutation
> lane indices have gaps represented, so an a[0], a[2], a[3] group
> would have a load permutation of { 0, 2, 3 }.  group_lanes is the
> number of lanes in the output of the load which has unused/gap lanes
> stripped.
>
> I've short-cut handling of groups with intermediate gaps and also with
> gaps at the end for simplicity as I have to decide what to put into
> SLP_TREE_SCALAR_STMTS for the unpermuted SLP load node which would
> have those gaps "represented" (I'm quite sure a NULL ICEs left and
> right, duplicating the previous lane sounds appealing even though
> it's wrong ...).  As said, this is more a proof-of-concept ;)
>
> For the next iteration I'm going to add some test coverage, esp.
> also for the multi-level case and will see to handle gaps.

OK, thanks, sounds good.

Richard
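
As a stand-alone illustration of the even/odd classification and of
picking the smaller level discussed above: the function and variable
names below are invented, and this is a sketch for a power-of-two group
without gaps, not the patch's code.  A lane L is in the "even" half of
the decomposition with N = 1 << K consecutive elements iff bit K of L is
clear, and in the "odd" half iff it is set, so accumulating which bits
are clear (resp. set) for every lane a load needs yields a mask of the
usable levels; ctz of that mask gives the smallest one, which for a
power-of-two group always satisfies group_lanes % (2 * level) == 0.

  #include <stdio.h>

  static void
  classify (unsigned group_lanes, const unsigned *lanes, unsigned n)
  {
    unsigned even = group_lanes - 1, odd = group_lanes - 1;
    for (unsigned i = 0; i < n; ++i)
      {
        even &= ~lanes[i];  /* bit K survives iff every lane has bit K clear */
        odd &= lanes[i];    /* bit K survives iff every lane has bit K set */
      }
    if (even)
      printf ("even decomposition, level %u\n", 1u << __builtin_ctz (even));
    if (odd)
      printf ("odd decomposition, level %u\n", 1u << __builtin_ctz (odd));
  }

  int
  main ()
  {
    /* For the f() example above: lanes 0-3 resp. 4-7 of an 8-lane group
       give "even, level 4" resp. "odd, level 4".  Lanes { 2, 6 } give
       both "even, level 1" and "odd, level 2"; the suggestion above is
       to pick the smaller of the two levels.  */
    unsigned l0[] = { 0, 1, 2, 3 }, l1[] = { 4, 5, 6, 7 }, l2[] = { 2, 6 };
    classify (8, l0, 4);
    classify (8, l1, 4);
    classify (8, l2, 2);
    return 0;
  }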