Sorry for the slow response…

Richard Biener <rguent...@suse.de> writes:
> This splits SLP load nodes with load permutation into a SLP
> load node (with load "permutation" removing gaps) and a
> lane permutation node.  The first and foremost goal of this
> is to be able to have a single SLP node for each load group
> so we can start making decisions about how to vectorize
> them factoring in all loads of that group.  The second
> goal is to eventually be able to optimize permutations by
> pushing them down where they can be combined from multiple
> children to the output.  We do have one very special case
> handled by vect_attempt_slp_rearrange_stmts doing it all
> the way down for reductions that are associative.
Sounds great!

> For example for
>
>   l1 = a[0]; l2 = a[1];
>   b[0] = l1; b[1] = l2;
>   c[0] = l2; c[1] = l1;
>
> we can avoid generating loads twice.  For
>
>   l1 = a[0]; l2 = a[1]; l3 = a[2];
>   b[0] = l1; b[1] = l2;
>   c[0] = l2; c[1] = l3;
>
> we will have a SLP load node with three lanes plus
> two lane permutation nodes selecting two lanes each.  In
> a loop context this will cause a VF of two and three
> loads per vectorized loop iteration (plus two permutes)
> while previously we used a VF of one with two loads
> and no permutation per loop iteration.  In the new
> scheme the number of loads is lower but we pay with
> permutes and a higher VF.
>
> There is (bad) interaction with determining the vectorization
> factor, which for BB vectorization causes missed vectorizations
> because the "load parts of a dataref group directly" optimization
> is not (yet) reflected in the SLP graph.
>
> There is (bad) interaction with CTOR vectorization since we now
> get confused about where to insert vectorized stmts when combining
> two previously distinct SLP graphs.
>
> My immediate focus is on the SSA verification FAILs but the
> other part points at a missing piece in this - a "pass"
> that "optimizes" the SLP graph with respect to permutations
> and loads, ultimately for example deciding between using
> interleaving schemes, scalar loads, "SLP" + permutation,
> load-lanes, etc.  This is also the thing that blocks
> SLP only (apart from teaching the few pieces that cannot do SLP
> to do SLP).
>
> I'm posting this mostly to make people think about how it fits
> their future development and architecture features.

Yeah, the interleaving scheme is something we'd very much like for
SVE2, where for some operations that produce multiple vectors it
would be better to organise them on an even/odd division instead of
a low/high division.  There are instructions both to produce and to
consume values in even/odd form, so in many cases no explicit
permutation would be needed.
I've also seen cases for downward loops where we reverse vectors
after loading and reverse them again before storing, even though the
loop could easily have operated on the reversed vectors directly.

Another thing we'd like for SVE in general is to allow vectors *with*
gaps throughout the SLP graph, since with predication it's simple to
ignore any inactive element where necessary.  This is often much
cheaper than packing the active elements together to remove gaps.
For example:

  a[0] += 1;
  a[2] += 1;
  a[3] += 1;

should just be a predicated load, add and store, with no shuffling.

Even on targets without predication, it might be better to leave the
gaps in for part of the graph, e.g. if two loads feed a single
permute.  So it would be good if the node representation allowed
arbitrary permutations, including with “dead” elements at any
position.  (I realise that isn't news and that keeping the gap
removal with the load was just “for now” :-))

I guess this to some extent feeds into a long-standing TODO to allow
“don't care” elements in things like vec_perm_indices.

Thanks,
Richard