On 8/25/2022 7:07 AM, Richard Sandiford via Gcc-patches wrote:
Currently SLP tries to force permute operations "down" the graph
from loads in the hope of reducing the total number of permutations
needed or (in the best case) removing the need for the permutations
entirely. This patch tries to extend it as follows:
- Allow loads to take a different permutation from the one they
started with, rather than choosing between "original permutation"
and "no permutation".
- Allow changes in both directions, if the target supports the
reverse permutation.
- Treat the placement of permutations as a two-way dataflow problem:
after propagating information from leaves to roots (as now), propagate
information back up the graph.
- Take execution frequency into account when optimising for speed,
so that (for example) permutations inside loops have a higher
cost than permutations outside loops.
- Try to reduce the total number of permutations when optimising for
size, even if that increases the number of permutations on a given
execution path.
See the big block comment above vect_optimize_slp_pass for
a detailed description.
The original motivation for doing this was to add a framework that would
allow other layout differences in future. The two main ones are:
- Make it easier to represent predicated operations, including
predicated operations with gaps. E.g.:
a[0] += 1;
a[1] += 1;
a[3] += 1;
could be a single load/add/store for SVE. We could handle this
by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
(depending on what's being counted). We might need to move
elements between lanes at various points, like with permutes.
(This would first mean adding support for stores with gaps.)
- Make it easier to switch between an even/odd and unpermuted layout
when switching between wide and narrow elements. E.g. if a widening
operation produces an even vector and an odd vector, we should try
to keep operations on the wide elements in that order rather than
force them to be permuted back "in order".
Very cool. Richi and I discussed this a bit a year or so ago --
basically noting that bi-directional movement is really the way to go
and that the worst thing to do is push a permute down into the *middle*
of a computation chain since that will tend to break FMA generation.
Moving to the loads or stores or to another permute point ought to be
the goal.
It looks like you've covered that excessively well :-)
Jeff