On 8/30/2022 8:50 AM, Richard Sandiford wrote:
Jeff Law via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
On 8/25/2022 7:07 AM, Richard Sandiford via Gcc-patches wrote:
Currently SLP tries to force permute operations "down" the graph
from loads in the hope of reducing the total number of permutations
needed or (in the best case) removing the need for the permutations
entirely.  This patch extends that approach as follows:

- Allow loads to take a different permutation from the one they
    started with, rather than choosing between "original permutation"
    and "no permutation".

- Allow changes in both directions, if the target supports the
    reverse permutation.

- Treat the placement of permutations as a two-way dataflow problem:
    after propagating information from leaves to roots (as now), propagate
    information back up the graph.

- Take execution frequency into account when optimising for speed,
    so that (for example) permutations inside loops have a higher
    cost than permutations outside loops.

- Try to reduce the total number of permutations when optimising for
    size, even if that increases the number of permutations on a given
    execution path.

See the big block comment above vect_optimize_slp_pass for
a detailed description.
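
As a concrete illustration of the kind of choice the pass makes (the
snippet is hand-picked for this write-up rather than taken from the
patch): both loads below use the same reversed lane order, so the pass
can either keep two permuted loads or materialise a single permutation
after the addition, whichever the target and the cost model prefer.

  void
  f (double *restrict r, const double *restrict a,
     const double *restrict b)
  {
    /* Both operands of each addition are read in reversed lane order,
       so one option is to push the shared permutation below the add,
       turning two lane swaps into one.  */
    r[0] = a[1] + b[1];
    r[1] = a[0] + b[0];
  }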

The original motivation for doing this was to add a framework that would
allow other layout differences in future.  The two main ones are:

- Make it easier to represent predicated operations, including
    predicated operations with gaps.  E.g.:

       a[0] += 1;
       a[1] += 1;
       a[3] += 1;

    could be a single load/add/store for SVE.  We could handle this
    by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
    (depending on what's being counted).  We might need to move
    elements between lanes at various points, like with permutes.

    (This would first mean adding support for stores with gaps.)
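
    Purely as a sketch of the target-level idea (hand-written ACLE,
    not what the vectorizer would emit), the a[0]/a[1]/a[3] example
    above could map to a single predicated load/add/store along these
    lines, with lane 2 left inactive:

      #include <arm_sve.h>
      #include <stdint.h>

      void
      add_one_with_gap (int32_t *a)
      {
        /* Lanes 0, 1 and 3 of the first four elements are active;
           lane 2 is the gap.  The AND with a VL4 pattern stops longer
           vectors from touching a[4] onwards.  */
        svbool_t pg = svand_z (svptrue_b32 (),
                               svdupq_b32 (1, 1, 0, 1),
                               svptrue_pat_b32 (SV_VL4));
        svint32_t v = svld1 (pg, a);
        v = svadd_x (pg, v, 1);
        svst1 (pg, a, v);
      }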

- Make it easier to switch between an even/odd and unpermuted layout
    when switching between wide and narrow elements.  E.g. if a widening
    operation produces an even vector and an odd vector, we should try
    to keep operations on the wide elements in that order rather than
    force them to be permuted back "in order".
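
Again purely as an illustration (not from the patch), a loop like the
one below is the sort of place this matters.  On targets whose widening
multiply produces an even and an odd product vector, it would be better
to keep the 32-bit accumulation in that even/odd order than to permute
the products back into linear order first.

  void
  widen_mac (int *restrict acc, const short *restrict x,
             const short *restrict y, int n)
  {
    /* The 16-bit multiply widens to 32 bits; many targets produce
       the products as an even vector and an odd vector.  */
    for (int i = 0; i < n; i++)
      acc[i] += (int) x[i] * y[i];
  }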

Very cool.  Richi and I discussed this a bit a year or so ago --
basically noting that bi-directional movement is really the way to go
and that the worst thing to do is push a permute down into the *middle*
of a computation chain since that will tend to break FMA generation.
Moving to the loads or stores or to another permute point ought to be
the goal.

Hmm, I hadn't thought specifically about the case of permutes
ending up in the middle of a fusable operation.  The series doesn't
address that directly.  If we have:

   permute(a) * permute(b) + c

then the tendency will still be to convert that into:

   permute(a * b) + c

Damn.  Another case to think about ;-)
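
For concreteness, one source-level shape that produces this situation
(an illustrative example, not one taken from the thread or a PR) is the
following, where both multiply operands are loaded in reversed lane
order but the addend is not; pushing the shared permute below the
multiply gives the permute(a * b) + c form above and separates the
multiply from the add:

  void
  g (double *restrict r, const double *restrict a,
     const double *restrict b, const double *restrict c)
  {
    /* a and b are read in reversed lane order; c and r are not.  */
    r[0] = a[1] * b[1] + c[0];
    r[1] = a[0] * b[0] + c[1];
  }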

I've pushed the series for now though (thanks to Richi for the reviews).

There's a simple testcase attached to PR101895 that shows an example
where (in the gcc-11 era) we pushed a permute down in a problematic
way.  It might be worth a look, though I think we're avoiding the
problem behavior via a workaround on the trunk right now.

Jeff
