On Tue, 3 Nov 2020, Tamar Christina wrote:

> Hi All,
> 
> This change allows one to simplify lane permutes that select from multiple 
> load
> leafs that load from the same DR group by promoting the VEC_PERM node into a
> load itself and pushing the lane permute into it as a load permute.
> 
> This saves us from having to calculate where to materialize a new load node.
> If the resulting loads are now unused they are freed and are removed from the
> graph.
> 
> This allows us to handle cases where we would have generated:
> 
>       movi    v4.4s, 0
>       adrp    x3, .LC0
>       ldr     q5, [x3, #:lo12:.LC0]
>       mov     x3, 0
>       .p2align 3,,7
> .L2:
>       mov     v0.16b, v4.16b
>       mov     v3.16b, v4.16b
>       ldr     q1, [x1, x3]
>       ldr     q2, [x0, x3]
>       fcmla   v0.4s, v2.4s, v1.4s, #0
>       fcmla   v3.4s, v1.4s, v2.4s, #0
>       fcmla   v0.4s, v2.4s, v1.4s, #270
>       fcmla   v3.4s, v1.4s, v2.4s, #270
>       mov     v1.16b, v3.16b
>       tbl     v0.16b, {v0.16b - v1.16b}, v5.16b
>       str     q0, [x2, x3]
>       add     x3, x3, 16
>       cmp     x3, 1600
>       bne     .L2
>       ret
> 
> and instead generate
> 
>       mov     x3, 0
>       .p2align 3,,7
> .L27:
>       ldr     q0, [x2, x3]
>       ldr     q1, [x0, x3]
>       ldr     q2, [x1, x3]
>       fcmla   v0.2d, v1.2d, v2.2d, #0
>       fcmla   v0.2d, v1.2d, v2.2d, #270
>       str     q0, [x2, x3]
>       add     x3, x3, 16
>       cmp     x3, 512
>       bne     .L27
>       ret
> 
> This runs as a pre step such that permute simplification can still inspect 
> this
> permute is needed
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> Tests are included as part of the final patch as they need the SLP pattern
> matcher to insert permutes in between.
> 
> Ok for master?

So I think this is too specialized for the general issue that we're
doing a bad job in CSEing the load part of different permutes of
the same group.  I've played with fixing this half a year ago (again)
in multiple general ways but they all caused some regressions.

So you're now adding some heuristics as to when to anticipate
"CSE" (or merging with followup permutes).

To quickly recap what I did consider two loads (V2DF)
one { a[0], a[1] } and the other { a[1], a[0] }.  They
currently are two SLP nodes and one with a load_permutation.
My original attempts focused on trying to get rid of load_permutation
in favor of lane_permute nodes and thus during SLP discovery
I turned the second into { a[0], a[1] } (magically unified with
the other load) and a followup lane-permute node.

So for your case you have IIUC { a[0], a[0] } and { a[1], a[1] }
which eventually will (due to patterns) be lane-permuted
into { a[0], a[1] }, right?  So generalizing this as
a single { a[0], a[1] } plus two lane-permute nodes  { 0, 0 }
and { 1, 1 } early would solve the issue as well?  Now,
in general it might be more profitable to generate the
{ a[0], a[0] } and { a[1], a[1] } via scalar-load-and-splat
rather than vector load and permute so we have to be careful
to not over-optimize here or be prepared to do the reverse
transform.

The patch itself is a bit ugly since it modifies the SLP
graph when we already produced the graphds graph so I
would do any of this before.  I did consider gathering
all loads nodes loading from a group and then trying to
apply some heuristic to alter the SLP graph so it can
be better optimized.  In fact when we want to generate
the same code as the non-SLP interleaving scheme does
we do have to look at those since we have to unify
loads there.

I'd put this after vect_slp_build_vertices but before
the new_graph call - altering 'vertices' / 'leafs' should
be more easily possible and the 'leafs' array contains
all loads already (vect_slp_build_vertices could be massaged
to provide a map from DR_GROUP_FIRST_ELEMENT to slp_tree,
giving us the meta we want).

That said, I'd like to see something more forward-looking
rather than the ad-hoc special-casing of what you run into
with the pattern matching.

In case we want to still go with the special-casing it
should IMHO be done in a pre-order walk simply
looking for lane permute nodes with children that all
load from the same group performing what you do before
any of the vertices/graph stuff is built.  That's
probably easiest at this point and it can be done
when then bst_map is still around so you can properly
CSE the new load you build.

Thanks,
Richard.



> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>       * tree-vect-slp.c (vect_optimize_slp): Promote permutes.
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imend

Reply via email to