> -----Original Message-----
> From: rguent...@c653.arch.suse.de <rguent...@c653.arch.suse.de> On
> Behalf Of Richard Biener
> Sent: Wednesday, November 4, 2020 3:12 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; o...@ucw.cz
> Subject: RE: [PATCH v2 10/18]middle-end simplify lane permutes which
> selects from loads from the same DR.
> 
> On Wed, 4 Nov 2020, Tamar Christina wrote:
> 
> > Hi Richi,
> >
> > > -----Original Message-----
> > > From: rguent...@c653.arch.suse.de <rguent...@c653.arch.suse.de> On
> > > Behalf Of Richard Biener
> > > Sent: Wednesday, November 4, 2020 1:36 PM
> > > To: Tamar Christina <tamar.christ...@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; o...@ucw.cz
> > > Subject: Re: [PATCH v2 10/18]middle-end simplify lane permutes which
> > > selects from loads from the same DR.
> > >
> > > On Tue, 3 Nov 2020, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This change allows us to simplify lane permutes that select from
> > > > multiple load leaves which load from the same DR group, by promoting
> > > > the VEC_PERM node into a load itself and pushing the lane permute
> > > > into it as a load permute.
> > > >
> > > > This saves us from having to calculate where to materialize a new
> > > > load node.  If the resulting loads are now unused they are freed and
> > > > removed from the graph.
> > > >
> > > > This allows us to handle cases where we would have generated:
> > > >
> > > >         movi    v4.4s, 0
> > > >         adrp    x3, .LC0
> > > >         ldr     q5, [x3, #:lo12:.LC0]
> > > >         mov     x3, 0
> > > >         .p2align 3,,7
> > > > .L2:
> > > >         mov     v0.16b, v4.16b
> > > >         mov     v3.16b, v4.16b
> > > >         ldr     q1, [x1, x3]
> > > >         ldr     q2, [x0, x3]
> > > >         fcmla   v0.4s, v2.4s, v1.4s, #0
> > > >         fcmla   v3.4s, v1.4s, v2.4s, #0
> > > >         fcmla   v0.4s, v2.4s, v1.4s, #270
> > > >         fcmla   v3.4s, v1.4s, v2.4s, #270
> > > >         mov     v1.16b, v3.16b
> > > >         tbl     v0.16b, {v0.16b - v1.16b}, v5.16b
> > > >         str     q0, [x2, x3]
> > > >         add     x3, x3, 16
> > > >         cmp     x3, 1600
> > > >         bne     .L2
> > > >         ret
> > > >
> > > > and instead generate
> > > >
> > > >         mov     x3, 0
> > > >         .p2align 3,,7
> > > > .L27:
> > > >         ldr     q0, [x2, x3]
> > > >         ldr     q1, [x0, x3]
> > > >         ldr     q2, [x1, x3]
> > > >         fcmla   v0.2d, v1.2d, v2.2d, #0
> > > >         fcmla   v0.2d, v1.2d, v2.2d, #270
> > > >         str     q0, [x2, x3]
> > > >         add     x3, x3, 16
> > > >         cmp     x3, 512
> > > >         bne     .L27
> > > >         ret
> > > >
> > > > This runs as a pre-step so that permute simplification can still
> > > > inspect whether this permute is needed.
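> > > >
> > > > For reference, the kind of loop involved is very roughly of this
> > > > shape (illustrative only -- the actual tests come with the final
> > > > patch): a complex multiply-accumulate, here with one conjugated
> > > > operand, which is what the fcmla #0 / #270 rotation pairs above
> > > > correspond to:
> > > >
> > > >   #include <complex.h>
> > > >
> > > >   void
> > > >   f (_Complex float *restrict a, _Complex float *restrict b,
> > > >      _Complex float *restrict c, int n)
> > > >   {
> > > >     for (int i = 0; i < n; i++)
> > > >       c[i] += a[i] * conjf (b[i]);
> > > >   }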
> > > >
> > > > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> > > > Tests are included as part of the final patch as they need the SLP
> > > > pattern matcher to insert permutes in between.
> > > >
> > > > Ok for master?
> > >
> > > So I think this is too specialized for the general issue that we're
> > > doing a bad job in CSEing the load part of different permutes of the
> > > same group.  I played with fixing this (again) half a year ago in
> > > multiple general ways, but they all caused some regressions.
> > >
> > > So you're now adding some heuristics as to when to anticipate "CSE"
> > > (or merging with followup permutes).
> > >
> > > To quickly recap what I did: consider two loads (V2DF), one { a[0],
> > > a[1] } and the other { a[1], a[0] }.  They are currently two SLP
> > > nodes, one of them with a load_permutation.
> > > My original attempts focused on trying to get rid of
> > > load_permutation in favor of lane-permute nodes, and thus during SLP
> > > discovery I turned the second into { a[0], a[1] } (magically unified
> > > with the other load) plus a followup lane-permute node.
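> > >
> > > (As a made-up illustration only, not something from the patch: a
> > > source shape that gives exactly those two loads of the same group
> > > would be
> > >
> > >   void
> > >   g (double *restrict a, double *restrict b, double *restrict c)
> > >   {
> > >     b[0] = a[0] + 1.0;   /* uses { a[0], a[1] } */
> > >     b[1] = a[1] + 2.0;
> > >     c[0] = a[1] * 3.0;   /* uses { a[1], a[0] } */
> > >     c[1] = a[0] * 4.0;
> > >   }
> > >
> > > where we would like the reversed use to share the { a[0], a[1] } load
> > > and get a lane permute on top.)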
> > >
> > > So for your case you have IIUC { a[0], a[0] } and { a[1], a[1] }
> > > which eventually will (due to patterns) be lane-permuted into
> > > { a[0], a[1] }, right?  So generalizing this as a single
> > > { a[0], a[1] } plus two lane-permute nodes { 0, 0 } and { 1, 1 }
> > > early would solve the issue as well?
> >
> > Correct. I did wonder why it was generating two different nodes
> > instead of a lane permute, but didn't pay it much attention, assuming
> > it was just a shortcoming.
> >
> > > Now, in general it might be more profitable to generate the
> > > { a[0], a[0] } and { a[1], a[1] } via scalar-load-and-splat rather
> > > than vector load and permute, so we have to be careful not to
> > > over-optimize here, or be prepared to do the reverse transform.
> >
> > This could in principle be done in optimize_slp then, right? Since it
> > already does a lot of the same work and finds the materialization points.
> >
> > >
> > > The patch itself is a bit ugly since it modifies the SLP graph when
> > > we have already produced the graphds graph, so I would do any of this
> > > before.  I did consider gathering all load nodes loading from a
> > > group and then trying to apply some heuristic to alter the SLP graph
> > > so it can be better optimized.  In fact, when we want to generate the
> > > same code as the non-SLP interleaving scheme does, we do have to look
> > > at those, since we have to unify the loads there.
> > >
> >
> > Yes... I will concede the patch isn't my finest work.  I also don't
> > like the fact that I had to keep the leafs intact lest I break things
> > later.  But I wanted feedback :)
> >
> > > I'd put this after vect_slp_build_vertices but before the new_graph
> > > call - altering 'vertices' / 'leafs' should be more easily possible
> > > and the 'leafs' array contains all loads already
> > > (vect_slp_build_vertices could be massaged to provide a map from
> > > DR_GROUP_FIRST_ELEMENT to slp_tree, giving us the meta we want).
> > >
> > > That said, I'd like to see something more forward-looking rather
> > > than the ad-hoc special-casing of what you run into with the pattern
> > > matching.
> > >
> >
> > Yeah, I like your suggestion about doing it at build time and CSEing
> > early, but I don't think I can get that working in a week given that
> > you've already tried multiple times :)  Happy to give it a go at the
> > next stage-1 opening, though.
> >
> > > In case we still want to go with the special-casing, it should IMHO
> > > be done in a pre-order walk that simply looks for lane permute nodes
> > > with children that all load from the same group, performing what you
> > > do now before any of the vertices/graph stuff is built.  That's
> > > probably easiest at this point, and it can be done while the bst_map
> > > is still around so you can properly CSE the new load you build.
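> > >
> > > In rough, made-up C -- not GCC internals, all names and types here
> > > are invented purely for illustration -- that walk would have roughly
> > > this shape; the real thing would of course work on slp_tree nodes
> > > and CSE the promoted load through the bst_map:
> > >
> > >   #include <stddef.h>
> > >
> > >   enum kind { LOAD, VEC_PERM, OTHER };
> > >
> > >   struct node
> > >   {
> > >     enum kind kind;
> > >     int dr_group;             /* group id, LOAD nodes only */
> > >     struct node *children[4];
> > >     size_t nchildren;
> > >   };
> > >
> > >   /* Pre-order walk: if a VEC_PERM node's children all load from the
> > >      same group, turn the permute node itself into a load from that
> > >      group (its lane permute would become the load permutation) and
> > >      drop the now-unused children.  */
> > >   static void
> > >   promote_perms (struct node *n)
> > >   {
> > >     if (n->kind == VEC_PERM && n->nchildren > 0)
> > >       {
> > >         int group = n->children[0]->kind == LOAD
> > >                     ? n->children[0]->dr_group : -1;
> > >         for (size_t i = 1; group >= 0 && i < n->nchildren; i++)
> > >           if (n->children[i]->kind != LOAD
> > >               || n->children[i]->dr_group != group)
> > >             group = -1;
> > >         if (group >= 0)
> > >           {
> > >             n->kind = LOAD;
> > >             n->dr_group = group;
> > >             n->nchildren = 0;   /* children freed / shared for real */
> > >             return;
> > >           }
> > >       }
> > >     for (size_t i = 0; i < n->nchildren; i++)
> > >       promote_perms (n->children[i]);
> > >   }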
> >
> > That's fair enough.  I do think I need a temporary (not terrible)
> > workaround... This would then need to be somewhere in vect_analyze_slp.
> > Would you prefer I do it during the construction of the instance or
> > afterwards?
> 
> Well, right after pattern recog processed all instances - the issue only
> pops up due to pattern recog, right?

Ah, true *facepalm*. Though I'll need to refactor slightly to keep bst_map
around a bit longer.

Regards,
Tamar

> 
> > Regards,
> > Tamar
> >
> > >
> > > Thanks,
> > > Richard.
> > >
> > >
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * tree-vect-slp.c (vect_optimize_slp): Promote permutes.
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguent...@suse.de>
> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > > Nuernberg, Germany; GF: Felix Imend
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Felix Imend
