> -----Original Message----- > From: rguent...@c653.arch.suse.de <rguent...@c653.arch.suse.de> On > Behalf Of Richard Biener > Sent: Wednesday, November 4, 2020 3:12 PM > To: Tamar Christina <tamar.christ...@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; o...@ucw.cz > Subject: RE: [PATCH v2 10/18]middle-end simplify lane permutes which > selects from loads from the same DR. > > On Wed, 4 Nov 2020, Tamar Christina wrote: > > > Hi Richi, > > > > > -----Original Message----- > > > From: rguent...@c653.arch.suse.de <rguent...@c653.arch.suse.de> On > > > Behalf Of Richard Biener > > > Sent: Wednesday, November 4, 2020 1:36 PM > > > To: Tamar Christina <tamar.christ...@arm.com> > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; o...@ucw.cz > > > Subject: Re: [PATCH v2 10/18]middle-end simplify lane permutes which > > > selects from loads from the same DR. > > > > > > On Tue, 3 Nov 2020, Tamar Christina wrote: > > > > > > > Hi All, > > > > > > > > This change allows one to simplify lane permutes that select from > > > > multiple load leafs that load from the same DR group by promoting > > > > the VEC_PERM node into a load itself and pushing the lane permute > > > > into it as a > > > load permute. > > > > > > > > This saves us from having to calculate where to materialize a new load > node. > > > > If the resulting loads are now unused they are freed and are > > > > removed from the graph. > > > > > > > > This allows us to handle cases where we would have generated: > > > > > > > > movi v4.4s, 0 > > > > adrp x3, .LC0 > > > > ldr q5, [x3, #:lo12:.LC0] > > > > mov x3, 0 > > > > .p2align 3,,7 > > > > .L2: > > > > mov v0.16b, v4.16b > > > > mov v3.16b, v4.16b > > > > ldr q1, [x1, x3] > > > > ldr q2, [x0, x3] > > > > fcmla v0.4s, v2.4s, v1.4s, #0 > > > > fcmla v3.4s, v1.4s, v2.4s, #0 > > > > fcmla v0.4s, v2.4s, v1.4s, #270 > > > > fcmla v3.4s, v1.4s, v2.4s, #270 > > > > mov v1.16b, v3.16b > > > > tbl v0.16b, {v0.16b - v1.16b}, v5.16b > > > > str q0, [x2, x3] > > > > add x3, x3, 16 > > > > cmp x3, 1600 > > > > bne .L2 > > > > ret > > > > > > > > and instead generate > > > > > > > > mov x3, 0 > > > > .p2align 3,,7 > > > > .L27: > > > > ldr q0, [x2, x3] > > > > ldr q1, [x0, x3] > > > > ldr q2, [x1, x3] > > > > fcmla v0.2d, v1.2d, v2.2d, #0 > > > > fcmla v0.2d, v1.2d, v2.2d, #270 > > > > str q0, [x2, x3] > > > > add x3, x3, 16 > > > > cmp x3, 512 > > > > bne .L27 > > > > ret > > > > > > > > This runs as a pre step such that permute simplification can still > > > > inspect this permute is needed > > > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > > > Tests are included as part of the final patch as they need the SLP > > > > pattern matcher to insert permutes in between. > > > > > > > > Ok for master? > > > > > > So I think this is too specialized for the general issue that we're > > > doing a bad job in CSEing the load part of different permutes of the > > > same group. I've played with fixing this half a year ago (again) in > > > multiple general ways but they all caused some regressions. > > > > > > So you're now adding some heuristics as to when to anticipate "CSE" > > > (or merging with followup permutes). > > > > > > To quickly recap what I did consider two loads (V2DF) one { a[0], > > > a[1] } and the other { a[1], a[0] }. They currently are two SLP > > > nodes and one with a load_permutation. > > > My original attempts focused on trying to get rid of > > > load_permutation in favor of lane_permute nodes and thus during SLP > > > discovery I turned the second into { a[0], a[1] } (magically unified > > > with the other load) and a followup lane-permute node. > > > > > > So for your case you have IIUC { a[0], a[0] } and { a[1], a[1] } > > > which eventually will (due to patterns) be lane-permuted into { > > > a[0], a[1] }, right? So generalizing this as a single { a[0], a[1] > > > } plus two lane-permute nodes { 0, 0 } and { 1, 1 } early would solve the > issue as well? > > > > Correct, I did wonder why it was generating two different nodes > > instead of a lane permute but didn't pay much attention that it was just a > short coming. > > > > > Now, in general it might be > > > more profitable to generate the { a[0], a[0] } and { a[1], a[1] } > > > via scalar-load- and-splat rather than vector load and permute so we > > > have to be careful to not over-optimize here or be prepared to do the > reverse transform. > > > > This in principle can be done in optimize_slp then right? Since it > > would do a lot of the same work already and find the materialization points. > > > > > > > > The patch itself is a bit ugly since it modifies the SLP graph when > > > we already produced the graphds graph so I would do any of this > > > before. I did consider gathering all loads nodes loading from a > > > group and then trying to apply some heuristic to alter the SLP graph > > > so it can be better optimized. In fact when we want to generate the > > > same code as the non-SLP interleaving scheme does we do have to look > at those since we have to unify loads there. > > > > > > > Yes.. I will concede the patch isn't my finest work.. I also don't > > like the fact that I had to keep leafs in tact less I break things > > later. But wanted feedback :) > > > > > I'd put this after vect_slp_build_vertices but before the new_graph > > > call - altering 'vertices' / 'leafs' should be more easily possible > > > and the 'leafs' array contains all loads already > > > (vect_slp_build_vertices could be massaged to provide a map from > > > DR_GROUP_FIRST_ELEMENT to slp_tree, giving us the meta we want). > > > > > > That said, I'd like to see something more forward-looking rather > > > than the ad- hoc special-casing of what you run into with the pattern > matching. > > > > > > > Yeah, I like your suggestion about doing it at build time and CSEing > > early, but don't think I can get that work in a week given that you've > > already tried multiple times :) Happy to give it a go next stage-1 opening > though. > > > > > In case we want to still go with the special-casing it should IMHO > > > be done in a pre-order walk simply looking for lane permute nodes > > > with children that all load from the same group performing what you > > > do before any of the vertices/graph stuff is built. That's probably > > > easiest at this point and it can be done when then bst_map is still > > > around so you can properly CSE the new load you build. > > > > That's fair enough. I do think I need a temporary (not terrible) > > workaround...This would then need to be somewhere in vect_analyze_slp. > > Would you prefer I do it during the construction of the instance of > afterwards? > > Well, right after pattern recog processed all instances - the issue only pops > up > due to pattern recog, right? Ah true *facepalm* though I'll need to refactor slightly to keep bst_map around a bit longer. Regards, Tamar > > > Regards, > > Tamar > > > > > > > > Thanks, > > > Richard. > > > > > > > > > > > > > Thanks, > > > > Tamar > > > > > > > > gcc/ChangeLog: > > > > > > > > * tree-vect-slp.c (vect_optimize_slp): Promote permutes. > > > > > > > > > > > > > > -- > > > Richard Biener <rguent...@suse.de> > > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 > > > Nuernberg, Germany; GF: Felix Imend > > > > -- > Richard Biener <rguent...@suse.de> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 > Nuernberg, Germany; GF: Felix Imend
RE: [PATCH v2 10/18]middle-end simplify lane permutes which selects from loads from the same DR.
Tamar Christina via Gcc-patches Wed, 04 Nov 2020 07:19:04 -0800
- [PATCH v2 10/18]middle-end simplify lane p... Tamar Christina via Gcc-patches
- Re: [PATCH v2 10/18]middle-end simpli... Richard Biener
- RE: [PATCH v2 10/18]middle-end si... Tamar Christina via Gcc-patches
- RE: [PATCH v2 10/18]middle-en... Richard Biener
- RE: [PATCH v2 10/18]middl... Tamar Christina via Gcc-patches