On Wed, 14 Nov 2018, Tamar Christina wrote:

> Hi Richard,
>
> Thanks for the feedback, I've replied inline below.
> I'll wait for your answers before making changes.
I have commented on the other mail so will leave out redundant parts here.

> > -----Original Message-----
> > From: Richard Biener <richard.guent...@gmail.com>
> > Sent: Wednesday, November 14, 2018 12:21
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; nd <n...@arm.com>; Richard
> > Guenther <rguent...@suse.de>; Zdenek Dvorak <o...@ucw.cz>; Richard
> > Earnshaw <richard.earns...@arm.com>; James Greenhalgh
> > <james.greenha...@arm.com>; Marcus Shawcroft
> > <marcus.shawcr...@arm.com>
> > Subject: Re: [PATCH 1/9][GCC][AArch64][middle-end] Implement SLP
> > recognizer for Complex addition with rotate and complex MLA with rotation
> >
> > On Sun, Nov 11, 2018 at 11:26 AM Tamar Christina
> > <tamar.christ...@arm.com> wrote:
> > >
> > > Hi All,
> > >
> > > This patch adds support for SLP vectorization of complex number
> > > arithmetic with rotations along the Argand plane.
> > >
> > > For this to work it has to recognize two statements in parallel, as
> > > it needs to match against operations on both the real and imaginary
> > > parts.  The instructions generated by this change are also only
> > > available in their vector forms.  As such I add them as pattern
> > > statements to the stmt.  The BB is left alone and so the scalar loop
> > > is untouched.
> > >
> > > The instructions also require the loads to be contiguous, so when a
> > > match is made and the code decides it is able to do the replacement,
> > > it re-organizes the SLP tree such that the loads are contiguous.
> > > Since this operation cannot be undone, it only does this if it is
> > > sure that the resulting loads can be made contiguous.
> > >
> > > It gets this guarantee by only allowing the replacement if, between
> > > the matched expression and the loads, there are no other expressions
> > > it doesn't know about aside from type casts.
> > > When a match occurs over multiple expressions, the dead statements
> > > are immediately removed from the tree to prevent verification
> > > failures later.
> > >
> > > Because the pattern matching is done after SLP analysis has analyzed
> > > the usage of the instructions, it also marks the new instructions as
> > > used and the old ones as unused.
> > >
> > > When a replacement is done, a new internal function is generated
> > > which the back-end has to expand to the proper instruction
> > > sequences.
> > >
> > > For now, this patch only adds support for complex addition with
> > > rotate and complex FMLA with rotations of 0 and 180.  However, the
> > > intention is to add support for complex subtraction and complex
> > > multiplication in the future.
> > >
> > > Concretely, this generates
> > >
> > >         ldr q1, [x1, x3]
> > >         ldr q2, [x0, x3]
> > >         ldr q0, [x2, x3]
> > >         fcmla v0.2d, v1.2d, v2.2d, #180
> > >         fcmla v0.2d, v1.2d, v2.2d, #90
> > >         str q0, [x2, x3]
> > >         add x3, x3, 16
> > >         cmp x3, 3200
> > >         bne .L2
> > >         ret
> > >
> > > now instead of
> > >
> > >         add x3, x2, 31
> > >         sub x4, x3, x1
> > >         sub x3, x3, x0
> > >         cmp x4, 62
> > >         mov x4, 62
> > >         ccmp x3, x4, 0, hi
> > >         bls .L5
> > >         mov x3, x0
> > >         mov x0, x1
> > >         add x1, x2, 3200
> > >         .p2align 3,,7
> > > .L3:
> > >         ld2 {v16.2d - v17.2d}, [x2]
> > >         ld2 {v2.2d - v3.2d}, [x3], 32
> > >         ld2 {v0.2d - v1.2d}, [x0], 32
> > >         mov v7.16b, v17.16b
> > >         fmul v6.2d, v0.2d, v3.2d
> > >         fmla v7.2d, v1.2d, v3.2d
> > >         fmla v6.2d, v1.2d, v2.2d
> > >         fmls v7.2d, v2.2d, v0.2d
> > >         fadd v4.2d, v6.2d, v16.2d
> > >         mov v5.16b, v7.16b
> > >         st2 {v4.2d - v5.2d}, [x2], 32
> > >         cmp x2, x1
> > >         bne .L3
> > >         ret
> > > .L5:
> > >         add x4, x2, 8
> > >         add x6, x0, 8
> > >         add x5, x1, 8
> > >         mov x3, 0
> > >         .p2align 3,,7
> > > .L2:
> > >         ldr d1, [x6, x3]
> > >         ldr d4, [x1, x3]
> > >         ldr d5, [x5, x3]
> > >         ldr d3, [x0, x3]
> > >         fmul d2, d4, d1
> > >         ldr d0, [x4, x3]
> > >         fmadd d0, d5, d1, d0
> > >         ldr d1, [x2, x3]
> > >         fmadd d2, d5, d3, d2
> > >         fmsub d0, d4, d3, d0
> > >         fadd d1, d1, d2
> > >         str d1, [x2, x3]
> > >         str d0, [x4, x3]
> > >         add x3, x3, 16
> > >         cmp x3, 3200
> > >         bne .L2
> > >         ret
> > >
> > > Bootstrap and regtests on aarch64-none-linux-gnu, arm-none-gnueabihf
> > > and x86_64-pc-linux-gnu are still ongoing, but the previous patch
> > > showed no regressions.
> > >
> > > The instructions have also been tested on aarch64-none-elf and
> > > arm-none-eabi on an Armv8.3-A model with -march=Armv8.3-a+fp16, and
> > > all tests pass.
> > >
> > > Ok for trunk?
> >
> > I first have a few high-level questions.  Complex addition when the
> > complex values are in vectors looks trivial to me and maps to vector
> > addition.  Your md.texi description of fcadd mentions a rotation 'm'
> > but doesn't further explain the details - I suppose
> > fcadd@var{m}@var{n}3 really means fcadd90@var{n}3 and
> > fcadd270@var{n}3, thus the rotation being encoded into the pattern
> > name rather than having a mode m and an explicit operand?  If that is
> > so, please list those patterns explicitly and separately.

> Yes that's correct, I'll list them separately.

> > Then I'm not sure why the vectorizer needs to be bothered with this?
> > Can the instructions not be recognized by combine from the "rotate"
> > and the vector add?  That is, what happens if the user writes generic
> > vector code for this?

> The reason I'm doing this in the vectorizer is that the vectorizer has
> determined that it needs to load the values using a LOAD_LANES because
> of how complex numbers are stored in pairs.
>
> You'd have real,imag,real,imag,real,imag in v1 and v2.  And in order
> for it to vectorize the operation
>
>   a + b * I, which becomes
>
>   real3 = real1 - imag2
>   imag3 = imag1 + real2
>
> it has to have vectors of only the real parts and vectors of only the
> imaginary parts.
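In scalar terms the rotated addition above looks like the following (an illustrative sketch, not code from the patch; the function name and layout convention are editorial):

```c
/* Illustrative sketch, not from the patch: complex addition with the
   second operand rotated by 90 degrees, i.e. c = a + b*I, computed
   over interleaved real/imag pairs.  Per pair:
     re(c) = re(a) - im(b)
     im(c) = im(a) + re(b)  */
void
complex_add_rot90 (double *c, const double *a, const double *b,
                   unsigned npairs)
{
  for (unsigned i = 0; i < 2 * npairs; i += 2)
    {
      c[i]     = a[i]     - b[i + 1];  /* real part */
      c[i + 1] = a[i + 1] + b[i];      /* imaginary part */
    }
}
```

Vectorized naively, this loop needs the even/odd split of reals and imaginaries described above; an fcadd-style instruction instead consumes the interleaved vectors directly.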
> However the new instructions expect the original layout of v1 and v2,
> not the permuted ones created by GCC thinking it needs a LOAD_LANES.
> This is why I re-order the loads, to cancel out the permute.  The
> permute is created because of the complex lowering and the operations
> on the lowered complex number's components.  The rotations cause
> operations to be performed on the real component of one complex number
> and the imaginary component of the other.

> > Can you also explicitly spell out what "rotation by 90 or 270" means
> > for the operation?  If I decipher the text OK then only one operand
> > (operand 2) is rotated and operand 1 is unchanged?  Or is the result
> > rotated instead?

> Yes that's correct, I'll clarify that; it's only one operand that is
> rotated, one of the inputs.

> > So it looks like the patch adds a general facility to recognize
> > patterns on the SLP graph after SLP discovery.  This complicates
> > matters a lot it seems.
> > Comments (unordered) about the actual patch:
> >
> > +static slp_tree
> > +vect_match_slp_patterns_1 (slp_tree node, vec_info *vinfo,
> > +                          unsigned int group_size,
> > +                          complex_pattern_t patt_info)
> > +{
> > +  int i;
> > +  stmt_vec_info stmt_info;
> > +  hash_set<stmt_vec_info> exempt;
> > +  auto_vec<tree> v1, v2;
> > +  vec<stmt_vec_info> scalar_stmts = SLP_TREE_SCALAR_STMTS (node);
> > +  bool keep_looking = false;
> > +
> > +  if (group_size != 2)
> > +    return node;
> >
> > this seems quite arbitrary given that once the user unrolls the loop
> > you'll be faced with two complex operations.  Did you at least think
> > about how to generalize this without trying all permutations?

> Hmm, no, I hadn't considered the unrolling case much yet, but it does
> indeed present an issue with matching.
> An approach that could work is doing a light pass over all the
> operations and ordering them based on where the first operand sits in
> the load chain.
> Since the first vector should be getting used sequentially, any two
> pairs of operations with sequential even and odd locations in the load
> chain are possible combinations.

> > What about vector ISAs where two or more complex numbers fit inside a
> > vector?  (SVE?)

> We already have that with NEON: a V4SF will contain two complex
> numbers, and a V8HF four.  My expectation is that SVE should just work,
> as the operations have the same semantics for it.  It'll just have to
> use one of the sizeless vector types, but since SLP vectorization
> doesn't seem to happen at the moment for this sequence I couldn't test
> it yet.

> > +  /* If we haven't matched anything, look deeper.  */
> > +  if (keep_looking)
> > +    {
> > +      slp_tree child;
> > +      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > +       SLP_TREE_CHILDREN (node)[i]
> > +         = vect_match_slp_patterns_1 (child, vinfo, group_size, patt_info);
> > +    }
> >
> > usually pattern matching should work from defs to uses, thus recurse
> > first?  And,

> Ah yes, I'll change that.

> > +  keep_looking = gimple_assign_cast_p (stmt_0)
> > +                || gimple_assign_load_p (stmt_0)
> > +                || gimple_store_p (stmt_0);
> >
> > loads will never have any SLP children.  Stores are always the very
> > first SLP node only.  This effectively means you never look deeper
> > than the very first arithmetic operation, just skipping an eventual
> > widening/shortening?!

> Yes, this is a restriction I have put in place for the first version of
> this support.
> The reason is that when I only have complex add, it would match the
> tree partially multiple times, and I hadn't convinced myself at the
> time that this was correct or beneficial.
>
> I did want to revisit it, as my plan for the next version (GCC 10) is
> to recognize add, subtract and multiply, and use combine to recognize
> the multiply and add.

> > +  /* Now push new statements.  */
> > +  SLP_TREE_SCALAR_STMTS (node).truncate (0);
> > +
> > +  gcall *call_stmt_0
> > +    = vect_create_complex_patt_stmt (node, opstmt_info_0, ifn,
> > +                                    v1, vinfo, type, vectype);
> > +  gcall *call_stmt_1
> > +    = vect_create_complex_patt_stmt (node, opstmt_info_1, ifn,
> > +                                    v2, vinfo, type, vectype);
> >
> > do not perform the SLP tree modification in those helpers, that looks
> > odd.  In that function I see you do sth like
> >
> > +  vect_mark_pattern_stmts (stmt_info, call_stmt, vectype);
> > +  STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
> > +  STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;
> >
> > I don't think that will fly given that the decision on whether the
> > stmt is used only by SLP is done only later (hybrid SLP).  Usually
> > patterns allow the original pieces still to be used; not sure why you
> > try to be clever and not allow this.  They might end up being used by
> > other SLP instances as well, which are either already detected or
> > will only be discovered later.

> True.  The intention is that the tree created is only a pure SLP tree,
> and I explicitly mark the statements as required because of the
> changing of the load permute.  If the original operations were to be
> used inside the SLP tree then the result would be wrong: each operation
> would no longer work on groups of real/imag numbers but instead on
> pairs of real+imag numbers.
>
> Also, not marking the statements as used ends up having the
> vectorizable_operation call determine that they're irrelevant, and they
> are just dropped from the output entirely.

> > +  /* Re-order the loads, first in the SLP tree.  */
> > +  vect_delete_slp_nodes (&node, node);
> > +  vect_balance_load_nodes (node, group_size);
> > +  SLP_TREE_TWO_OPERATORS (node) = false;
> >
> > +static bool
> > +vect_is_complex_transform_valid (slp_tree node,
> > +                                hash_set<stmt_vec_info> exempt)
> > +{
> >
> > this function doesn't look generic - why's it not simply called from
> > the pattern recognition function?  I also do not understand why there
> > cannot be any operations between the complex add and the loads?

> It is only called from vect_match_slp_patterns_1.

> > I also do not understand why there cannot be any operations between
> > the complex add and the loads?

> As I mentioned above, I was being conservative here.  My fear was that
> partially replacing instructions and undoing the permutes may give the
> wrong results if there's an instruction left over that was expecting
> the permute to be there; I've expanded on this below.

> > IMHO vect_delete_slp_nodes_1 shouldn't do any stmt removal (there's
> > nothing to DSE, only intermediate scalar operations become dead).

> The problem is it doesn't see them as dead.  In the end it still
> creates vector statements for them, and then it'll ICE later because
> (from what I can gather) the reordering of the loads makes it create
> constant vectors (vector_cst_), so the verification fails as (I think)
> it hasn't seen all of the definitions in the order it expected for
> those instructions (e.g. definition before use).

I think you're doing sth wrong then and this papers over the issue.

> > Again you are hard-coding knowledge of the specific patterns in this
> > function; instead the pattern recognition function should deal with
> > this.  Note the SLP tree is now a graph so there can be multiple uses
> > of a node.

> I could unlink the node from the matched parent instead of deleting it
> if it has more than one use.

> > Overall I'm not convinced the SLP pattern matching will work in the
> > way you implemented it.
> > At least it has to be moved later, until after vect_detect_hybrid_slp
> > where we know whether stmts are also used by non-SLP parts and when
> > we have discovered all SLP instances.

> The reason I haven't done this after vect_detect_hybrid is because
> vect_analyze_slp_instance is where SLP_TREE_LOAD_PERMUTATION is set up,
> and where it decides to cancel the SLP build if load/store lanes are
> supported.  A critical part of this whole thing is that the permute
> should no longer be done.

Yes, but you can't do what you do before vect_detect_hybrid ... See the
other mail where I point out that store/load-lane behavior.  I don't
think you can have it in a way that you know whether you have pure or
hybrid SLP during pattern detection.  That means it has to work for both
cases.

> > Then the pattern detection needs to apply more broadly, thus you
> > should relax the constraints you put on earlier and later operations.

> I can do that; like I said, they were just very conservative.  I think
> they should work and I can relax most of them.

At least the infrastructure should be written in a way that I can plug
in different pattern recognizers doing sth completely different, which
means putting almost all logic into the callbacks.

> > I'd not do any of the tying to loads or order of loads - as far as I
> > understand the instructions do not perform loads themselves.

> No, but they do require a particular order of the values in the
> registers.  As before, the normal statements would have created these
> permutes so that they work on groupings of the parts of the complex
> numbers, e.g. subtract a vector of the real parts from a vector of
> imaginary parts.
>
> The new instructions require the vectors to be in their original
> alternating sequence of real/imag.  So you do have to change the
> order; this is why you gain so much in the generated code sequence.

You simply have to refrain from using load/store-lanes.
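The layout question at the heart of this can be sketched in plain C (an illustrative stand-in, not code from the patch): LOAD_LANES (ld2) deinterleaves the pairs, while fcadd/fcmla consume the interleaved order that a plain contiguous load already provides.

```c
/* Illustrative sketch, not from the patch.  Memory holds complex
   values interleaved: re0, im0, re1, im1, ...  A LOAD_LANES-style
   (ld2) access splits them into a vector of reals and a vector of
   imaginaries; fcadd/fcmla instead want the original interleaved
   order, so this even/odd permute must be cancelled.  */
void
deinterleave_pairs (const double *src, double *reals, double *imags,
                    unsigned nelts)
{
  for (unsigned i = 0; i < nelts; i++)
    {
      reals[i] = src[2 * i];      /* even lanes: real parts */
      imags[i] = src[2 * i + 1];  /* odd lanes: imaginary parts */
    }
}
```

Cancelling exactly this even/odd permute is what the re-ordering of the loads achieves.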
> > And your pictures of the rotation effect only mention changed
> > operations, not shuffles.  That is, optimizing shuffles should be
> > done by combine later (or by the much sought SLP shuffle
> > optimization).
> >
> > For the add you probably can get it all done by combine.

> Sure, but add is the exceptional case here, and it's just at the limits
> of combine, requiring you to match two loads, two instructions and a
> store.
>
> Matching just the addition and subtraction won't be correct, as I've
> mentioned above.  Multiplication and FMA will both not work with
> combine.  It would be much nicer to handle them all in the same place
> instead of spread around in target-specific combine patterns.

> > For the FMA did you think about handling it in regular pattern
> > recognition instead?  Your matching is so constrained that this
> > shouldn't be too difficult.

> The regular pattern matching has two issues.  1) I have to match
> against two statements at the same time, and replace both.  As far as I
> can tell, while the regular pattern matching would allow you to match
> against multiple statements by walking the BB, you can only return one
> pattern from the matcher to replace the statement currently being
> scrutinized.
>
> You need to replace both statements and have the operands in an order
> where it will realise that it can combine them into one vector
> instruction instead of trying to unroll the loop to create the vector.

True.

> 2) The regular pattern matching doesn't allow me to change the loads;
> if I end up with a permute, the instruction is going to permute the
> values again.  The answer would be incorrect.

I guess given 1) this is irrelevant.  But yes, the whole point is to
_not_ vectorize using load/store-lanes.
I think that is what you need to fix first, for example by

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index f2bb8da9de2..9e91be28c7c 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -2067,6 +2067,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
       && dr
       && vect_store_lanes_supported (vectype, group_size, false))
     {
       slp_tree load_node;
+      bool all_complex_rotates = true;
       FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (new_instance), i, load_node)
        {
          stmt_vec_info stmt_vinfo = DR_GROUP_FIRST_ELEMENT
@@ -2077,8 +2078,12 @@ vect_analyze_slp_instance (vec_info *vinfo,
               (STMT_VINFO_VECTYPE (stmt_vinfo),
                DR_GROUP_SIZE (stmt_vinfo), false))
            break;
+         if (SLP_TREE_LOAD_PERMUTATION (load_node)
+             && !matches_complex_rotate (SLP_TREE_LOAD_PERMUTATION (load_node)))
+           all_complex_rotates = false;
        }
-      if (i == SLP_INSTANCE_LOADS (new_instance).length ())
+      if (!all_complex_rotates
+         && i == SLP_INSTANCE_LOADS (new_instance).length ())
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,

and implementing matches_complex_rotate accordingly.

Richard.

> Kind Regards,
> Tamar

> > Thanks,
> > Richard.
> >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
> > >
> > >         * doc/md.texi (fcadd, fcmla): New.
> > >         * doc/sourcebuild.texi (vect_complex_rot_): New.
> > >         * internal-fn.def (FCOMPLEX_ADD_ROT90): New.
> > >         (FCOMPLEX_ADD_ROT270): New.
> > >         (FCOMPLEX_FMA_ROT0): New.
> > >         (FCOMPLEX_FMA_ROT180): New.
> > >         * optabs.def (fcadd90_optab, fcadd270_optab,
> > >         fcmla0_optab, fcmla180_optab): New.
> > >         * tree-vect-patterns.c (vect_mark_pattern_stmts): Export function.
> > >         * tree-vect-slp.c (vect_create_complex_patt_stmt): New.
> > >         (vect_is_complex_transform_valid): New.
> > >         (vect_reorder_based_on_load_chain): New.
> > >         (vect_determine_place_in_load_1): New.
> > >         (vect_determine_place_in_load): New.
> > >         (vect_balance_load_nodes_1): New.
> > >         (vect_balance_load_nodes): New.
> > >         (vect_match_expression_p): New.
> > >         (vect_detect_pair_op): New.
> > >         (vect_delete_slp_nodes_1): New.
> > >         (vect_delete_slp_nodes): New.
> > >         (vect_match_slp_patterns_1): New.
> > >         (vect_match_call_complex_add): New.
> > >         (vect_match_call_complex_mla_1): New.
> > >         (vect_match_call_complex_mla): New.
> > >         (vect_match_slp_patterns): New.
> > >         (vect_analyze_slp_instance): Use it.
> > >         * tree-vectorizer.h (vect_mark_pattern_stmts): Export function.
> > >
> > > --

--
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nuernberg)