On Wed, 14 Nov 2018, Tamar Christina wrote:

> Hi Richard,
>
> Thanks for the feedback, I've replied inline below.
> I'll wait for your answers before making changes.
I have commented on the other mail so will leave out redundant parts here.

> > -----Original Message-----
> > From: Richard Biener <richard.guent...@gmail.com>
> > Sent: Wednesday, November 14, 2018 12:21
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; nd <n...@arm.com>; Richard
> > Guenther <rguent...@suse.de>; Zdenek Dvorak <o...@ucw.cz>; Richard
> > Earnshaw <richard.earns...@arm.com>; James Greenhalgh
> > <james.greenha...@arm.com>; Marcus Shawcroft
> > <marcus.shawcr...@arm.com>
> > Subject: Re: [PATCH 1/9][GCC][AArch64][middle-end] Implement SLP
> > recognizer for Complex addition with rotate and complex MLA with rotation
> >
> > On Sun, Nov 11, 2018 at 11:26 AM Tamar Christina
> > <tamar.christ...@arm.com> wrote:
> > >
> > > Hi All,
> > >
> > > This patch adds support for SLP vectorization of complex number
> > > arithmetic with rotations along the Argand plane.
> > >
> > > For this to work it has to recognize two statements in parallel, as
> > > it needs to match against operations on both the real and imaginary
> > > parts.  The instructions generated by this change are also only
> > > available in their vector forms.  As such I add them as pattern
> > > statements to the stmt.  The BB is left alone and so the scalar loop
> > > is untouched.
> > >
> > > The instructions also require the loads to be contiguous, so when a
> > > match is made and the code decides it is able to do the replacement,
> > > it re-organizes the SLP tree such that the loads are contiguous.
> > > Since this operation cannot be undone, it only does this if it is
> > > sure that the resulting loads can be made contiguous.
> > >
> > > It gets this guarantee by only allowing the replacement if, between
> > > the matched expression and the loads, there are no other expressions
> > > it doesn't know about aside from type casts.
> > > When a match occurs over multiple expressions, the dead statements
> > > are immediately removed from the tree to prevent verification
> > > failures later.
> > >
> > > Because the pattern matching is done after SLP analysis has analyzed
> > > the usage of the instructions, it also marks the new instructions as
> > > used and the old ones as unused.
> > >
> > > When a replacement is done, a new internal function is generated
> > > which the back-end has to expand to the proper instruction
> > > sequences.
> > >
> > > For now, this patch only adds support for complex addition with
> > > rotate and complex FMLA with rotations of 0 and 180.  However, the
> > > intention is to add support for complex subtraction and complex
> > > multiplication in the future.
> > >
> > > Concretely, this generates
> > >
> > >         ldr q1, [x1, x3]
> > >         ldr q2, [x0, x3]
> > >         ldr q0, [x2, x3]
> > >         fcmla v0.2d, v1.2d, v2.2d, #180
> > >         fcmla v0.2d, v1.2d, v2.2d, #90
> > >         str q0, [x2, x3]
> > >         add x3, x3, 16
> > >         cmp x3, 3200
> > >         bne .L2
> > >         ret
> > >
> > > now instead of
> > >
> > >         add x3, x2, 31
> > >         sub x4, x3, x1
> > >         sub x3, x3, x0
> > >         cmp x4, 62
> > >         mov x4, 62
> > >         ccmp x3, x4, 0, hi
> > >         bls .L5
> > >         mov x3, x0
> > >         mov x0, x1
> > >         add x1, x2, 3200
> > >         .p2align 3,,7
> > > .L3:
> > >         ld2 {v16.2d - v17.2d}, [x2]
> > >         ld2 {v2.2d - v3.2d}, [x3], 32
> > >         ld2 {v0.2d - v1.2d}, [x0], 32
> > >         mov v7.16b, v17.16b
> > >         fmul v6.2d, v0.2d, v3.2d
> > >         fmla v7.2d, v1.2d, v3.2d
> > >         fmla v6.2d, v1.2d, v2.2d
> > >         fmls v7.2d, v2.2d, v0.2d
> > >         fadd v4.2d, v6.2d, v16.2d
> > >         mov v5.16b, v7.16b
> > >         st2 {v4.2d - v5.2d}, [x2], 32
> > >         cmp x2, x1
> > >         bne .L3
> > >         ret
> > > .L5:
> > >         add x4, x2, 8
> > >         add x6, x0, 8
> > >         add x5, x1, 8
> > >         mov x3, 0
> > >         .p2align 3,,7
> > > .L2:
> > >         ldr d1, [x6, x3]
> > >         ldr d4, [x1, x3]
> > >         ldr d5, [x5, x3]
> > >         ldr d3, [x0, x3]
> > >         fmul d2, d4, d1
> > >         ldr d0, [x4, x3]
> > >         fmadd d0, d5, d1, d0
> > >         ldr d1, [x2, x3]
> > >         fmadd d2, d5, d3, d2
> > >         fmsub d0, d4, d3, d0
> > >         fadd d1, d1, d2
> > >         str d1, [x2, x3]
> > >         str d0, [x4, x3]
> > >         add x3, x3, 16
> > >         cmp x3, 3200
> > >         bne .L2
> > >         ret
> > >
> > > Bootstrap and regtests on aarch64-none-linux-gnu, arm-none-gnueabihf
> > > and x86_64-pc-linux-gnu are still ongoing, but the previous patch
> > > showed no regressions.
> > >
> > > The instructions have also been tested on aarch64-none-elf and
> > > arm-none-eabi on an Armv8.3-A model with -march=Armv8.3-a+fp16, and
> > > all tests pass.
> > >
> > > Ok for trunk?
> >
> > I first have a few high-level questions.  Complex addition when the
> > complex values are in vectors looks trivial to me and maps to vector
> > addition.  Your md.texi description of fcadd mentions a rotation 'm'
> > but doesn't further explain the details - I suppose
> > fcadd@var{m}@var{n}3 really means fcadd90@var{n}3 and
> > fcadd270@var{n}3, thus the rotation being encoded into the pattern
> > name rather than having a mode m and an explicit operand?  If that is
> > so, please list those patterns explicitly and separately.

> Yes that's correct, I'll list them separately.

> > Then I'm not sure why the vectorizer needs to be bothered with this?
> > Can the instructions not be recognized by combine from the "rotate"
> > and the vector add?  That is, what happens if the user writes generic
> > vector code for this?

> The reason I'm doing this in the vectorizer is that the vectorizer has
> determined that it needs to load the values using a LOAD_LANES because
> of how complex numbers are stored in pairs.
>
> You'd have real,imag,real,imag,real,imag in v1 and v2.  And in order
> for it to vectorize the operation
>
>   a + b * I, which becomes
>
>   real3 = real1 - imag2
>   imag3 = imag1 + real2
>
> it has to have vectors of only the real parts and vectors of only the
> imaginary parts.
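In scalar terms the rotated addition above looks like the following (an illustrative sketch, not code from the patch; the function name and layout convention are editorial):

```c
/* Illustrative sketch, not from the patch: complex addition with the
   second operand rotated by 90 degrees, i.e. c = a + b*I, computed
   over interleaved real/imag pairs.  Per pair:
     re(c) = re(a) - im(b)
     im(c) = im(a) + re(b)  */
void
complex_add_rot90 (double *c, const double *a, const double *b,
                   unsigned npairs)
{
  for (unsigned i = 0; i < 2 * npairs; i += 2)
    {
      c[i]     = a[i]     - b[i + 1];  /* real part */
      c[i + 1] = a[i + 1] + b[i];      /* imaginary part */
    }
}
```

Vectorized naively, this loop needs the even/odd split of reals and imaginaries described above; an fcadd-style instruction instead consumes the interleaved vectors directly.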
> However the new instructions expect the original layout of v1 and v2,
> not the permuted ones created by GCC thinking it needs a LOAD_LANES.
> This is why I re-order the loads, to cancel out the permute.  The
> permute is created because of the complex lowering and the operations
> on the lowered complex number's components.  The rotations cause
> operations to be performed on the real component of one complex number
> and the imaginary component of the other.

> > Can you also explicitly spell out what "rotation by 90 or 270" means
> > for the operation?  If I decipher the text OK then only one operand
> > (operand 2) is rotated and operand 1 is unchanged?  Or is the result
> > rotated instead?

> Yes that's correct, I'll clarify that; it's only one operand that is
> rotated, one of the inputs.

> > So it looks like the patch adds a general facility to recognize
> > patterns on the SLP graph after SLP discovery.  This complicates
> > matters a lot it seems.
> > Comments (unordered) about the actual patch:
> >
> > +static slp_tree
> > +vect_match_slp_patterns_1 (slp_tree node, vec_info *vinfo,
> > +                          unsigned int group_size,
> > +                          complex_pattern_t patt_info)
> > +{
> > +  int i;
> > +  stmt_vec_info stmt_info;
> > +  hash_set<stmt_vec_info> exempt;
> > +  auto_vec<tree> v1, v2;
> > +  vec<stmt_vec_info> scalar_stmts = SLP_TREE_SCALAR_STMTS (node);
> > +  bool keep_looking = false;
> > +
> > +  if (group_size != 2)
> > +    return node;
> >
> > this seems quite arbitrary given that once the user unrolls the loop
> > you'll be faced with two complex operations.  Did you at least think
> > about how to generalize this without trying all permutations?

> Hmm, no, I hadn't considered the unrolling case much yet, but it does
> indeed present an issue with matching.
> An approach that could work is doing a light pass over all the
> operations and ordering them based on where the first operand sits in
> the load chain.
> Since the first vector should be getting used sequentially, any two
> pairs of operations with sequential even and odd locations in the load
> chain are possible combinations.

> > What about vector ISAs where two or more complex numbers fit inside a
> > vector?  (SVE?)

> We already have that with NEON: a V4SF will contain two complex
> numbers, and a V8HF four.  My expectation is that SVE should just work,
> as the operations have the same semantics for it.  It'll just have to
> use one of the sizeless vector types, but since SLP vectorization
> doesn't seem to happen at the moment for this sequence I couldn't test
> it yet.

> > +  /* If we haven't matched anything, look deeper.  */
> > +  if (keep_looking)
> > +    {
> > +      slp_tree child;
> > +      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > +       SLP_TREE_CHILDREN (node)[i]
> > +         = vect_match_slp_patterns_1 (child, vinfo, group_size, patt_info);
> > +    }
> >
> > usually pattern matching should work from defs to uses, thus recurse
> > first?  And,

> Ah yes, I'll change that.

> > +  keep_looking = gimple_assign_cast_p (stmt_0)
> > +                || gimple_assign_load_p (stmt_0)
> > +                || gimple_store_p (stmt_0);
> >
> > loads will never have any SLP children.  Stores are always the very
> > first SLP node only.  This effectively means you never look deeper
> > than the very first arithmetic operation, just skipping an eventual
> > widening/shortening?!

> Yes, this is a restriction I have put in place for the first version of
> this support.
> The reason is that when I only have complex add, it would match the
> tree partially multiple times, and I hadn't convinced myself at the
> time that this was correct or beneficial.
>
> I did want to revisit it, as my plan for the next version (GCC 10) is
> to recognize add, subtract and multiply, and use combine to recognize
> the multiply and add.

> > +  /* Now push new statements.  */
> > +  SLP_TREE_SCALAR_STMTS (node).truncate (0);
> > +
> > +  gcall *call_stmt_0
> > +    = vect_create_complex_patt_stmt (node, opstmt_info_0, ifn,
> > +                                    v1, vinfo, type, vectype);
> > +  gcall *call_stmt_1
> > +    = vect_create_complex_patt_stmt (node, opstmt_info_1, ifn,
> > +                                    v2, vinfo, type, vectype);
> >
> > do not perform the SLP tree modification in those helpers, that looks
> > odd.  In that function I see you do sth like
> >
> > +  vect_mark_pattern_stmts (stmt_info, call_stmt, vectype);
> > +  STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
> > +  STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;
> >
> > I don't think that will fly given that the decision on whether the
> > stmt is used only by SLP is done only later (hybrid SLP).  Usually
> > patterns allow the original pieces still to be used; not sure why you
> > try to be clever and not allow this.  They might end up being used by
> > other SLP instances as well, which are either already detected or
> > will only be discovered later.

> True.  The intention is that the tree created is only a pure SLP tree,
> and I explicitly mark the statements as required because of the
> changing of the load permute.  If the original operations were to be
> used inside the SLP tree then the result would be wrong: each operation
> would no longer work on groups of real/imag numbers but instead on
> pairs of real+imag numbers.
>
> Also, not marking the statements as used ends up having the
> vectorizable_operation call determine that they're irrelevant, and they
> are just dropped from the output entirely.

> > +  /* Re-order the loads, first in the SLP tree.  */
> > +  vect_delete_slp_nodes (&node, node);
> > +  vect_balance_load_nodes (node, group_size);
> > +  SLP_TREE_TWO_OPERATORS (node) = false;
> >
> > +static bool
> > +vect_is_complex_transform_valid (slp_tree node,
> > +                                hash_set<stmt_vec_info> exempt)
> > +{
> >
> > this function doesn't look generic - why's it not simply called from
> > the pattern recognition function?  I also do not understand why there
> > cannot be any operations between the complex add and the loads?

> It is only called from vect_match_slp_patterns_1.

> > I also do not understand why there cannot be any operations between
> > the complex add and the loads?

> As I mentioned above, I was being conservative here.  My fear was that
> partially replacing instructions and undoing the permutes may give the
> wrong results if there's an instruction left over that was expecting
> the permute to be there; I've expanded on this below.

> > IMHO vect_delete_slp_nodes_1 shouldn't do any stmt removal (there's
> > nothing to DSE, only intermediate scalar operations become dead).

> The problem is it doesn't see them as dead.  In the end it still
> creates vector statements for them, and then it'll ICE later because
> (from what I can gather) the reordering of the loads makes it create
> constant vectors (vector_cst_), so the verification fails as (I think)
> it hasn't seen all of the definitions in the order it expected for
> those instructions (e.g. definition before use).

I think you're doing sth wrong then and this papers over the issue.

> > Again you are hard-coding knowledge of the specific patterns in this
> > function; instead the pattern recognition function should deal with
> > this.  Note the SLP tree is now a graph so there can be multiple uses
> > of a node.

> I could unlink the node from the matched parent instead of deleting it
> if it has more than one use.

> > Overall I'm not convinced the SLP pattern matching will work in the
> > way you implemented it.
> > At least it has to be moved later, until after vect_detect_hybrid_slp
> > where we know whether stmts are also used by non-SLP parts and when
> > we have discovered all SLP instances.

> The reason I haven't done this after vect_detect_hybrid is because
> vect_analyze_slp_instance is where SLP_TREE_LOAD_PERMUTATION is set up,
> and where it decides to cancel the SLP build if load/store lanes are
> supported.  A critical part of this whole thing is that the permute
> should no longer be done.

Yes, but you can't do what you do before vect_detect_hybrid ... See the
other mail where I point out that store/load-lane behavior.  I don't
think you can have it in a way that you know whether you have pure or
hybrid SLP during pattern detection.  That means it has to work for both
cases.

> > Then the pattern detection needs to apply more broadly, thus you
> > should relax the constraints you put on earlier and later operations.

> I can do that; like I said, they were just very conservative.  I think
> they should work and I can relax most of them.

At least the infrastructure should be written in a way that I can plug
in different pattern recognizers doing sth completely different, which
means putting almost all logic into the callbacks.

> > I'd not do any of the tying to loads or order of loads - as far as I
> > understand the instructions do not perform loads themselves.

> No, but they do require a particular order of the values in the
> registers.  As before, the normal statements would have created these
> permutes so that they work on groupings of the parts of the complex
> numbers, e.g. subtract a vector of the real parts from a vector of
> imaginary parts.
>
> The new instructions require the vectors to be in their original
> alternating sequence of real/imag.  So you do have to change the
> order; this is why you gain so much in the generated code sequence.

You simply have to refrain from using load/store-lanes.
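The layout question at the heart of this can be sketched in plain C (an illustrative stand-in, not code from the patch): LOAD_LANES (ld2) deinterleaves the pairs, while fcadd/fcmla consume the interleaved order that a plain contiguous load already provides.

```c
/* Illustrative sketch, not from the patch.  Memory holds complex
   values interleaved: re0, im0, re1, im1, ...  A LOAD_LANES-style
   (ld2) access splits them into a vector of reals and a vector of
   imaginaries; fcadd/fcmla instead want the original interleaved
   order, so this even/odd permute must be cancelled.  */
void
deinterleave_pairs (const double *src, double *reals, double *imags,
                    unsigned nelts)
{
  for (unsigned i = 0; i < nelts; i++)
    {
      reals[i] = src[2 * i];      /* even lanes: real parts */
      imags[i] = src[2 * i + 1];  /* odd lanes: imaginary parts */
    }
}
```

Cancelling exactly this even/odd permute is what the re-ordering of the loads achieves.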
> > And your pictures of the rotation effect only mention changed
> > operations, not shuffles.  That is, optimizing shuffles should be
> > done by combine later (or by the much sought SLP shuffle
> > optimization).
> >
> > For the add you probably can get it all done by combine.

> Sure, but add is the exceptional case here, and it's just at the limits
> of combine, requiring you to match two loads, two instructions and a
> store.
>
> Matching just the addition and subtraction won't be correct, as I've
> mentioned above.  Multiplication and FMA will both not work with
> combine.  It would be much nicer to handle them all in the same place
> instead of spread around in target-specific combine patterns.

> > For the FMA did you think about handling it in regular pattern
> > recognition instead?  Your matching is so constrained that this
> > shouldn't be too difficult.

> The regular pattern matching has two issues.  1) I have to match
> against two statements at the same time, and replace both.  As far as I
> can tell, while the regular pattern matching would allow you to match
> against multiple statements by walking the BB, you can only return one
> pattern from the matcher to replace the statement currently being
> scrutinized.
>
> You need to replace both statements and have the operands in an order
> where it will realise that it can combine them into one vector
> instruction instead of trying to unroll the loop to create the vector.

True.

> 2) The regular pattern matching doesn't allow me to change the loads;
> if I end up with a permute, the instruction is going to permute the
> values again.  The answer would be incorrect.

I guess given 1) this is irrelevant.  But yes, the whole point is to
_not_ vectorize using load/store-lanes.
I think that is what you need to fix first, for example by

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index f2bb8da9de2..9e91be28c7c 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -2067,6 +2067,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
       && dr
       && vect_store_lanes_supported (vectype, group_size, false))
     {
       slp_tree load_node;
+      bool all_complex_rotates = true;
       FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (new_instance), i, load_node)
        {
          stmt_vec_info stmt_vinfo = DR_GROUP_FIRST_ELEMENT
@@ -2077,8 +2078,12 @@ vect_analyze_slp_instance (vec_info *vinfo,
               (STMT_VINFO_VECTYPE (stmt_vinfo),
                DR_GROUP_SIZE (stmt_vinfo), false))
            break;
+         if (SLP_TREE_LOAD_PERMUTATION (load_node)
+             && !matches_complex_rotate (SLP_TREE_LOAD_PERMUTATION (load_node)))
+           all_complex_rotates = false;
        }
-      if (i == SLP_INSTANCE_LOADS (new_instance).length ())
+      if (!all_complex_rotates
+         && i == SLP_INSTANCE_LOADS (new_instance).length ())
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,

and implementing matches_complex_rotate accordingly.

Richard.

> Kind Regards,
> Tamar

> > Thanks,
> > Richard.
> >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
> > >
> > >         * doc/md.texi (fcadd, fcmla): New.
> > >         * doc/sourcebuild.texi (vect_complex_rot_): New.
> > >         * internal-fn.def (FCOMPLEX_ADD_ROT90): New.
> > >         (FCOMPLEX_ADD_ROT270): New.
> > >         (FCOMPLEX_FMA_ROT0): New.
> > >         (FCOMPLEX_FMA_ROT180): New.
> > >         * optabs.def (fcadd90_optab, fcadd270_optab,
> > >         fcmla0_optab, fcmla180_optab): New.
> > >         * tree-vect-patterns.c (vect_mark_pattern_stmts): Export function.
> > >         * tree-vect-slp.c (vect_create_complex_patt_stmt): New.
> > >         (vect_is_complex_transform_valid): New.
> > >         (vect_reorder_based_on_load_chain): New.
> > >         (vect_determine_place_in_load_1): New.
> > >         (vect_determine_place_in_load): New.
> > >         (vect_balance_load_nodes_1): New.
> > >         (vect_balance_load_nodes): New.
> > >         (vect_match_expression_p): New.
> > >         (vect_detect_pair_op): New.
> > >         (vect_delete_slp_nodes_1): New.
> > >         (vect_delete_slp_nodes): New.
> > >         (vect_match_slp_patterns_1): New.
> > >         (vect_match_call_complex_add): New.
> > >         (vect_match_call_complex_mla_1): New.
> > >         (vect_match_call_complex_mla): New.
> > >         (vect_match_slp_patterns): New.
> > >         (vect_analyze_slp_instance): Use it.
> > >         * tree-vectorizer.h (vect_mark_pattern_stmts): Export function.
> > >
> > > --

--
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nuernberg)