On Thu, Dec 14, 2023 at 9:55 PM Di Zhao OS <diz...@os.amperecomputing.com> wrote: > > > > -----Original Message----- > > From: Richard Biener <richard.guent...@gmail.com> > > Sent: Wednesday, December 13, 2023 5:01 PM > > To: Di Zhao OS <diz...@os.amperecomputing.com> > > Cc: gcc-patches@gcc.gnu.org > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in > > get_reassociation_width > > > > On Wed, Dec 13, 2023 at 9:14 AM Di Zhao OS > > <diz...@os.amperecomputing.com> wrote: > > > > > > Hello Richard, > > > > > > > -----Original Message----- > > > > From: Richard Biener <richard.guent...@gmail.com> > > > > Sent: Monday, December 11, 2023 7:01 PM > > > > To: Di Zhao OS <diz...@os.amperecomputing.com> > > > > Cc: gcc-patches@gcc.gnu.org > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in > > > > get_reassociation_width > > > > > > > > On Wed, Nov 29, 2023 at 3:36 PM Di Zhao OS > > > > <diz...@os.amperecomputing.com> wrote: > > > > > > > > > > > -----Original Message----- > > > > > > From: Richard Biener <richard.guent...@gmail.com> > > > > > > Sent: Tuesday, November 21, 2023 9:01 PM > > > > > > To: Di Zhao OS <diz...@os.amperecomputing.com> > > > > > > Cc: gcc-patches@gcc.gnu.org > > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in > > > > > > get_reassociation_width > > > > > > > > > > > > On Thu, Nov 9, 2023 at 6:53 PM Di Zhao OS > > <diz...@os.amperecomputing.com> > > > > > > wrote: > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Richard Biener <richard.guent...@gmail.com> > > > > > > > > Sent: Tuesday, October 31, 2023 9:48 PM > > > > > > > > To: Di Zhao OS <diz...@os.amperecomputing.com> > > > > > > > > Cc: gcc-patches@gcc.gnu.org > > > > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA > > > > > > > > in > > > > > > > > get_reassociation_width > > > > > > > > > > > > > > > > On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS > > > > <diz...@os.amperecomputing.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Attached is a new version of the patch. > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > > From: Richard Biener <richard.guent...@gmail.com> > > > > > > > > > > Sent: Friday, October 6, 2023 5:33 PM > > > > > > > > > > To: Di Zhao OS <diz...@os.amperecomputing.com> > > > > > > > > > > Cc: gcc-patches@gcc.gnu.org > > > > > > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider > > FMA in > > > > > > > > > > get_reassociation_width > > > > > > > > > > > > > > > > > > > > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS > > > > > > > > > > <diz...@os.amperecomputing.com> wrote: > > > > > > > > > > > > > > > > > > > > > > This is a new version of the patch on "nested FMA". > > > > > > > > > > > Sorry for updating this after so long, I've been studying > > and > > > > > > > > > > > writing micro cases to sort out the cause of the > > > > > > > > > > > regression. > > > > > > > > > > > > > > > > > > > > Sorry for taking so long to reply. > > > > > > > > > > > > > > > > > > > > > First, following previous discussion: > > > > > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023- > > > > > > September/629080.html) > > > > > > > > > > > > > > > > > > > > > > 1. From testing more altered cases, I don't think the > > > > > > > > > > > problem is that reassociation works locally. In that: > > > > > > > > > > > > > > > > > > > > > > 1) On the example with multiplications: > > > > > > > > > > > > > > > > > > > > > > tmp1 = a + c * c + d * d + x * y; > > > > > > > > > > > tmp2 = x * tmp1; > > > > > > > > > > > result += (a + c + d + tmp2); > > > > > > > > > > > > > > > > > > > > > > Given "result" rewritten by width=2, the performance is > > > > > > > > > > > worse if we rewrite "tmp1" with width=2. In contrast, if > > we > > > > > > > > > > > remove the multiplications from the example (and make > > "tmp1" > > > > > > > > > > > not singe used), and still rewrite "result" by width=2, > > then > > > > > > > > > > > rewriting "tmp1" with width=2 is better. (Make sense > > because > > > > > > > > > > > the tree's depth at "result" is still smaller if we > > rewrite > > > > > > > > > > > "tmp1".) > > > > > > > > > > > > > > > > > > > > > > 2) I tried to modify the assembly code of the example > > without > > > > > > > > > > > FMA, so the width of "result" is 4. On Ampere1 there's > > > > > > > > > > > no > > > > > > > > > > > obvious improvement. So although this is an interesting > > > > > > > > > > > problem, it doesn't seem like the cause of the > > > > > > > > > > > regression. > > > > > > > > > > > > > > > > > > > > OK, I see. > > > > > > > > > > > > > > > > > > > > > 2. From assembly code of the case with FMA, one problem is > > > > > > > > > > > that, rewriting "tmp1" to parallel didn't decrease the > > > > > > > > > > > minimum CPU cycles (taking MULT_EXPRs into account), but > > > > > > > > > > > increased code size, so the overhead is increased. > > > > > > > > > > > > > > > > > > > > > > a) When "tmp1" is not re-written to parallel: > > > > > > > > > > > fmadd d31, d2, d2, d30 > > > > > > > > > > > fmadd d31, d3, d3, d31 > > > > > > > > > > > fmadd d31, d4, d5, d31 //"tmp1" > > > > > > > > > > > fmadd d31, d31, d4, d3 > > > > > > > > > > > > > > > > > > > > > > b) When "tmp1" is re-written to parallel: > > > > > > > > > > > fmul d31, d4, d5 > > > > > > > > > > > fmadd d27, d2, d2, d30 > > > > > > > > > > > fmadd d31, d3, d3, d31 > > > > > > > > > > > fadd d31, d31, d27 //"tmp1" > > > > > > > > > > > fmadd d31, d31, d4, d3 > > > > > > > > > > > > > > > > > > > > > > For version a), there are 3 dependent FMAs to calculate > > "tmp1". > > > > > > > > > > > For version b), there are also 3 dependent instructions in > > the > > > > > > > > > > > longer path: the 1st, 3rd and 4th. > > > > > > > > > > > > > > > > > > > > Yes, it doesn't really change anything. The patch has > > > > > > > > > > > > > > > > > > > > + /* If there's code like "acc = a * b + c * d + acc" in a > > tight > > > > loop, > > > > > > > > some > > > > > > > > > > + uarchs can execute results like: > > > > > > > > > > + > > > > > > > > > > + _1 = a * b; > > > > > > > > > > + _2 = .FMA (c, d, _1); > > > > > > > > > > + acc_1 = acc_0 + _2; > > > > > > > > > > + > > > > > > > > > > + in parallel, while turning it into > > > > > > > > > > + > > > > > > > > > > + _1 = .FMA(a, b, acc_0); > > > > > > > > > > + acc_1 = .FMA(c, d, _1); > > > > > > > > > > + > > > > > > > > > > + hinders that, because then the first FMA depends on > > > > > > > > > > the > > > > result > > > > > > > > > > of preceding > > > > > > > > > > + iteration. */ > > > > > > > > > > > > > > > > > > > > I can't see what can be run in parallel for the first case. > > > > The .FMA > > > > > > > > > > depends on the multiplication a * b. Iff the uarch somehow > > > > decomposes > > > > > > > > > > .FMA into multiply + add then the c * d multiply could run > > > > > > > > > > in > > > > parallel > > > > > > > > > > with the a * b multiply which _might_ be able to hide some > > > > > > > > > > of > > the > > > > > > > > > > latency of the full .FMA. Like on x86 Zen FMA has a latency > > of 4 > > > > > > > > > > cycles but a multiply only 3. But I never got confirmation > > from > > > > any > > > > > > > > > > of the CPU designers that .FMAs are issued when the multiply > > > > > > > > > > operands are ready and the add operand can be forwarded. > > > > > > > > > > > > > > > > > > > > I also wonder why the multiplications of the two-FMA > > > > > > > > > > sequence > > > > > > > > > > then cannot be executed at the same time? So I have some > > doubt > > > > > > > > > > of the theory above. > > > > > > > > > > > > > > > > > > The parallel execution for the code snippet above was the > > > > > > > > > other > > > > > > > > > issue (previously discussed here: > > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023- > > August/628960.html). > > > > > > > > > Sorry it's a bit confusing to include that here, but these 2 > > fixes > > > > > > > > > needs to be combined to avoid new regressions. Since > > > > > > > > > considering > > > > > > > > > FMA in get_reassociation_width produces more results of > > > > > > > > > width=1, > > > > > > > > > so there would be more loop depending FMA chains. > > > > > > > > > > > > > > > > > > > Iff this really is the reason for the sequence to execute > > > > > > > > > > with > > > > lower > > > > > > > > > > overall latency and we want to attack this on GIMPLE then I > > think > > > > > > > > > > we need a target hook telling us this fact (I also wonder if > > such > > > > > > > > > > behavior can be modeled in the scheduler pipeline > > > > > > > > > > description > > at > > > > all?) > > > > > > > > > > > > > > > > > > > > > So it seems to me the current get_reassociation_width > > algorithm > > > > > > > > > > > isn't optimal in the presence of FMA. So I modified the > > patch to > > > > > > > > > > > improve get_reassociation_width, rather than check for > > > > > > > > > > > code > > > > > > > > > > > patterns. (Although there could be some other complicated > > > > > > > > > > > factors so the regression is more obvious when there's > > "nested > > > > > > > > > > > FMA". But with this patch that should be avoided or > > > > > > > > > > > reduced.) > > > > > > > > > > > > > > > > > > > > > > With this patch 508.namd_r 1-copy run has 7% improvement > > > > > > > > > > > on > > > > > > > > > > > Ampere1, on Intel Xeon there's about 3%. While I'm still > > > > > > > > > > > collecting data on other CPUs, I'd like to know how do you > > > > > > > > > > > think of this. > > > > > > > > > > > > > > > > > > > > > > About changes in the patch: > > > > > > > > > > > > > > > > > > > > > > 1. When the op list forms a complete FMA chain, try to > > search > > > > > > > > > > > for a smaller width considering the benefit of using FMA. > > With > > > > > > > > > > > a smaller width, the increment of code size is smaller > > > > > > > > > > > when > > > > > > > > > > > breaking the chain. > > > > > > > > > > > > > > > > > > > > But this is all highly target specific (code size even more > > so). > > > > > > > > > > > > > > > > > > > > How I understand your approach to fixing the issue leads me > > > > > > > > > > to > > > > > > > > > > the suggestion to prioritize parallel rewriting, thus alter > > > > > > > > rank_ops_for_fma, > > > > > > > > > > taking the reassoc width into account (the computed width > > should > > > > be > > > > > > > > > > unchanged from rank_ops_for_fma) instead of "fixing up" the > > > > parallel > > > > > > > > > > rewriting of FMAs (well, they are not yet formed of course). > > > > > > > > > > get_reassociation_width has 'get_required_cycles', the above > > > > theory > > > > > > > > > > could be verified with a very simple toy pipeline model. > > > > > > > > > > We'd > > > > have > > > > > > > > > > to ask the target for the reassoc width for MULT_EXPRs as > > > > > > > > > > well > > (or > > > > > > maybe > > > > > > > > > > even FMA_EXPRs). > > > > > > > > > > > > > > > > > > > > Taking the width of FMAs into account when computing the > > reassoc > > > > width > > > > > > > > > > might be another way to attack this. > > > > > > > > > > > > > > > > > > Previously I tried to solve this generally, on the assumption > > that > > > > > > > > > FMA (smaller code size) is preferred. Now I agree it's > > > > > > > > > difficult > > > > > > > > > since: 1) As you mentioned, the latency of FMA, FMUL and FADD > > can > > > > > > > > > be different. 2) From my test result on different machines we > > > > > > > > > have, it seems simply adding the cycles together is not a good > > way > > > > > > > > > to estimate the latency of consecutive FMA. > > > > > > > > > > > > > > > > > > I think an easier way to fix this is to add a parameter to > > suggest > > > > > > > > > the length of complete FMA chain to keep. (It can be set by > > target > > > > > > > > > specific tuning then.) And we can break longer FMA chains for > > > > > > > > > better parallelism. Attached is the new implementation. With > > > > > > > > > max-fma-chain-len=8, there's about 7% improvement in spec2017 > > > > > > > > > 508.namd_r on ampere1, and the overall improvement on fprate > > > > > > > > > is > > > > > > > > > about 1%. > > > > > > > > > > > > > > > > > > Since there's code in rank_ops_for_fma to identify MULT_EXPRs > > from > > > > > > > > > others, I left it before get_reassociation_width so the number > > of > > > > > > > > > MULT_EXPRs can be used. > > > > > > > > > > > > > > > > Sorry again for the delay in replying. > > > > > > > > > > > > > > > > + /* Check if keeping complete FMA chains is preferred. */ > > > > > > > > + if (width > 1 && mult_num >= 2 && param_max_fma_chain_len) > > > > > > > > + { > > > > > > > > + /* num_fma_chain + (num_fma_chain - 1) >= num_plus . */ > > > > > > > > + int num_others = ops_num - mult_num; > > > > > > > > + int num_fma_chain = CEIL (num_others + 1, 2); > > > > > > > > + > > > > > > > > + if (num_fma_chain < width > > > > > > > > + && CEIL (mult_num, num_fma_chain) <= > > param_max_fma_chain_len) > > > > > > > > + width = num_fma_chain; > > > > > > > > + } > > > > > > > > > > > > > > > > so here 'mult_num' serves as a heuristical value how many > > > > > > > > FMAs we could build. If that were close to ops_num - 1 then > > > > > > > > we'd have a chain of FMAs. Not sure how you get at > > > > > > > > num_others / 2 here. Maybe we need to elaborate on what an > > > > > > > > FMA chain is? I thought it is FMA (FMA (FMA (..., b, c), d, > > > > > > > > e), f, > > g) > > > > > > > > where each (b,c) pair is really just one operand in the ops > > > > > > > > array, > > > > > > > > one of the 'mult's. Thus a FMA chain is _not_ > > > > > > > > FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right? > > > > > > > > > > > > > > The "FMA chain" here refers to consecutive FMAs, each taking > > > > > > > The previous one's result as the third operator, i.e. > > > > > > > ... FMA(e, f, FMA(c, d, FMA (a, b, r)))... . So original op > > > > > > > list looks like "r + a * b + c * d + e * f + ...". These FMAs > > > > > > > will end up using the same accumulate register. > > > > > > > > > > > > > > When num_others=2 or 3, there can be 2 complete chains, e.g. > > > > > > > FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) > > > > > > > or > > > > > > > FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i . > > > > > > > And so on, that's where the "CEIL (num_others + 1, 2)" comes from. > > > > > > > > > > > > > > > > > > > > > > > Forming an FMA chain effectively reduces the reassociation width > > > > > > > > of the participating multiplies. If we were not to form FMAs > > > > > > > > all > > > > > > > > the multiplies could execute in parallel. > > > > > > > > > > > > > > > > So what does the above do, in terms of adjusting the > > > > > > > > reassociation > > > > > > > > width for the _adds_, and what's the ripple-down effect on later > > > > > > > > FMA forming? > > > > > > > > > > > > > > > > > > > > > > The above code calculates the number of such FMA chains in the op > > > > > > > list. And if the length of each chain doesn't exceed > > > > > > > param_max_fma_chain_len, then width is set to the number of > > > > > > > chains, > > > > > > > so we won't break them (because rewrite_expr_tree_parallel handles > > > > > > > this well). > > > > > > > > > > > > > > > The change still feels like whack-a-mole playing rather than > > > > understanding > > > > > > > > the fundamental issue on the targets. > > > > > > > > > > > > > > I think the complexity is in how the instructions are piped. > > > > > > > Some Arm CPUs such as Neoverse V2 supports "late-forwarding": > > > > > > > "FP multiply-accumulate pipelines support late-forwarding of > > > > > > > accumulate operands from similar μOPs, allowing a typical > > > > > > > sequence of multiply-accumulate μOPs to issue one every N > > > > > > > cycles". ("N" is smaller than the latency of a single FMA > > > > > > > instruction.) So keeping such FMA chains can utilize such > > > > > > > feature and uses less FP units. I guess the case is similar on > > > > > > > some late X86 CPUs. > > > > > > > > > > > > > > If we try to compute the minimum circles of each option, I think > > > > > > > at least we'll need to know whether the target has similar > > > > > > > feature, and the latency of each uop. While using an > > > > > > > experiential length of beneficial FMA chain could be a shortcut. > > > > > > > (Maybe allowing different lengths for different data widths is > > > > > > > better.) > > > > > > > > > > > > Hm. So even when we can late-forward in an FMA chain > > > > > > increasing the width should typically be still better? > > > > > > > > > > > > _1 = FMA (_2 * _3 + _4); > > > > > > _5 = FMA (_6 * _7 + _1); > > > > > > > > > > > > say with late-forwarding we can hide the latency of the _6 * _7 > > > > > > multiply and the overall latency of the two FMAs above become > > > > > > lat (FMA) + lat (ADD) in the ideal case. Alternatively we do > > > > > > > > > > > > _1 = FMA (_2 * _ 3 + _4); > > > > > > _8 = _6 * _ 7; > > > > > > _5 = _1 + _8; > > > > > > > > > > > > where if the FMA and the multiply can execute in parallel > > > > > > (we have two FMA pipes) the latency would be lat (FMA) + lat (ADD). > > > > > > But when we only have a single pipeline capable of > > > > > > FMA or multiplies then it is at least MIN (lat (FMA) + 1, lat (MUL) > > > > > > + > > 1) > > > > > > + lat (ADD), it depends on luck whether the FMA or the MUL is > > > > > > issued first there. > > > > > > > > > > > > So if late-forward works really well and the add part of the FMA > > > > > > has very low latency compared to the multiplication part having > > > > > > a smaller reassoc width should pay off here and we might be > > > > > > able to simply control this via the existing target hook? > > > > > > > > > > > > I'm not aware of x86 CPUs having late-forwarding capabilities > > > > > > but usually the latency of multiplication and FMA is very similar > > > > > > and one can issue two FMAs and possibly more ADDs in parallel. > > > > > > > > > > > > As said I think this detail (late-forward) should maybe reflected > > > > > > into get_required_cycles, possibly guided by a different > > > > > > targetm.sched.reassociation_width for MULT_EXPR vs PLUS_EXPR? > > > > > > > > > > > > > > > > To my understanding, the question is whether the target fully > > > > > pipelines FMA instructions, so the MULT part can start first if > > > > > its operands are ready. While targetm.sched.reassociation_width > > > > > reflects the number of pipes for some operation, so it can guide > > > > > get_required_cycles for a sequence of identical operations > > > > > (e.g. A * B * C * D or A + B + C + D). Since the problem in > > > > > this case is not the number of pipes for FMA, I think another > > > > > indicator maybe better. > > > > > > > > > > (Currently the fma_reassoc_width for AArch64 is to control > > > > > whether reassociation on FADD is OK. This workaround doesn't > > > > > work well on some cases, for example it turns down reassociation > > > > > even when there's no FMA at all. So I think we'd better not > > > > > follow the schema.) > > > > > > > > > > Attached is a new version of the patch with a flag to indicate > > > > > whether FMA is fully pipelined, and: 1) lat (MUL) >= lat (ADD); > > > > > 2) symmetric units are used or FMUL/FADD/FMA. Otherwise the > > > > > patch may not be beneficial. > > > > > > > > > > It tries to calculate the latencies including MULT_EXPRs. Since > > > > > the code is different with the current code (the quick-search > > > > > part), I haven't included it inside get_required_cycles. > > > > > > > > +; If the flag 'fully-pipelined-fma' is set, reassociation takes into > > account > > > > +; the benifit of parallelizing FMA's multiply part and addition part. > > > > +ffully-pipelined-fma > > > > +Common Var(flag_fully_pipelined_fma) > > > > +Assume the target fully pipelines FMA instruction, and symmetric units > > are > > > > used > > > > +for FMUL/FADD/FMA. > > > > > > > > please use a --param for now, I think targets might want to set this > > > > based > > > > on active core tuning. > > > > > > > > +/* Given that the target fully pipelines FMA instructions, return > > > > latency > > of > > > > + MULT_EXPRs that can't be hided by FMA. WIDTH is the number of > > > > pipes. > > */ > > > > + > > > > > > > > return the latency .. can't be hidden by the FMA > > > > > > > > For documentation purposes it should be stated that mult_num <= ops_num > > > > > > > > + /* If the target fully pipelines FMA instruction, the multiply part > > > > can > > > > start > > > > > > > > instructions > > > > > > > > + first if its operands are ready. Assuming symmetric pipes are > > > > used > > for > > > > > > > > s/first/already/ > > > > > > > > + FMUL/FADD/FMA, then for a sequence of FMA like: > > > > + > > > > + _8 = .FMA (_2, _3, _1); > > > > + _9 = .FMA (_5, _4, _8); > > > > + _10 = .FMA (_7, _6, _9); > > > > + > > > > + , if width=1, the latency is latency(MULT) + latency(ADD)*3. > > > > + While with width=2: > > > > + > > > > + _8 = _4 * _5; > > > > + _9 = .FMA (_2, _3, _1); > > > > + _10 = .FMA (_6, _7, _8); > > > > + _11 = _9 + _10; > > > > + > > > > + , it is latency(MULT)*2 + latency(ADD)*2. Assuming latency(MULT) > > > > <= > > > > + latency(ADD), the previous one is preferred. > > > > > > > > latency (MULT) >= latency (ADD)? > > > > > > > > ".. the first variant is preferred." > > > > > > > > + > > > > + Find out if we can get a smaller width considering FMA. */ > > > > > > Corrected these errors. Thank you for the corrections. > > > > > > > + /* When flag_fully_pipelined_fma is set, assumes symmetric pipes > > are > > > > used > > > > + for FMUL/FADD/FMA. */ > > > > + int lat_mul = get_mult_latency_consider_fma (ops_num, mult_num, > > width); > > > > > > > > what does "symmetric pipes" actually mean? For x86 Zen3 we have > > > > two FMA pipes (that can also do ADD and MUL) and one pipe that can > > > > do ADD. Is that then non-symmetric because we can issue more adds > > > > in parallel than FMA/MUL? > > > > btw, I double-checked and Zen3/4 have two pipes for FMUL/FMA and two > > separate pipes for FADD, the FMUL/FMA pipes cannot do FADD. FADD > > has a latency of 3 cycles while FMUL/FMA has a latency of 4 cycles. > > > > I'd say Zen is then not "symmetric" as in your definition? I do wonder > > what part of the pipeline characteristic could be derived from the > > reassoc_width target hook (maybe the number of pipes but not whether > > they are shared with another op). In theory the scheduling description > > could offer the info (if correct and precise enough), but I don't think > > there's > > a good way to query this details. > > Yes, that's not "symmetric". The current code is at least confusing > in that scenario. For instance, get_mult_latency_consider_fma uses > the width of FMUL, and get_required_cycles should use the width of > FADD, I think. > > As you pointed out earlier, a problem with reassociation is that it > works locally on single used operator lists, so the result may not > be globally optimum. For example, if using 2 pipes results in 4 circles, > and using 1 pipe results in 5 circles; the latter might be preferable > for saving 1 pipe for other calculations (in the basic block), if > there is any. If to enhance reassociation for this, it seems we need > to know the arrangement of pipes for each kind of operator? (Perhaps > targetm.sched.reassociation_width could return an index number > indicating the set of pipes for the operator, along with the width. > But this may not cover cases like Haswell, which has ports 0 and 1 > for multiply and port 1 for addition. Besides, I'm not yet clear > about how the algorithm should be.)
Yeah, while targetm.sched.reassociation_width is at least per operation and mode I expect it to return a "heuristic" value trying to factor this in. I think it would be useful to look whether generating a query API from the scheduler descriptions is possible somehow. Currently the main issue is that the pipeline description only indirectly couples to define_insns (via insn attributes usually) and those in turn only indirectly lead to the actual operation implemented (unless it's a define_expand and if not only if a RTL pattern is present). So I guess that the easiest way would be to amend the scheduler descriptions with optional, say, (define_cpu_unit_for_code "port0" "PLUS_EXPR") somehow also specifying the modes applicable. But maybe reverse engineering this from the scheduler + insn descriptions is reasonably possible as well (the information is more-or-less there I think). For just FMA we could also invent a new target hook to describe the setup of the plus/mult/fma pipes and their latency but for more precise cost modeling of say vectorization or unrolling a scheduling model that can be applied to all "gimple" operations is needed. Thanks, Richard. > Committed the patch at 8afdbcdd. > > Thanks, > Di > > > > > > "symmetric pipes" was to indicate that FADD/FMA/FMUL use the same unit > > > set, so the widths are uniform, and the calculations in this patch can > > > apply. I think this can be relaxed for scenarios like Zen3, by searching > > > for a smaller width only using the pipes for FMUL/FMA. But if the pipes > > > for FMUL and FADD are separated, for example 1 for FMA/FMUL and 2 other > > > pipes for FADD, then the minimum circle might be incorrect. > > > > > > Changed the descriptions and code in get_reassociation_width a bit to > > > include the case like Zen3: > > > > > > + /* When param_fully_pipelined_fma is set, assume FMUL and FMA use > > > the > > > + same units that can also do FADD. For other scenarios, such as > > when > > > + FMUL and FADD are using distinct units, the following code may > > > not > > > + apply. */ > > > + int width_mult = targetm.sched.reassociation_width (MULT_EXPR, > > > mode); > > > + gcc_checking_assert (width_mult <= width); > > > + > > > + /* Latency of MULT_EXPRs. */ > > > + int lat_mul > > > + = get_mult_latency_consider_fma (ops_num, mult_num, width_mult); > > > > The updated patch is OK. > > > > Thanks for your patience. > > > > Thanks, > > Richard. > > > > > > Otherwise this looks OK now. > > > > > > > > Thanks, > > > > Richard. > > > > > > > > > > > > + /* If there's loop dependent FMA result, return width=2 to > > avoid it. > > > > > > This > > > > > > > > is > > > > > > > > + better than skipping these FMA candidates in widening_mul. > > */ > > > > > > > > > > > > > > > > better than skipping, but you don't touch it there? I suppose > > width > > > > == 2 > > > > > > > > will bypass the skipping, right? This heuristic only comes in > > when > > > > the > > > > > > above > > > > > > > > change made width == 1, since otherwise we have an earlier > > > > > > > > > > > > > > > > if (width == 1) > > > > > > > > return width; > > > > > > > > > > > > > > > > which als guarantees width == 2 was allowed by the hook/param, > > right? > > > > > > > > > > > > > > Yes, that's right. > > > > > > > > > > > > > > > > > > > > > > > + if (width == 1 && mult_num > > > > > > > > + && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE > > (lhs))), > > > > > > > > + param_avoid_fma_max_bits)) > > > > > > > > + { > > > > > > > > + /* Look for cross backedge dependency: > > > > > > > > + 1. LHS is a phi argument in the same basic block it is > > defined. > > > > > > > > + 2. And the result of the phi node is used in OPS. */ > > > > > > > > + basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs)); > > > > > > > > + gimple_stmt_iterator gsi; > > > > > > > > + for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); > > > > > > > > gsi_next > > > > (&gsi)) > > > > > > > > + { > > > > > > > > + gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi)); > > > > > > > > + for (unsigned i = 0; i < gimple_phi_num_args (phi); > > > > > > > > ++i) > > > > > > > > + { > > > > > > > > + tree op = PHI_ARG_DEF (phi, i); > > > > > > > > + if (!(op == lhs && gimple_phi_arg_edge (phi, > > > > > > > > i)->src > > == > > > > bb)) > > > > > > > > + continue; > > > > > > > > > > > > > > > > I think it's easier to iterate over the immediate uses of LHS > > > > > > > > like > > > > > > > > > > > > > > > > FOR_EACH_IMM_USE_FAST (use_p, iter, lhs) > > > > > > > > if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p))) > > > > > > > > { > > > > > > > > if (gimple_phi_arg_edge (phi, phi_arg_index_from_use > > > > > > > > (use_p))->src != bb) > > > > > > > > continue; > > > > > > > > ... > > > > > > > > } > > > > > > > > > > > > > > > > otherwise I think _this_ part of the patch looks reasonable. > > > > > > > > > > > > > > > > As you say heuristically they might go together but I think we > > should > > > > > > split > > > > > > > > the > > > > > > > > patch - the cross-loop part can probably stand independently. > > > > > > > > Can > > you > > > > > > adjust > > > > > > > > and re-post? > > > > > > > > > > > > > > Attached is the separated part for cross-loop FMA. Thank you for > > > > > > > the > > > > > > correction. > > > > > > > > > > > > That cross-loop FMA patch is OK. > > > > > > > > > > Committed this part at 746344dd. > > > > > > > > > > Thanks, > > > > > Di > > > > > > > > > > > > > > > > > Thanks, > > > > > > Richard. > > > > > > > > > > > > > > > > > > > > > > As for the first part I still don't understand very well and am > > still > > > > > > hoping > > > > > > > > we > > > > > > > > can get away without yet another knob to tune. > > > > > > > > > > > > > > > > Richard. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. To avoid regressions, included the other patch > > > > > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023- > > > > > > September/629203.html) > > > > > > > > > > > on this tracker again. This is because more FMA will be > > > > > > > > > > > kept > > > > > > > > > > > with 1., so we need to rule out the loop dependent > > > > > > > > > > > FMA chains when param_avoid_fma_max_bits is set. > > > > > > > > > > > > > > > > > > > > Sorry again for taking so long to reply. > > > > > > > > > > > > > > > > > > > > I'll note we have an odd case on x86 Zen2(?) as well which > > > > > > > > > > we > > > > don't > > > > > > really > > > > > > > > > > understand from a CPU behavior perspective. > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Richard. > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > Di Zhao > > > > > > > > > > > > > > > > > > > > > > ---- > > > > > > > > > > > > > > > > > > > > > > PR tree-optimization/110279 > > > > > > > > > > > > > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > > > > > > > > > > > > > * tree-ssa-reassoc.cc > > > > (rank_ops_for_better_parallelism_p): > > > > > > > > > > > New function to check whether ranking the ops > > results in > > > > > > > > > > > better parallelism. > > > > > > > > > > > (get_reassociation_width): Add new parameters. > > Search > > > > for > > > > > > > > > > > smaller width considering the benefit of FMA. > > > > > > > > > > > (rank_ops_for_fma): Change return value to be > > > > > > > > > > > number > > of > > > > > > > > > > > MULT_EXPRs. > > > > > > > > > > > (reassociate_bb): For 3 ops, refine the condition > > > > > > > > > > > to > > > > call > > > > > > > > > > > swap_ops_for_binary_stmt. > > > > > > > > > > > > > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > > > > > > > > > > > > > * gcc.dg/pr110279.c: New test. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Di Zhao > > > > > > > > > > > > > > > > > > ---- > > > > > > > > > > > > > > > > > > PR tree-optimization/110279 > > > > > > > > > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > > > > > > > > > * doc/invoke.texi: Description of > > param_max_fma_chain_len. > > > > > > > > > * params.opt: New parameter param_max_fma_chain_len. > > > > > > > > > * tree-ssa-reassoc.cc (get_reassociation_width): > > > > > > > > > Support param_max_fma_chain_len; check for loop > > dependent > > > > > > > > > FMAs. > > > > > > > > > (rank_ops_for_fma): Return the number of MULT_EXPRs. > > > > > > > > > (reassociate_bb): For 3 ops, refine the condition to > > call > > > > > > > > > swap_ops_for_binary_stmt. > > > > > > > > > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > > > > > > > > > * gcc.dg/pr110279-1.c: New test. > > > > > > > > > * gcc.dg/pr110279-2.c: New test. > > > > > > > > > * gcc.dg/pr110279-3.c: New test. > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > PR tree-optimization/110279 > > > > > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > > > > > * tree-ssa-reassoc.cc (get_reassociation_width): check > > > > > > > for loop dependent FMAs. > > > > > > > (reassociate_bb): For 3 ops, refine the condition to call > > > > > > > swap_ops_for_binary_stmt. > > > > > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > > > > > * gcc.dg/pr110279-1.c: New test. > > > > > --- > > > > > > > > > > PR tree-optimization/110279 > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > * common.opt: New flag fully-pipelined-fma. > > > > > * tree-ssa-reassoc.cc (get_mult_latency_consider_fma): > > > > > Return latency of MULT_EXPRs that can't be hided by FMA. > > > > > (get_reassociation_width): Search for smaller widths > > > > > considering the benefit of fully pipelined FMA. > > > > > (rank_ops_for_fma): Return the number of MULT_EXPRs. > > > > > (reassociate_bb): Pass the number of MULT_EXPRs to > > > > > get_reassociation_width; avoid calling > > > > > get_reassociation_width twice. > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > * gcc.dg/pr110279-2.c: New test. > > > > > > Thanks, > > > Di > > > > > > --- > > > > > > PR tree-optimization/110279 > > > > > > gcc/ChangeLog: > > > > > > * doc/invoke.texi: New parameter fully-pipelined-fma. > > > * params.opt: New parameter fully-pipelined-fma. > > > * tree-ssa-reassoc.cc (get_mult_latency_consider_fma): Return > > > the latency of MULT_EXPRs that can't be hidden by the FMAs. > > > (get_reassociation_width): Search for a smaller width > > > considering the benefit of fully pipelined FMA. > > > (rank_ops_for_fma): Return the number of MULT_EXPRs. > > > (reassociate_bb): Pass the number of MULT_EXPRs to > > > get_reassociation_width; avoid calling > > > get_reassociation_width twice. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.dg/pr110279-2.c: New test.