This is a new version of the patch on "nested FMA".
Sorry for updating this after so long; I've been studying and
writing micro cases to sort out the cause of the regression.

First, following previous discussion:
(https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)

1. From testing more altered cases, I don't think the
problem is that reassociation works locally. Specifically:

  1) On the example with multiplications:
        
        tmp1 = a + c * c + d * d + x * y;
        tmp2 = x * tmp1;
        result += (a + c + d + tmp2);

  Given "result" rewritten by width=2, the performance is
  worse if we rewrite "tmp1" with width=2. In contrast, if we
  remove the multiplications from the example (and make "tmp1"
  not singe used), and still rewrite "result" by width=2, then
  rewriting "tmp1" with width=2 is better. (Make sense because
  the tree's depth at "result" is still smaller if we rewrite
  "tmp1".)

  2) I tried modifying the assembly code of the example
  without FMA, so that the width of "result" is 4. On Ampere1
  there's no obvious improvement. So although this is an
  interesting problem, it doesn't seem to be the cause of the
  regression.
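
  For reference, below is a minimal compilable version of the
  micro case in 1). The loop wrapper, the parameter types and
  the update of "x" are my assumptions for illustration; only
  the three statements in the body come from the example.

        /* Micro case from 1), wrapped in an assumed loop.  */
        double
        micro_case (double a, double c, double d, double x,
                    double y, int n)
        {
          double result = 0.0;
          for (int i = 0; i < n; i++)
            {
              double tmp1 = a + c * c + d * d + x * y;
              double tmp2 = x * tmp1;
              result += a + c + d + tmp2;
              x += y;  /* Vary "x" so "tmp1" isn't loop-invariant.  */
            }
          return result;
        }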

2. From the assembly code of the case with FMA, one problem
is that rewriting "tmp1" to parallel form doesn't decrease
the minimum CPU cycles (taking MULT_EXPRs into account), but
does increase code size, so the overhead increases.

   a) When "tmp1" is not re-written to parallel:
        fmadd d31, d2, d2, d30
        fmadd d31, d3, d3, d31
        fmadd d31, d4, d5, d31  //"tmp1"                
        fmadd d31, d31, d4, d3

   b) When "tmp1" is re-written to parallel:
        fmul  d31, d4, d5      
        fmadd d27, d2, d2, d30 
        fmadd d31, d3, d3, d31 
        fadd  d31, d31, d27     //"tmp1"
        fmadd d31, d31, d4, d3

For version a), there are 3 dependent FMAs to calculate "tmp1".
For version b), there are also 3 dependent instructions on the
longer path (the 1st, 3rd and 4th), so the critical path is not
shortened, while an extra instruction is emitted.
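
To make this concrete, here is a rough estimate, assuming
(purely for illustration) a uniform latency L for
fmul/fadd/fmadd and no issue restrictions:

        a) critical path: fmadd -> fmadd -> fmadd -> fmadd = 4*L, 4 insns
        b) critical path: fmul  -> fmadd -> fadd  -> fmadd = 4*L, 5 insns

Under this simple model, the parallel rewrite of "tmp1" saves
no latency but costs one extra instruction.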

So it seems to me that the current get_reassociation_width
algorithm isn't optimal in the presence of FMA. I therefore
modified the patch to improve get_reassociation_width, rather
than to check for code patterns. (There could be other, more
complicated factors that make the regression more obvious when
there's "nested FMA", but with this patch the regression should
be avoided or reduced.)

With this patch, the 508.namd_r 1-copy run improves by 7% on
Ampere1 and by about 3% on Intel Xeon. While I'm still
collecting data on other CPUs, I'd like to know what you
think of this approach.

About changes in the patch:

1. When the op list forms a complete FMA chain, try to search
for a smaller width, taking the benefit of using FMA into
account. With a smaller width, the code-size increase from
breaking the chain is smaller. A simplified sketch of the
idea is below.
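
The following is only a rough sketch of that search, with a
hypothetical name and a much-simplified cost model (it is not
the code in the patch): keeping the chain serial costs one FMA
per op, while width w needs ceil (num_ops / w) rounds of
accumulation plus (w - 1) additions to combine partial sums.

        /* Hypothetical sketch (not the code in the patch):
           find the smallest width that shortens a complete FMA
           chain, to limit the code-size increase.  Returns 1 to
           keep the chain serial.  */
        static int
        search_fma_width (int num_ops, int max_width)
        {
          int serial_cost = num_ops;  /* One FMA per op.  */
          for (int width = 2; width <= max_width; width++)
            {
              /* ceil (num_ops / width) accumulation rounds plus
                 (width - 1) final additions.  */
              int parallel_cost
                = (num_ops + width - 1) / width + (width - 1);
              if (parallel_cost < serial_cost)
                return width;
            }
          return 1;
        }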

2. To avoid regressions, I included the other patch
(https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
in this tracker again. This is because more FMAs will be kept
by 1., so we need to rule out the loop-dependent FMA chains
when param_avoid_fma_max_bits is set. An illustrative example
of such a chain is below.
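
For reference, a loop-dependent FMA chain looks like the
reduction below (an illustrative example of mine, not from the
testsuite): once the multiply-add is contracted into an FMA,
each iteration's FMA consumes the previous iteration's "s", so
consecutive FMAs cannot overlap.

        /* Loop-dependent FMA chain: each iteration computes
           s = fma (a[i], b[i], s), which depends on the
           previous iteration's result.  */
        double
        dot_product (const double *a, const double *b, int n)
        {
          double s = 0.0;
          for (int i = 0; i < n; i++)
            s += a[i] * b[i];
          return s;
        }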

Thanks,
Di Zhao

----

        PR tree-optimization/110279

gcc/ChangeLog:

        * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
        New function to check whether ranking the ops results in
        better parallelism.
        (get_reassociation_width): Add new parameters. Search for
        smaller width considering the benefit of FMA.
        (rank_ops_for_fma): Change return value to be number of
        MULT_EXPRs.
        (reassociate_bb): For 3 ops, refine the condition to call
        swap_ops_for_binary_stmt.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279.c: New test.

Attachment: 0001-Consider-FMA-in-get_reassociation_width.patch
