Hi,

The previous version of this patch tried to solve two problems
at the same time. For better clarity, I'll separate them and
only deal with the "nested" FMA issue in this version. I plan to
propose another patch to avoid badly shaped FMAs (deferring FMA).

Other changes:

1. Added new testcases for the "nested" FMA issue. For the
   following code:

        tmp1 = a + c * c + d * d + x * y;
        tmp2 = x * tmp1;
        result += (a + c + d + tmp2);

   When "tmp1 = ..." is not rewritten, tmp1 will be the result of
   an FMA, and there will be a list of consecutive FMAs:

        _1 = .FMA (c, c, a_39);
        _2 = .FMA (d, d, _1);
        tmp1 = .FMA (x, y, _2);
        _3 = .FMA (tmp1, x, d);
        ...
   
   If "tmp1 = ..." is rewritten to parallel, tmp1 will be the
   result of a PLUS_EXPR between FMAs:

        _1 = .FMA (c, c, a_39);
        _2 = x * y;
        _3 = .FMA (d, d, _2);
        tmp1 = _3 + _1;
        _4 = .FMA (tmp1, x, d);
        ...

   The register pressure of the latter seems to be higher than
   that of the former. On all the test machines we have
   (including Ampere1, Neoverse-n1 and Intel Xeon), the run time
   increased significantly when "tmp1 = ..." is rewritten to
   parallel. In contrast, when "tmp1" is not the 1st or 2nd
   operand of another FMA (pr110279-1.c), rewriting it results in
   better performance. (I'll also attach the testcases to the bug
   tracker.)
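
For readers who want to try the two shapes outside the compiler, here
is a rough standalone C sketch. It is not the content of the new
testcases: the function names, the loop and its trip count are
assumptions made for illustration only. Each multiply-add statement in
the two "shape" functions corresponds to one candidate .FMA in the
dumps quoted above.

```c
/* Hypothetical kernel wrapped around the three statements quoted
   above.  The loop and the parameter n are illustrative assumptions,
   not taken from pr110279-2.c.  */
double nested_fma_kernel(double a, double c, double d,
                         double x, double y, int n)
{
    double result = 0.0;
    for (int i = 0; i < n; i++) {
        double tmp1 = a + c * c + d * d + x * y;
        double tmp2 = x * tmp1;
        result += (a + c + d + tmp2);
    }
    return result;
}

/* Shape when "tmp1 = ..." is not rewritten: every multiply-add
   depends on the previous one, forming a single chain.  */
double shape_serial(double a, double c, double d, double x, double y)
{
    double t1 = c * c + a;      /* candidate FMA */
    double t2 = d * d + t1;     /* candidate FMA */
    double tmp1 = x * y + t2;   /* candidate FMA */
    return tmp1 * x + d;        /* candidate FMA using tmp1 as a multiplicand */
}

/* Shape when "tmp1 = ..." is rewritten to parallel: two independent
   sub-chains joined by a plain addition, so tmp1 is no longer the
   result of an FMA.  */
double shape_parallel(double a, double c, double d, double x, double y)
{
    double t1 = c * c + a;      /* candidate FMA */
    double t2 = x * y;          /* plain multiplication */
    double t3 = d * d + t2;     /* candidate FMA */
    double tmp1 = t3 + t1;      /* plain addition joining the sub-chains */
    return tmp1 * x + d;        /* candidate FMA using tmp1 as a multiplicand */
}
```

Both shape functions compute the same mathematical value; they differ
only in the dependency structure. Whether the compiler actually picks
one shape or the other for the kernel depends on the options in use
(reassociation of floating-point operations needs something like
-Ofast, and a reassociation width greater than 1), so please check the
dumps rather than rely on this sketch.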

2. Enhanced checking for nested FMA by: 1) modifying
   convert_mult_to_fma so it can return multiple LHS; 2) checking
   NEGATE_EXPRs for nested FMA.
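
To make the idea of the check concrete, below is a toy model in C. It
does not use GCC's GIMPLE API and does not mirror the code in the
patch: the types toy_stmt and toy_code and the functions feeds_plus and
toy_has_nested_fma are all hypothetical names invented for this sketch.
It only illustrates the question being asked: is a given value (or its
negation) a multiplicand of a multiplication whose product feeds an
addition, so that the multiplication could itself become an FMA?

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy statement codes; hypothetical stand-ins, not GCC tree codes.  */
enum toy_code { TOY_MULT, TOY_PLUS, TOY_NEGATE };

/* One toy statement "lhs = op1 <code> op2".  Values are small
   non-negative integer ids; op2 is -1 for TOY_NEGATE.  The statement
   list is assumed to be acyclic (each id is defined once, before its
   uses), which keeps the recursion below finite.  */
struct toy_stmt {
    enum toy_code code;
    int lhs;
    int op1;
    int op2;
};

/* Return true if VAL is an operand of a TOY_PLUS, looking through
   TOY_NEGATE statements that negate VAL.  */
static bool feeds_plus(const struct toy_stmt *s, size_t n, int val)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i].op1 != val && s[i].op2 != val)
            continue;
        if (s[i].code == TOY_PLUS)
            return true;
        if (s[i].code == TOY_NEGATE && feeds_plus(s, n, s[i].lhs))
            return true;
    }
    return false;
}

/* Return true if VAL, or a negation of VAL, is a multiplicand of a
   TOY_MULT whose product feeds a TOY_PLUS, i.e. VAL would be the 1st
   or 2nd operand of another FMA candidate.  */
static bool toy_has_nested_fma(const struct toy_stmt *s, size_t n, int val)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i].op1 != val && s[i].op2 != val)
            continue;
        if (s[i].code == TOY_MULT && feeds_plus(s, n, s[i].lhs))
            return true;
        if (s[i].code == TOY_NEGATE && toy_has_nested_fma(s, n, s[i].lhs))
            return true;
    }
    return false;
}
```

With id 1 standing for tmp1, the statements "tmp2 = x * tmp1; sum = d +
tmp2" make toy_has_nested_fma return true, and the same holds if tmp1
is negated before the multiplication. If tmp1 is only used in an
addition, it returns false.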

(I think this could be further refined by enabling rewriting to
parallel for very long op lists.)

Bootstrapped and regression tested on x86_64-linux-gnu.

Thanks,
Di Zhao

----

        PR tree-optimization/110279

gcc/ChangeLog:

        * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
        new parameter collect_lhs.
        (struct fma_transformation_info): Moved to header.
        (class fma_deferring_state): Moved to header.
        (convert_mult_to_fma): Added new parameter collect_lhs.
        * tree-ssa-math-opts.h (struct fma_transformation_info):
        (class fma_deferring_state): Moved from .cc.
        (convert_mult_to_fma): Moved from .cc.
        * tree-ssa-reassoc.cc (enum fma_state): Defined enum to
        describe the state of FMA candidates for a list of
        operands.
        (rewrite_expr_tree_parallel): Changed boolean parameter
        to enum type.
        (has_nested_fma_p): New function to check for nested FMA
        on given multiplication statement.
        (rank_ops_for_fma): Return enum fma_state.
        (reassociate_bb): Avoid rewriting to parallel if nested
        FMAs are found.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279-1.c: New test.
        * gcc.dg/pr110279-2.c: New test.

Attachment: tree-optimization-110279-Check-for-nested-FMA-in-rea.patch