Hi,

The previous version of this patch tried to solve two problems at the same time. For clarity, I have separated them, and this version only deals with the "nested" FMA issue. I plan to propose a separate patch for avoiding badly shaped FMAs (deferring FMA).
Other changes:
1. Added new testcases for the "nested" FMA issue. For the
following code:
tmp1 = a + c * c + d * d + x * y;
tmp2 = x * tmp1;
result += (a + c + d + tmp2);
when "tmp1 = ..." is not rewritten, tmp1 will be the result of
an FMA, and there will be a list of consecutive FMAs:
_1 = .FMA (c, c, a_39);
_2 = .FMA (d, d, _1);
tmp1 = .FMA (x, y, _2);
_3 = .FMA (tmp1, x, d);
...
If "tmp1 = ..." is rewritten to parallel, tmp1 will be the result
of a PLUS_EXPR between FMAs:
_1 = .FMA (c, c, a_39);
_2 = x * y;
_3 = .FMA (d, d, _2);
tmp1 = _3 + _1;
_4 = .FMA (tmp1, x, d);
...
It seems the register pressure of the latter is higher than
that of the former. On all the test machines we have (including
Ampere1, Neoverse-n1 and Intel Xeon), the run time increased
significantly when "tmp1 = ..." was rewritten to parallel. In
contrast, when "tmp1" is not the 1st or 2nd operand of another
FMA (pr110279-1.c), rewriting it results in better performance.
(I'll also add the testcases to the bug tracker.)
2. Enhanced checking for nested FMA by: 1) modifying
convert_mult_to_fma so it can return multiple LHS; 2) checking
NEGATE_EXPRs for nested FMA.
(I think this could be further refined by enabling rewriting
to parallel for very long op lists.)
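To make the two shapes from item 1 concrete, here is a minimal C sketch. It is not part of the patch or the new testcases. The helper madd() is a hypothetical stand-in for .FMA (a plain multiply-add that the compiler may or may not contract), and all function names are made up for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for GIMPLE's .FMA (a, b, c) = a * b + c.  */
double madd (double a, double b, double c)
{
  return a * b + c;
}

/* Shape 1: "tmp1 = ..." is not rewritten.  Each multiply-add consumes
   the previous result, so the three operations form one chain.  */
double tmp1_chain (double a, double c, double d, double x, double y)
{
  double t1 = madd (c, c, a);
  double t2 = madd (d, d, t1);
  return madd (x, y, t2);
}

/* Shape 2: rewritten to parallel.  Two independent sub-chains are
   joined by a plain addition, so tmp1 comes from a PLUS_EXPR.  */
double tmp1_parallel (double a, double c, double d, double x, double y)
{
  double t1 = madd (c, c, a);
  double t2 = x * y;
  double t3 = madd (d, d, t2);
  return t3 + t1;
}

/* The "nested" use: tmp1 is the first operand of another multiply-add.  */
double nested_use (double tmp1, double x, double d)
{
  return madd (tmp1, x, d);
}

/* Sanity check with small integer inputs, where every intermediate
   value is exactly representable and both shapes must agree.  */
void check_shapes_agree (void)
{
  double s1 = tmp1_chain (1.0, 2.0, 3.0, 4.0, 5.0);
  double s2 = tmp1_parallel (1.0, 2.0, 3.0, 4.0, 5.0);
  assert (s1 == s2);
  assert (nested_use (s1, 4.0, 3.0) == nested_use (s2, 4.0, 3.0));
}
```

In general the two shapes can differ in the last bits because reassociation changes the rounding; the inputs in check_shapes_agree() are chosen so that they do not. The sketch only shows the dependency structure and says nothing about run time.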
Bootstrapped and regression tested on x86_64-linux-gnu.
Thanks,
Di Zhao
----
PR tree-optimization/110279
gcc/ChangeLog:
* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
new parameter collect_lhs.
(struct fma_transformation_info): Moved to header.
(class fma_deferring_state): Moved to header.
(convert_mult_to_fma): Added new parameter collect_lhs.
* tree-ssa-math-opts.h (struct fma_transformation_info):
(class fma_deferring_state): Moved from .cc.
(convert_mult_to_fma): Moved from .cc.
* tree-ssa-reassoc.cc (enum fma_state): Defined enum to
describe the state of FMA candidates for a list of
operands.
(rewrite_expr_tree_parallel): Changed boolean parameter
to enum type.
(has_nested_fma_p): New function to check for nested FMA
on given multiplication statement.
(rank_ops_for_fma): Return enum fma_state.
(reassociate_bb): Avoid rewriting to parallel if nested
FMAs are found.
gcc/testsuite/ChangeLog:
* gcc.dg/pr110279-1.c: New test.
* gcc.dg/pr110279-2.c: New test.
tree-optimization-110279-Check-for-nested-FMA-in-rea.patch
