On Mon, 19 Feb 2024, Richard Sandiford wrote:
> Richard Biener <[email protected]> writes:
> > The following tries to address the PHI insertion compile-time hog in
> > RTL fwprop observed with the PR54052 testcase where the loop computing
> > the "unfiltered" set of variables possibly needing PHI nodes for each
> > block exhibits quadratic compile-time and memory-use.
> >
> > Instead of only pruning the set of candidate regs by LR_IN in the
> > second worklist loop do this when computing "unfiltered" already.
> >
> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >
> > I'll note that in PR98863 you said in comment #39:
> >
> > "Just to give an update on this: I have a patch that reduces the
> > amount of memory consumed by fwprop so that it no longer seems
> > to be an outlier. However, it involves doing more bitmap operations.
> > In this testcase we have a larger number of registers that
> > seem to be live but unused across a large region of code,
> > so bitmap ANDs with the live in sets are expensive and hit
> > the worst-case O(nblocks * nregisters). I'm still trying to find
> > a way of reducing the effect of that."
> >
> > suggesting that the very AND operation I'm introducing below
> > was an actual problem. It's just not very clear which testcase
> > this was on (the PR doesn't have one, it just talks about WRF with LTO
> > and then some individual TUs of it).
>
> Yeah, like you say, I think this kind of AND was exactly the problem.
> If the DEF set is much smaller than the IN set, we can spend a lot of
> compile time (and cache) iterating over the leading elements of the
> IN set. So this could be trading one hog for another.
>
> Could we use some heuristic to choose between the two? If the IN set
> is "sensible", do the AND, otherwise keep it as-is?
I'm not sure how to do that cheaply, though: I don't think DF caches the
set sizes (we could compute them, of course).  But I just made an
experiment: using DF_LR_OUT instead of DF_LR_BB_INFO->def improves
compile-time as well.  So, incremental on top of the posted patch:
diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
index 9d1cd1b0365..6fd08602c1b 100644
--- a/gcc/rtl-ssa/blocks.cc
+++ b/gcc/rtl-ssa/blocks.cc
@@ -645,12 +645,11 @@ function_info::place_phis (build_info &bi)
       if (bitmap_empty_p (&frontiers[b1]))
 	continue;
 
-      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def;
+      bitmap b1_def = DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1));
       bitmap_iterator bmi;
       unsigned int b2;
       EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
-	if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
-				 DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2)))
+	if (bitmap_ior_into (&unfiltered[b2], b1_def)
 	    && !bitmap_empty_p (&frontiers[b2]))
 	  // Propagate the (potential) new phi node definitions in B2.
 	  bitmap_set_bit (worklist, b2);
Of course that set is too big (it also includes registers that are merely
live through the block), but we could instead prune the block's DEF set by
DF_LR_OUT, like:
diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
index 9d1cd1b0365..6a4dd05908f 100644
--- a/gcc/rtl-ssa/blocks.cc
+++ b/gcc/rtl-ssa/blocks.cc
@@ -645,12 +645,13 @@ function_info::place_phis (build_info &bi)
       if (bitmap_empty_p (&frontiers[b1]))
 	continue;
 
-      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def;
+      auto_bitmap b1_def;
+      bitmap_and (b1_def, &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def,
+		  DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1)));
       bitmap_iterator bmi;
       unsigned int b2;
       EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
-	if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
-				 DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2)))
+	if (bitmap_ior_into (&unfiltered[b2], b1_def)
 	    && !bitmap_empty_p (&frontiers[b2]))
 	  // Propagate the (potential) new phi node definitions in B2.
 	  bitmap_set_bit (worklist, b2);
So for the testcase it seems we have a lot of defs that are local to their
block but not used "globally", i.e. they are not live out of the defining
block.
Would that work for you, or am I missing something?
Thanks,
Richard.