On Mon, 19 Feb 2024, Richard Sandiford wrote:
> Richard Biener <[email protected]> writes:
> > The following tries to address the PHI insertion compile-time hog in
> > RTL fwprop observed with the PR54052 testcase where the loop computing
> > the "unfiltered" set of variables possibly needing PHI nodes for each
> > block exhibits quadratic compile-time and memory-use.
> >
> > Instead of only pruning the set of candidate regs by LR_IN in the
> > second worklist loop do this when computing "unfiltered" already.
> >
> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >
> > I'll note that in PR98863 you said in comment #39:
> >
> > "Just to give an update on this: I have a patch that reduces the
> > amount of memory consumed by fwprop so that it no longer seems
> > to be an outlier. However, it involves doing more bitmap operations.
> > In this testcase we have a larger number of registers that
> > seem to be live but unused across a large region of code,
> > so bitmap ANDs with the live in sets are expensive and hit
> > the worst-case O(nblocks * nregisters). I'm still trying to find
> > a way of reducing the effect of that."
> >
> > suggesting that the very AND operation I'm introducing below
> > was an actual problem. It's just not very clear which testcase
> > this was on (the PR doesn't have one, it just talks about WRF with LTO
> > and then some individual TUs of it).
>
> Yeah, like you say, I think this kind of AND was exactly the problem.
> If the DEF set is much smaller than the IN set, we can spend a lot of
> compile time (and cache) iterating over the leading elements of the
> IN set. So this could be trading one hog for another.
>
> Could we use some heuristic to choose between the two? If the IN set
> is "sensible", do the AND, otherwise keep it as-is?
I'm not sure how to do that cheaply, though: I don't think DF caches the
set sizes (we could compute them, of course).  But I just made an
experiment: using DF_LR_OUT instead of DF_LR_BB_INFO->def improves
compile-time as well.  So, incremental on top of the posted patch:
diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
index 9d1cd1b0365..6fd08602c1b 100644
--- a/gcc/rtl-ssa/blocks.cc
+++ b/gcc/rtl-ssa/blocks.cc
@@ -645,12 +645,11 @@ function_info::place_phis (build_info &bi)
       if (bitmap_empty_p (&frontiers[b1]))
 	continue;
 
-      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def;
+      bitmap b1_def = DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1));
       bitmap_iterator bmi;
       unsigned int b2;
       EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
-	if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
-				 DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2)))
+	if (bitmap_ior_into (&unfiltered[b2], b1_def)
 	    && !bitmap_empty_p (&frontiers[b2]))
 	  // Propagate the (potential) new phi node definitions in B2.
 	  bitmap_set_bit (worklist, b2);
Of course that set is too big (it also includes registers that are merely
live through the block), but we could instead prune the block's DEF set by
DF_LR_OUT, like:
diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
index 9d1cd1b0365..6a4dd05908f 100644
--- a/gcc/rtl-ssa/blocks.cc
+++ b/gcc/rtl-ssa/blocks.cc
@@ -645,12 +645,13 @@ function_info::place_phis (build_info &bi)
       if (bitmap_empty_p (&frontiers[b1]))
 	continue;
 
-      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def;
+      auto_bitmap b1_def;
+      bitmap_and (b1_def, &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def,
+		  DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1)));
       bitmap_iterator bmi;
       unsigned int b2;
       EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
-	if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
-				 DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2)))
+	if (bitmap_ior_into (&unfiltered[b2], b1_def)
 	    && !bitmap_empty_p (&frontiers[b2]))
 	  // Propagate the (potential) new phi node definitions in B2.
 	  bitmap_set_bit (worklist, b2);
So for the testcase it seems we have a lot of defs that are local to their
block but not used "globally", i.e. they are not live out of the defining
block.
Would that work for you, or am I missing something?
Thanks,
Richard.