On Mon, 19 Feb 2024, Richard Sandiford wrote: > Richard Biener <rguent...@suse.de> writes: > > The following tries to address the PHI insertion compile-time hog in > > RTL fwprop observed with the PR54052 testcase where the loop computing > > the "unfiltered" set of variables possibly needing PHI nodes for each > > block exhibits quadratic compile-time and memory-use. > > > > Instead of only pruning the set of candidate regs by LR_IN in the > > second worklist loop do this when computing "unfiltered" already. > > > > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress. > > > > I'll note that in PR98863 you say in comment#39 > > > > "Just to give an update on this: I have a patch that reduces the > > amount of memory consumed by fwprop so that it no longer seems > > to be outlier. However, it involves doing more bitmap operations. > > In this testcase we have a larger number of registers that > > seem to be live but unused across a large region of code, > > so bitmap ANDs with the live in sets are expensive and hit > > the worst-case O(nblocksnregisters). I'm still trying to find > > a way of reducing the effect of that." > > > > suggesting that the very AND operation I'm introducing below > > was an actual problem. It's just not very clear what testcase > > this was on (the PR hasn't one, it just talks about WRF with LTO > > and then some individual TUs of it). > > Yeah, like you say, I think this kind of AND was exactly the problem. > If the DEF set is much smaller than the IN set, we can spend a lot of > compile time (and cache) iterating over the leading elements of the > IN set. So this could be trading one hog for another. > > Could we use some heuristic to choose between the two? If the IN set > is "sensible", do the AND, otherwise keep it as-is?
Not sure how, I don't think DF caches the set sizes (we could compute them, of course). But I just made an experiment and using DF_LR_OUT instead of DF_LR_BB_INFO->def improves compile-time as well. So incremental ontop of the posted: diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc index 9d1cd1b0365..6fd08602c1b 100644 --- a/gcc/rtl-ssa/blocks.cc +++ b/gcc/rtl-ssa/blocks.cc @@ -645,12 +645,11 @@ function_info::place_phis (build_info &bi) if (bitmap_empty_p (&frontiers[b1])) continue; - bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def; + bitmap b1_def = DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1)); bitmap_iterator bmi; unsigned int b2; EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi) - if (bitmap_ior_and_into (&unfiltered[b2], b1_def, - DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2))) + if (bitmap_ior_into (&unfiltered[b2], b1_def) && !bitmap_empty_p (&frontiers[b2])) // Propagate the (potential) new phi node definitions in B2. bitmap_set_bit (worklist, b2); of course that's too big (including live-through), but we could prune by DF_LR_OUT like diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc index 9d1cd1b0365..6a4dd05908f 100644 --- a/gcc/rtl-ssa/blocks.cc +++ b/gcc/rtl-ssa/blocks.cc @@ -645,12 +645,13 @@ function_info::place_phis (build_info &bi) if (bitmap_empty_p (&frontiers[b1])) continue; - bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def; + auto_bitmap b1_def; + bitmap_and (b1_def, &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def, + DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1))); bitmap_iterator bmi; unsigned int b2; EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi) - if (bitmap_ior_and_into (&unfiltered[b2], b1_def, - DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2))) + if (bitmap_ior_into (&unfiltered[b2], b1_def) && !bitmap_empty_p (&frontiers[b2])) // Propagate the (potential) new phi node definitions in B2. bitmap_set_bit (worklist, b2); so for the testcase it seems we have a lot of local defined but not "global" used defs. Would that work for you or am I missing something? Thanks, Richard.