[Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r

aldyh at gcc dot gnu.org via Gcc-bugs Wed, 03 Nov 2021 06:15:45 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943


Aldy Hernandez <aldyh at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amacleod at redhat dot com

--- Comment #10 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
I tried all sorts of knobs limiting the behavior for large BBs (one function
has over 20k blocks), a large number of imports (dependencies on the final
conditional), and even the max number of blocks to look back.  None of them
made a difference.

Then I realized that this PR was originally reported against the hybrid VRP
threader, which used a different path discovery engine altogether (the old
forward threader).  So, the problem can't be in the backward threader path
discovery bits, but in something the solver is doing.

I timed all the threaders using the solver by functionality (simple versus
fully resolving mode):

backwards simple                   :   4.85 (  2%)   0.00 (  0%)   4.84 (  2%) 
 932k (  0%)
backwards full                     :  54.60 ( 17%)   0.01 (  1%)  54.70 ( 17%) 
 664k (  0%)

This confirms my hypothesis that it's not the backward threader discovery bits,
since the above two entries use the same engine.  So clearly, it's something
that the fully resolving threader does that was common with the hybrid
threader, i.e. our use of the ranger.

A callgrind session shows that the majority of the back threader's time is
being spent in:

  path_range_query::range_on_path_entry (irange &r, tree name)

...which is understandable, because when we can't resolve an SSA within the
path, we ask the ranger what the range on entry to the path is.

Curiously though, most of the time is spent in propagate_cache, especially
add_to_update, which is accounting for 37.5% of the threader's time:

-  if (!bitmap_bit_p (m_propfail, bb->index) &&  !m_update_list.contains (bb))
-    m_update_list.quick_push (bb);

This is a large CFG, so a linear search of a BB, is bound to be slow.  Just
replacing it with an sbitmap knocks a good 12 seconds:

 backwards jump threading           :  48.40 ( 28%)   0.02 (  1%)  48.57 ( 27%)
 1597k (  0%)
 backwards jump threading           :  32.96 ( 22%)   0.09 (  4%)  33.12 ( 22%)
 1499k (  0%)

Not ideal, but a good improvement IMO.

I'll post my proposed patch, but I suspect Andrew may have other tricks up his
sleeve.

[Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r

Reply via email to