https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 3 Nov 2021, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
>
> Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |amacleod at redhat dot com
>
> --- Comment #10 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> I tried all sorts of knobs limiting the behavior for large BBs (one function
> has over 20k blocks), a large number of imports (dependencies on the final
> conditional), and even the max number of blocks to look back.  None of them
> made a difference.
>
> Then I realized that this PR was originally reported against the hybrid VRP
> threader, which used a different path discovery engine altogether (the old
> forward threader).  So the problem can't be in the backward threader's path
> discovery bits, but in something the solver is doing.
>
> I timed all the threaders using the solver by functionality (simple versus
> fully resolving mode):
>
>  backwards simple  :  4.85 (  2%)  0.00 (  0%)  4.84 (  2%)   932k (  0%)
>  backwards full    : 54.60 ( 17%)  0.01 (  1%) 54.70 ( 17%)   664k (  0%)
>
> This confirms my hypothesis that it's not the backward threader discovery
> bits, since the above two entries use the same engine.  So clearly, it's
> something the fully resolving threader does that was common with the hybrid
> threader, i.e. our use of the ranger.
>
> A callgrind session shows that the majority of the back threader's time is
> being spent in:
>
>   path_range_query::range_on_path_entry (irange &r, tree name)
>
> ...which is understandable, because when we can't resolve an SSA name within
> the path, we ask the ranger what the range on entry to the path is.
>
> Curiously though, most of the time is spent in propagate_cache, especially
> add_to_update, which accounts for 37.5% of the threader's time:
>
> -  if (!bitmap_bit_p (m_propfail, bb->index) && !m_update_list.contains (bb))
> -    m_update_list.quick_push (bb);
>
> This is a large CFG, so a linear search for a BB is bound to be slow.

Indeed, vec should never have gotten ::contains () ...

I'd have used a regular bitmap, not sbitmap, because we do

  bb = m_update_list.pop ();

and bitmap_first_set_bit is O(n) for an sbitmap but O(1) for a bitmap.

> Just replacing it with an sbitmap knocks off a good 12 seconds:
>
>  backwards jump threading  : 48.40 ( 28%)  0.02 (  1%) 48.57 ( 27%)  1597k (  0%)
>  backwards jump threading  : 32.96 ( 22%)  0.09 (  4%) 33.12 ( 22%)  1499k (  0%)
>
> Not ideal, but a good improvement IMO.
>
> I'll post my proposed patch, but I suspect Andrew may have other tricks up
> his sleeve.
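For readers following the discussion: below is a minimal standalone sketch of
the pattern under discussion -- pairing the update worklist with a membership
flag array so that both the "already queued?" test and the pop are O(1),
instead of the linear vec::contains scan quoted above.  It deliberately uses
plain std::vector rather than GCC's bitmap/sbitmap machinery, and the names
(bb_worklist, queued, ...) are invented for illustration; this is not the
proposed patch.

  // Standalone illustration only, not GCC code.
  #include <cstdio>
  #include <vector>

  struct bb_worklist
  {
    explicit bb_worklist (unsigned n_blocks)
      : queued (n_blocks, false) {}

    // Queue BB unless it is already pending: O(1), no linear scan.
    void add (unsigned bb)
    {
      if (!queued[bb])
        {
          queued[bb] = true;
          worklist.push_back (bb);
        }
    }

    bool empty () const { return worklist.empty (); }

    // Pop one pending block, also O(1).
    unsigned pop ()
    {
      unsigned bb = worklist.back ();
      worklist.pop_back ();
      queued[bb] = false;
      return bb;
    }

  private:
    std::vector<unsigned> worklist;  // pending block indices, LIFO order
    std::vector<bool> queued;        // queued[bb] <=> bb is in worklist
  };

  int
  main ()
  {
    bb_worklist wl (8);
    wl.add (3);
    wl.add (5);
    wl.add (3);   // duplicate, rejected in O(1)
    while (!wl.empty ())
      std::printf ("processing bb %u\n", wl.pop ());
    return 0;
  }

Inside GCC itself the same idea would presumably be expressed with the
existing bitmap types; per the comment above, a regular bitmap is preferable
to an sbitmap for the pop side, since bitmap_first_set_bit is O(n) on an
sbitmap but O(1) on a bitmap.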