https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119482

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #16)
> r15-9175 on x86_64 now shows (with release checking, built with GCC 7, not
> bootstrapped):
> 
> Samples: 112K of event 'cycles:Pu', Event count (approx.): 139897585915     
> 
> Overhead       Samples  Command  Shared Object         Symbol               
> 
>    6.42%          7028  cc1plus  cc1plus               [.]
> bitmap_and_into(bitma
>    6.19%          6851  cc1plus  cc1plus               [.]
> bitmap_list_insert_el
>    3.96%          4408  cc1plus  cc1plus               [.]
> bitmap_ior_into(bitma
>    3.28%          3695  cc1plus  cc1plus               [.]
> bitmap_set_bit(bitmap
>    2.75%          3099  cc1plus  cc1plus               [.]
> get_ref_base_and_exte
>    2.31%          2532  cc1plus  cc1plus               [.]
> bitmap_and(bitmap_hea
>    1.85%          2052  cc1plus  cc1plus               [.]
> bitmap_ior_and_compl(
>    1.83%          2027  cc1plus  cc1plus               [.]
> bitmap_copy(bitmap_he
>    1.32%          1482  cc1plus  cc1plus               [.]
> bitmap_bit_p(bitmap_h
>    1.24%          1439  cc1plus  cc1plus               [.]
> bitmap_set_range(bitm
>    1.18%          1329  cc1plus  cc1plus               [.]
> lra_create_live_range
> 
> I'll note the -ftime-report is quite flat, DF is sticking out (every RTL
> pass hits on this) as well as LRA (for the same reason, plus
> lra_create_live_ranges).
> 
> The bitmap_list_insert_element_after hit is actually mispredictions on
> bitmap_element_allocate wrt the en-block queued free elements from
> bitmap_elt_clear_from:
> 
>       element = bit_obstack->elements;
> 
>       if (element)
>         /* Use up the inner list first before looking at the next
>            element of the outer list.  */
>         if (element->next)
>           {
>             bit_obstack->elements = element->next;
> 
> possibly having two free element lists, one for singletons and one for
> lists would be friendlier, thus
> 
>       if (bit_obstack->elements)
>         ...
>       else if (bit_obstack->multi_elements)
>         ...

And this one is likely hammered through df_rd_transfer_function when using
DF_RD_PRUNE_DEAD_DEFS (that saves memory on the RD problem, but costs) doing

  if (df->changeable_flags & DF_RD_PRUNE_DEAD_DEFS)
    {
      /* Create a mask of DEFs for all registers live at the end of this
         basic block, and mask out DEFs of registers that are not live.
         Computing the mask looks costly, but the benefit of the pruning
         outweighs the cost.  */
      class df_rd_bb_info *bb_info = df_rd_get_bb_info (bb_index);
      bitmap regs_live_out = &df_lr_get_bb_info (bb_index)->out;
      bitmap live_defs = BITMAP_ALLOC (&df_bitmap_obstack);
      unsigned int regno;
      bitmap_iterator bi;

      EXECUTE_IF_SET_IN_BITMAP (regs_live_out, 0, regno, bi)
        bitmap_set_range (live_defs,
                          DF_DEFS_BEGIN (regno),
                          DF_DEFS_COUNT (regno));
      changed |= bitmap_and_into (&bb_info->out, live_defs);
      BITMAP_FREE (live_defs);

this populates a temporary bitmap from a compressed other bitmap (also
not 1:1 ordered) and prunes the 'out' set with it.  When we need to
iterate (not sure if necessary here), the temporary bitmap will be
computed multiple times (it's invariant duing the RD compute, but
would eat quite some memory when kept for all BBs).  More efficient
construction of the temporary bitmap might help like creating a vector of
regs_live_out and sort it after DF_DEFS_BEGIN and create the bitmap
from that.  Or with a vector of sorted pairs (start, count/end) implement
bitmap_and_vec_into, eschewing the temporary bitmap.  Making
bitmap_set_range handle the tree view might be another option, built the
bitmap in tree view and then turn it back to list view for the AND.

Reply via email to