https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119482
--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> --- (In reply to Richard Biener from comment #16) > r15-9175 on x86_64 now shows (with release checking, built with GCC 7, not > bootstrapped): > > Samples: 112K of event 'cycles:Pu', Event count (approx.): 139897585915 > > Overhead Samples Command Shared Object Symbol > > 6.42% 7028 cc1plus cc1plus [.] > bitmap_and_into(bitma > 6.19% 6851 cc1plus cc1plus [.] > bitmap_list_insert_el > 3.96% 4408 cc1plus cc1plus [.] > bitmap_ior_into(bitma > 3.28% 3695 cc1plus cc1plus [.] > bitmap_set_bit(bitmap > 2.75% 3099 cc1plus cc1plus [.] > get_ref_base_and_exte > 2.31% 2532 cc1plus cc1plus [.] > bitmap_and(bitmap_hea > 1.85% 2052 cc1plus cc1plus [.] > bitmap_ior_and_compl( > 1.83% 2027 cc1plus cc1plus [.] > bitmap_copy(bitmap_he > 1.32% 1482 cc1plus cc1plus [.] > bitmap_bit_p(bitmap_h > 1.24% 1439 cc1plus cc1plus [.] > bitmap_set_range(bitm > 1.18% 1329 cc1plus cc1plus [.] > lra_create_live_range > > I'll note the -ftime-report is quite flat, DF is sticking out (every RTL > pass hits on this) as well as LRA (for the same reason, plus > lra_create_live_ranges). > > The bitmap_list_insert_element_after hit is actually mispredictions on > bitmap_element_allocate wrt the en-block queued free elements from > bitmap_elt_clear_from: > > element = bit_obstack->elements; > > if (element) > /* Use up the inner list first before looking at the next > element of the outer list. */ > if (element->next) > { > bit_obstack->elements = element->next; > > possibly having two free element lists, one for singletons and one for > lists would be friendlier, thus > > if (bit_obstack->elements) > ... > else if (bit_obstack->multi_elements) > ... And this one is likely hammered through df_rd_transfer_function when using DF_RD_PRUNE_DEAD_DEFS (that saves memory on the RD problem, but costs) doing if (df->changeable_flags & DF_RD_PRUNE_DEAD_DEFS) { /* Create a mask of DEFs for all registers live at the end of this basic block, and mask out DEFs of registers that are not live. Computing the mask looks costly, but the benefit of the pruning outweighs the cost. */ class df_rd_bb_info *bb_info = df_rd_get_bb_info (bb_index); bitmap regs_live_out = &df_lr_get_bb_info (bb_index)->out; bitmap live_defs = BITMAP_ALLOC (&df_bitmap_obstack); unsigned int regno; bitmap_iterator bi; EXECUTE_IF_SET_IN_BITMAP (regs_live_out, 0, regno, bi) bitmap_set_range (live_defs, DF_DEFS_BEGIN (regno), DF_DEFS_COUNT (regno)); changed |= bitmap_and_into (&bb_info->out, live_defs); BITMAP_FREE (live_defs); this populates a temporary bitmap from a compressed other bitmap (also not 1:1 ordered) and prunes the 'out' set with it. When we need to iterate (not sure if necessary here), the temporary bitmap will be computed multiple times (it's invariant duing the RD compute, but would eat quite some memory when kept for all BBs). More efficient construction of the temporary bitmap might help like creating a vector of regs_live_out and sort it after DF_DEFS_BEGIN and create the bitmap from that. Or with a vector of sorted pairs (start, count/end) implement bitmap_and_vec_into, eschewing the temporary bitmap. Making bitmap_set_range handle the tree view might be another option, built the bitmap in tree view and then turn it back to list view for the AND.