On Sat, Sep 29, 2012 at 10:26 PM, Steven Bosscher <stevenb....@gmail.com> wrote: > Hi Vlad, > > Thanks for the testing and the logs. You must have good hardware, your > timings are all ~3 times faster than mine :-) > > On Sat, Sep 29, 2012 at 3:01 AM, Vladimir Makarov <vmaka...@redhat.com> wrote: >> ----------------------------------32-bit------------------------------------ >> Reload: >> 581.85user 29.91system 27:15.18elapsed 37%CPU (0avgtext+0avgdata >> LRA: >> 629.67user 24.16system 24:31.08elapsed 44%CPU (0avgtext+0avgdata > > This is a ~8% slowdown. > > >> ----------------------------------64-bit:----------------------------------- >> Reload: >> 503.26user 36.54system 30:16.62elapsed 29%CPU (0avgtext+0avgdata >> LRA: >> 598.70user 30.90system 27:26.92elapsed 38%CPU (0avgtext+0avgdata > > This is a ~19% slowdown
I think both measurements run into swap (low CPU utilization), from the LRA numbers I'd say that LRA uses less memory but the timings are somewhat useless with the swapping. >> Here is the numbers for PR54146 on the same machine with -O1 only for >> 64-bit (compiler reports error for -m32). > > Right, the test case is for 64-bits only, I think it's preprocessed > code for AMD64. > >> Reload: >> 350.40user 21.59system 17:09.75elapsed 36%CPU (0avgtext+0avgdata >> LRA: >> 468.29user 21.35system 15:47.76elapsed 51%CPU (0avgtext+0avgdata > > This is a ~34% slowdown. > > To put it in another perspective, here are my timings of trunk vs lra > (both checkouts done today): > > trunk: > integrated RA : 181.68 (24%) usr 1.68 (11%) sys 183.43 > (24%) wall 643564 kB (20%) ggc > reload : 11.00 ( 1%) usr 0.18 ( 1%) sys 11.17 ( > 1%) wall 32394 kB ( 1%) ggc > TOTAL : 741.64 14.76 756.41 > 3216164 kB > > lra branch: > integrated RA : 174.65 (16%) usr 1.33 ( 8%) sys 176.33 > (16%) wall 643560 kB (20%) ggc > reload : 399.69 (36%) usr 2.48 (15%) sys 402.69 > (36%) wall 41852 kB ( 1%) ggc > TOTAL :1102.06 16.05 1120.83 > 3231738 kB > > That's a 49% slowdown. The difference is completely accounted for by > the timing difference between reload and LRA. > (Timings done on gcc17, which is AMD Opteron(tm) Processor 8354 with > 15GB ram, so swapping is no issue.) > > It looks like the reload timevar is used for LRA. Why not have > multiple timevars, one per phase of LRA? Sth like the patch below > would be nice. This gives me the following timings: > > integrated RA : 189.34 (16%) usr 1.84 (11%) sys 191.18 > (16%) wall 643560 kB (20%) ggc > LRA non-specific : 59.82 ( 5%) usr 0.22 ( 1%) sys 60.12 ( > 5%) wall 18202 kB ( 1%) ggc > LRA virtuals eliminatenon: 56.79 ( 5%) usr 0.03 ( 0%) sys 56.80 ( > 5%) wall 19223 kB ( 1%) ggc > LRA reload inheritance : 6.41 ( 1%) usr 0.01 ( 0%) sys 6.42 ( > 1%) wall 1665 kB ( 0%) ggc > LRA create live ranges : 175.30 (15%) usr 2.14 (13%) sys 177.44 > (15%) wall 2761 kB ( 0%) ggc > LRA hard reg assignment : 130.85 (11%) usr 0.20 ( 1%) sys 131.17 > (11%) wall 0 kB ( 0%) ggc > LRA coalesce pseudo regs: 2.54 ( 0%) usr 0.00 ( 0%) sys 2.55 ( > 0%) wall 0 kB ( 0%) ggc > reload : 6.73 ( 1%) usr 0.20 ( 1%) sys 6.92 ( > 1%) wall 0 kB ( 0%) ggc > > so the LRA "slowness" (for lack of a better word) appears to be due to > scalability problems in all sub-passes. It would be nice to see if LRA just has a larger constant cost factor compared to reload or if it has bigger complexity. > The code size changes are impressive, but I think that this kind of > slowdown should be addressed before making LRA the default for any > target. Certainly if it shows bigger complexity, not sure for the constant factor (but for sure improvements are welcome). I suppose there is the option to revert back to reload by default for x86_64 as well for 4.8, right? That is, do both reload and LRA co-exist for each target or is it a definite decision target by target? Thanks, Richard. > Ciao! > Steven > > > > > Index: lra-assigns.c > =================================================================== > --- lra-assigns.c (revision 191858) > +++ lra-assigns.c (working copy) > @@ -1261,6 +1261,8 @@ lra_assign (void) > bitmap_head insns_to_process; > bool no_spills_p; > > + timevar_push (TV_LRA_ASSIGN); > + > init_lives (); > sorted_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ()); > sorted_reload_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ()); > @@ -1312,5 +1314,6 @@ lra_assign (void) > free (sorted_pseudos); > free (sorted_reload_pseudos); > finish_lives (); > + timevar_pop (TV_LRA_ASSIGN); > return no_spills_p; > } > Index: lra.c > =================================================================== > --- lra.c (revision 191858) > +++ lra.c (working copy) > @@ -2193,6 +2193,7 @@ lra (FILE *f) > > lra_dump_file = f; > > + timevar_push (TV_LRA); > > init_insn_recog_data (); > > @@ -2271,6 +2272,7 @@ lra (FILE *f) > to use a constant pool. */ > lra_eliminate (false); > lra_inheritance (); > + > /* We need live ranges for lra_assign -- so build them. */ > lra_create_live_ranges (true); > live_p = true; > @@ -2343,6 +2345,8 @@ lra (FILE *f) > #ifdef ENABLE_CHECKING > check_rtl (true); > #endif > + > + timevar_pop (TV_LRA); > } > > /* Called once per compiler to initialize LRA data once. */ > Index: lra-eliminations.c > =================================================================== > --- lra-eliminations.c (revision 191858) > +++ lra-eliminations.c (working copy) > @@ -1297,6 +1297,8 @@ lra_eliminate (bool final_p) > struct elim_table *ep; > int regs_num = max_reg_num (); > > + timevar_push (TV_LRA_ELIMINATE); > + > bitmap_initialize (&insns_with_changed_offsets, ®_obstack); > if (final_p) > { > @@ -1317,7 +1319,7 @@ lra_eliminate (bool final_p) > { > update_reg_eliminate (&insns_with_changed_offsets); > if (bitmap_empty_p (&insns_with_changed_offsets)) > - return; > + goto lra_eliminate_done; > } > if (lra_dump_file != NULL) > { > @@ -1349,4 +1351,7 @@ lra_eliminate (bool final_p) > process_insn_for_elimination (insn, final_p); > } > bitmap_clear (&insns_with_changed_offsets); > + > +lra_eliminate_done: > + timevar_pop (TV_LRA_ELIMINATE); > } > Index: lra-lives.c > =================================================================== > --- lra-lives.c (revision 191858) > +++ lra-lives.c (working copy) > @@ -962,6 +962,8 @@ lra_create_live_ranges (bool all_p) > basic_block bb; > int i, hard_regno, max_regno = max_reg_num (); > > + timevar_push (TV_LRA_CREATE_LIVE_RANGES); > + > complete_info_p = all_p; > if (lra_dump_file != NULL) > fprintf (lra_dump_file, > @@ -1016,6 +1018,7 @@ lra_create_live_ranges (bool all_p) > sparseset_free (pseudos_live_through_setjumps); > sparseset_free (pseudos_live); > compress_live_ranges (); > + timevar_pop (TV_LRA_CREATE_LIVE_RANGES); > } > > /* Finish all live ranges. */ > Index: timevar.def > =================================================================== > --- timevar.def (revision 191858) > +++ timevar.def (working copy) > @@ -223,10 +223,16 @@ DEFTIMEVAR (TV_REGMOVE , " > DEFTIMEVAR (TV_MODE_SWITCH , "mode switching") > DEFTIMEVAR (TV_SMS , "sms modulo scheduling") > DEFTIMEVAR (TV_SCHED , "scheduling") > -DEFTIMEVAR (TV_IRA , "integrated RA") > -DEFTIMEVAR (TV_RELOAD , "reload") > +DEFTIMEVAR (TV_IRA , "integrated RA") > +DEFTIMEVAR (TV_LRA , "LRA non-specific") > +DEFTIMEVAR (TV_LRA_ELIMINATE , "LRA virtuals eliminatenon") > +DEFTIMEVAR (TV_LRA_INHERITANCE , "LRA reload inheritance") > +DEFTIMEVAR (TV_LRA_CREATE_LIVE_RANGES, "LRA create live ranges") > +DEFTIMEVAR (TV_LRA_ASSIGN , "LRA hard reg assignment") > +DEFTIMEVAR (TV_LRA_COALESCE , "LRA coalesce pseudo regs") > +DEFTIMEVAR (TV_RELOAD , "reload") > DEFTIMEVAR (TV_RELOAD_CSE_REGS , "reload CSE regs") > -DEFTIMEVAR (TV_GCSE_AFTER_RELOAD , "load CSE after reload") > +DEFTIMEVAR (TV_GCSE_AFTER_RELOAD , "load CSE after reload") > DEFTIMEVAR (TV_REE , "ree") > DEFTIMEVAR (TV_THREAD_PROLOGUE_AND_EPILOGUE, "thread pro- & epilogue") > DEFTIMEVAR (TV_IFCVT2 , "if-conversion 2") > Index: lra-coalesce.c > =================================================================== > --- lra-coalesce.c (revision 191858) > +++ lra-coalesce.c (working copy) > @@ -221,6 +221,8 @@ lra_coalesce (void) > bitmap_head involved_insns_bitmap, split_origin_bitmap; > bitmap_iterator bi; > > + timevar_push (TV_LRA_COALESCE); > + > if (lra_dump_file != NULL) > fprintf (lra_dump_file, > "\n********** Pseudos coalescing #%d: **********\n\n", > @@ -371,5 +373,6 @@ lra_coalesce (void) > free (sorted_moves); > free (next_coalesced_pseudo); > free (first_coalesced_pseudo); > + timevar_pop (TV_LRA_COALESCE); > return coalesced_moves != 0; > } > Index: lra-constraints.c > =================================================================== > --- lra-constraints.c (revision 191858) > +++ lra-constraints.c (working copy) > @@ -4859,6 +4859,8 @@ lra_inheritance (void) > basic_block bb, start_bb; > edge e; > > + timevar_push (TV_LRA_INHERITANCE); > + > lra_inheritance_iter++; > if (lra_dump_file != NULL) > fprintf (lra_dump_file, "\n********** Inheritance #%d: **********\n\n", > @@ -4907,6 +4909,8 @@ lra_inheritance (void) > bitmap_clear (&live_regs); > bitmap_clear (&check_only_regs); > free (usage_insns); > + > + timevar_pop (TV_LRA_INHERITANCE); > } > > ^L