Richard Sandiford <richard.sandif...@arm.com> writes: > Really nice analysis! Thanks for writing this up. > > Sorry for the big quote below, but: > > Jan Hubicka <hubi...@ucw.cz> writes: >> [...] >> PR117081 is about regression in povray. The reducted testcase: >> >> void foo (void); >> void bar (void); >> >> int >> test (int a) >> { >> int r; >> >> if (r = -a) >> foo (); >> else >> bar (); >> >> return r; >> } >> >> shows that we now use caller saved register (EAX) to hold the return value >> which yields longer code. The costs are >> Popping a0(r98,l0) -- (0=13000,13000) (3=15000,15000) (6=15000,15000) >> (40=15000,15000) (41=15000,15000) (42=15000,15000) (43=15000,15000) >> >> here 15000 is 11000+4000 where I think 4000 is cost of 2 reg-reg moves >> multiplied by REG_FREQ_MAX. This seems correct. GCC 13 uses callee >> saved register and produces: >> >> 0000000000000000 <test>: >> 0: 53 push %rbx <--- callee save >> 1: 89 fb mov %edi,%ebx <--- move 1 >> 3: f7 db neg %ebx >> 5: 74 09 je 10 <test+0x10> >> 7: e8 00 00 00 00 call c <test+0xc> >> c: 89 d8 mov %ebx,%eax <--- callee >> restore >> e: 5b pop %rbx >> f: c3 ret >> 10: e8 00 00 00 00 call 15 <test+0x15> >> 15: 89 d8 mov %ebx,%eax <--- move 2 >> 17: 5b pop %rbx <--- callee >> restore >> 18: c3 ret >> >> Mainline used EAX since it has costs 13000. It is not 100% clear to me >> why. >> - 12000 is the spilling (which is emitted twice but executed just once) >> - I would have expected 2000 for the move from edi to eax. >> However even if cost is 14000 we will choose EAX. The code is: >> >> 0: 89 f8 mov %edi,%eax <--- move1 >> 2: 48 83 ec 18 sub $0x18,%rsp <--- stack >> frame creation >> 6: f7 d8 neg %eax >> 8: 89 44 24 0c mov %eax,0xc(%rsp) <--- spill out >> c: 85 ff test %edi,%edi >> e: 74 10 je 20 <test+0x20> >> 10: e8 00 00 00 00 call 15 <test+0x15> >> 15: 8b 44 24 0c mov 0xc(%rsp),%eax <--- spill in >> 19: 48 83 c4 18 add $0x18,%rsp <--- stack frame >> 1d: c3 ret >> 1e: 66 90 xchg %ax,%ax >> 20: e8 00 00 00 00 call 25 <test+0x25> >> 25: 8b 44 24 0c mov 0xc(%rsp),%eax <--- spill in >> 29: 48 83 c4 18 add $0x18,%rsp <--- stack frame >> 2d: c3 ret >> >> This sequence really saves one move at the expense of of stack frame >> allocation (which is not modelled by the cost model) and longer spill >> code (also no modelled). >> [...] >> PR117082 is about noreturn function: >> __attribute__ ((noreturn)) >> void >> f3 (void) >> { >> int y0 = x0; >> int y1 = x1; >> f1 (); >> f2 (y0, y1); >> while (1); >> } >> >> Here the cost model is really wrong by assuming that entry and exit >> block have same frequencies. This can be fixed quite easilly (though it >> is a rare case) > > Yeah (and nice example). > >> PR118497 seems to be ixed. >> >> So overall I think >> 1) we can fix scaling of epilogue by exit block frequency >> to get noreturns right. >> 2) we should drop the check for optimize_size. Since with -Os >> REG_FREQ_FROM_BB always returns 1000 everything should be scaled >> same way >> 3) we currently have wire in "-1" to biass the cost metric for callee >> saved registers. >> It may make sense to allow targets to control this, since i.e. x86 >> has push/pop that is shorter. -3 would solve the testcase with neg >> and would express that push/pop is still cheaper with extra reg-reg >> move. >> 4) cost model misses shring wrapping, the fact that if register is >> callee saved it may be used by multiple allocnos and also that >> push/pop sequence may avoid need for manual RSP adjustments. >> >> Those seems bit harder things to fit in though. >> >> So if we want to go with the target hook, I think it should adjust the >> cost before scalling (since targets may have special tricks for >> prologues) rather than the scale factor (which is target independent >> part of cost model). > > Like you say, one of the missing pieces appears to be the allocation/ > dealloaction overhead for callee saves. Could we try to add a hook to > model that cost, based on certain parameters? > > In particular, one thing that the examples above have in common is that > they don't need to allocate a frame for local variables. That seems > like it ought to be part of the mix. If we need to allocate a frame > using addition anyway, then presumably one of the advantages of push/pop > over callee saves goes away. > > But going back to my question above, modelling the allocation and > deallocation would need to be done in a way that makes them more > expensive than moves. I think in some cases we've assumed that > addition and moves have equal cost, even when optimising for speed. > > In other words, rather than add a hook to tweak the -1 bias (which I > still agree is a more reasonable thing to do than bypassing the IRA > code altogether), could we add a separate hook for the allocation/ > deallocation and leave IRA to work out when to apply it? > > I suppose for -fno-omit-frame-pointer we should also take the creation > of the frame link into account, if we don't already. This matters on > aarch64 since -fomit-frame-pointer is not the default at -O2. > > One quirk on aarch64 is that, if we're using an odd number of > callee-saved GPRs, adding one more is essentially free, since we would > allocate space for it anyway, and could save and restore it alongside > the odd one out. That would probably be difficult to apply in practice > though. And it's obviously a separate issue from the current one. > Just mentioning it as another thing we could model in future.
FWIW, here's a very rough initial version of the kind of thing I was thinking about. Hopefully the hook documentation describes the approach. It's deliberately (overly?) flexible. I've included an aarch64 version that (a) models the fact that the first caller-save can also allocate the frame more-or-less for free, and (b) once we've saved an odd number of GPRs, saving one more is essentialy free. I also hacked up an x86 version locally to model the allocation benefits of using caller-saved registers. It seemed to fix the povray example above. This still needs a lot of clean-up and testing, but I thought I might as well send what I have before leaving for the weekend. Does it look reasonable in principle? Thanks, Richard diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index c1e40200806..3f9453a74fb 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -15870,6 +15870,60 @@ aarch64_memory_move_cost (machine_mode mode, reg_class_t rclass_i, bool in) : base + aarch64_tune_params.memmov_cost.store_int); } +/* CALLEE_SAVED_REGS is the set of callee-saved registers that the + RA has already decided to use. Return the total number of GPRs + that need to be saved and restored, including the frame link registers. */ +static int +aarch64_count_gpr_saves (const HARD_REG_SET &callee_saved_regs) +{ + auto saved_gprs = callee_saved_regs & reg_class_contents[GENERAL_REGS]; + /* FIXME: popcount hack. */ + auto nregs = popcount_hwi (saved_gprs.elts[0]); + + if (aarch64_needs_frame_chain ()) + nregs += 2; + else if (!crtl->is_leaf || df_regs_ever_live_p (LR_REGNUM)) + nregs += 1; + return nregs; +} + +/* Implement TARGET_CALLEE_SAVE_COST. */ +static int +aarch64_callee_save_cost (spill_cost_type, unsigned int regno, + machine_mode, unsigned int nregs, int mem_cost, + const HARD_REG_SET &callee_saved_regs, + bool existing_spill_p) +{ + auto ngprs = aarch64_count_gpr_saves (callee_saved_regs); + + /* FIXME: handle FP and Advanced SIMD registers. */ + + /* If we've already committed to saving an odd number of GPRs, assume that + saving one more will involve turning an STR into an STP. */ + if (GP_REGNUM_P (regno) && nregs == 1 && (ngprs & 1)) + return 0; + + /* If this would be the first register that we save, add the cost of + allocating or deallocating the frame. For GPR saves, the allocation + and deallocation can be folded into the save and restore. */ + if (!existing_spill_p + && ngprs == 0 + && !GP_REGNUM_P (regno)) + mem_cost += 2; + + return mem_cost; +} + +/* Implement TARGET_FRAME_ALLOCATION_COST. */ +static int +aarch64_frame_allocation_cost (frame_cost_type, + const HARD_REG_SET &callee_saved_regs) +{ + if (aarch64_count_gpr_saves (callee_saved_regs) == 0) + return 2; + return 0; +} + /* Implement TARGET_INSN_COST. We have the opportunity to do something much more productive here, such as using insn attributes to cost things. But we don't, not yet. @@ -31572,6 +31626,12 @@ aarch64_libgcc_floating_mode_supported_p #undef TARGET_MEMORY_MOVE_COST #define TARGET_MEMORY_MOVE_COST aarch64_memory_move_cost +#undef TARGET_CALLEE_SAVE_COST +#define TARGET_CALLEE_SAVE_COST aarch64_callee_save_cost + +#undef TARGET_FRAME_ALLOCATION_COST +#define TARGET_FRAME_ALLOCATION_COST aarch64_frame_allocation_cost + #undef TARGET_MIN_DIVISIONS_FOR_RECIP_MUL #define TARGET_MIN_DIVISIONS_FOR_RECIP_MUL aarch64_min_divisions_for_recip_mul diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 0de24eda6f0..171bb8c0b77 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -7003,6 +7003,74 @@ value to the result of that function. The arguments to that function are the same as to this target hook. @end deftypefn +@deftypefn {Target Hook} int TARGET_CALLEE_SAVE_COST (spill_cost_type @var{cost_type}, unsigned int @var{hard_regno}, machine_mode @var{mode}, unsigned int @var{nregs}, int @var{mem_cost}, const HARD_REG_SET @var{&allocated_callee_regs}, bool @var{existing_spills_p}) +Return the one-off cost of saving or restoring callee-saved registers +(also known as call-preserved register or non-volatile registers). +The parameters are as follows: + +@itemize +@item +@var{cost_type} is @samp{spill_cost_type::SAVE} for saving a register +and @samp{spill_cost_type::RESTORE} for restoring a register. + +@item +@var{hard_regno} and @var{mode} represent the whole register that +the register allocator is considering using; of these, +@var{nregs} registers are fully or partially callee-saved. + +@item +@var{mem_cost} is the normal cost for storing (for saves) +or loading (for restores) the @var{nregs} registers. + +@item +@var{allocated_callee_regs} is the set of callee-saved registers +that are already in use. + +@item +@var{existing_spills_p} is true if the register allocator has +already decided to spill registers to memory. +@end itemize + +If @var{existing_spills_p} is false, the cost of a save should account +for frame allocations in a way that is consistent with +@code{TARGET_FRAME_ALLOCATION_COST}'s handling of allocations for +caller-saved registers. Similarly, the cost of a restore should then +account for frame deallocations in a way that is consistent with +@code{TARGET_FRAME_ALLOCATION_COST}'s handling of deallocations. + +Note that this hook should not attempt to apply a frequency scale +to the cost: it is the caller's responsibility to do that where +appropriate. + +The default implementation returns @var{mem_cost}, plus the allocation +or deallocation cost returned by @code{TARGET_FRAME_ALLOCATION_COST}, +where appropriate. +@end deftypefn + +@deftypefn {Target Hook} int TARGET_FRAME_ALLOCATION_COST (frame_cost_type @var{cost_type}, const HARD_REG_SET @var{&allocated_callee_regs}) +Return the cost of allocating or deallocating a frame for the sake of +using caller-saved registers; @var{cost_type} chooses between allocation +and deallocation. + +This hook is only called if the register allocator has not so far +decided to spill any pseudo registers to memory or to use caller-saved +registers for values that are live across a call. It may have +decided to use callee-saved registers; if so, @var{allocated_callee_regs} +is the set of callee-saved registers it has used. There might also be +other reasons why a stack frame is already needed; for example, +@samp{get_frame_size ()} may be nonzero, or the target might already +require a frame for target-specific reasons. + +When the register allocator uses this hook to cost caller-saved registers, +it also uses @code{TARGET_CALLEE_SAVE_COST} to cost new callee-saved +registers, passing @samp{false} as the @var{existing_spills_p} argument. +The intention is to allow the target to apply an apples-for-apples +comparison between callee-saved and caller-saved registers when the +allocator has not yet committed to using registers of both types. + +The default implementation returns 0. +@end deftypefn + @defmac BRANCH_COST (@var{speed_p}, @var{predictable_p}) A C expression for the cost of a branch instruction. A value of 1 is the default; other values are interpreted relative to that. Parameter diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 631d04131e3..eccc4d88493 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4582,6 +4582,10 @@ These macros are obsolete, new ports should use the target hook @hook TARGET_MEMORY_MOVE_COST +@hook TARGET_CALLEE_SAVE_COST + +@hook TARGET_FRAME_ALLOCATION_COST + @defmac BRANCH_COST (@var{speed_p}, @var{predictable_p}) A C expression for the cost of a branch instruction. A value of 1 is the default; other values are interpreted relative to that. Parameter diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc index 0699b349a1a..9aaac794f35 100644 --- a/gcc/ira-color.cc +++ b/gcc/ira-color.cc @@ -1199,6 +1199,9 @@ finish_update_cost_records (void) register was already allocated for an allocno. */ static bool allocated_hardreg_p[FIRST_PSEUDO_REGISTER]; +/* Which callee-saved hard registers we've decided to save. */ +static HARD_REG_SET allocated_callee_save_regs; + /* Describes one element in a queue of allocnos whose costs need to be updated. Each allocno in the queue is known to have an allocno class. */ @@ -1740,6 +1743,20 @@ check_hard_reg_p (ira_allocno_t a, int hard_regno, return j == nregs; } +/* Record that we have allocated NREGS registers starting at HARD_REGNO. */ + +static void +record_allocation (int hard_regno, int nregs) +{ + for (int i = 0; i < nregs; ++i) + if (!allocated_hardreg_p[hard_regno + i]) + { + allocated_hardreg_p[hard_regno + i] = true; + if (!crtl->abi->clobbers_full_reg_p (hard_regno + i)) + SET_HARD_REG_BIT (allocated_callee_save_regs, hard_regno + i); + } +} + /* Return number of registers needed to be saved and restored at function prologue/epilogue if we allocate HARD_REGNO to hold value of MODE. */ @@ -1961,6 +1978,14 @@ assign_hard_reg (ira_allocno_t a, bool retry_p) #endif auto_bitmap allocnos_to_spill; HARD_REG_SET soft_conflict_regs = {}; + int entry_freq = REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)); + int exit_freq = REG_FREQ_FROM_BB (EXIT_BLOCK_PTR_FOR_FN (cfun)); + /* Whether we have spilled pseudos or used caller-saved registers + for values that are live across a call. + + FIXME: also check for pseudos that prefer NO_REGS and not equivalences. + Also account for regular spills. */ + bool existing_spills_p = caller_save_needed; ira_assert (! ALLOCNO_ASSIGNED_P (a)); get_conflict_and_start_profitable_regs (a, retry_p, @@ -2175,17 +2200,45 @@ assign_hard_reg (ira_allocno_t a, bool retry_p) /* We need to save/restore the hard register in epilogue/prologue. Therefore we increase the cost. */ { + int nregs = hard_regno_nregs (hard_regno, mode); + add_cost = 0; rclass = REGNO_REG_CLASS (hard_regno); - add_cost = ((ira_memory_move_cost[mode][rclass][0] - + ira_memory_move_cost[mode][rclass][1]) - * saved_nregs / hard_regno_nregs (hard_regno, - mode) - 1) - * (optimize_size ? 1 : - REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun))); + + auto entry_cost = targetm.callee_save_cost + (spill_cost_type::SAVE, hard_regno, mode, saved_nregs, + ira_memory_move_cost[mode][rclass][0] * saved_nregs / nregs, + allocated_callee_save_regs, existing_spills_p); + add_cost += entry_cost * entry_freq; + + auto exit_cost = targetm.callee_save_cost + (spill_cost_type::RESTORE, hard_regno, mode, saved_nregs, + ira_memory_move_cost[mode][rclass][1] * saved_nregs / nregs, + allocated_callee_save_regs, existing_spills_p); + add_cost += exit_cost * exit_freq; + + /* In the event of a tie between caller-save and callee-save, + prefer callee-save. */ + add_cost -= 1; + cost += add_cost; full_cost += add_cost; } } + if (!existing_spills_p && ira_need_caller_save_p (a, hard_regno)) + { + add_cost = 0; + + auto entry_cost = targetm.frame_allocation_cost + (frame_cost_type::ALLOCATION, allocated_callee_save_regs); + add_cost += entry_cost * entry_freq; + + auto exit_cost = targetm.frame_allocation_cost + (frame_cost_type::DEALLOCATION, allocated_callee_save_regs); + add_cost += exit_cost * exit_freq; + + cost += add_cost; + full_cost += add_cost; + } if (min_cost > cost) min_cost = cost; if (min_full_cost > full_cost) @@ -2212,8 +2265,8 @@ assign_hard_reg (ira_allocno_t a, bool retry_p) fail: if (best_hard_regno >= 0) { - for (i = hard_regno_nregs (best_hard_regno, mode) - 1; i >= 0; i--) - allocated_hardreg_p[best_hard_regno + i] = true; + record_allocation (best_hard_regno, + hard_regno_nregs (best_hard_regno, mode)); spill_soft_conflicts (a, allocnos_to_spill, soft_conflict_regs, best_hard_regno); } @@ -3369,8 +3422,7 @@ improve_allocation (void) /* Assign the best chosen hard register to A. */ ALLOCNO_HARD_REGNO (a) = best; - for (j = nregs - 1; j >= 0; j--) - allocated_hardreg_p[best + j] = true; + record_allocation (best, nregs); if (internal_flag_ira_verbose > 2 && ira_dump_file != NULL) fprintf (ira_dump_file, "Assigning %d to a%dr%d\n", diff --git a/gcc/target.def b/gcc/target.def index 4050b2ebdd4..fdd80c06c77 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -3775,6 +3775,80 @@ are the same as to this target hook.", int, (machine_mode mode, reg_class_t rclass, bool in), default_memory_move_cost) +DEFHOOK +(callee_save_cost, + "Return the one-off cost of saving or restoring callee-saved registers\n\ +(also known as call-preserved register or non-volatile registers).\n\ +The parameters are as follows:\n\ +\n\ +@itemize\n\ +@item\n\ +@var{cost_type} is @samp{spill_cost_type::SAVE} for saving a register\n\ +and @samp{spill_cost_type::RESTORE} for restoring a register.\n\ +\n\ +@item\n\ +@var{hard_regno} and @var{mode} represent the whole register that\n\ +the register allocator is considering using; of these,\n\ +@var{nregs} registers are fully or partially callee-saved.\n\ +\n\ +@item\n\ +@var{mem_cost} is the normal cost for storing (for saves)\n\ +or loading (for restores) the @var{nregs} registers.\n\ +\n\ +@item\n\ +@var{allocated_callee_regs} is the set of callee-saved registers\n\ +that are already in use.\n\ +\n\ +@item\n\ +@var{existing_spills_p} is true if the register allocator has\n\ +already decided to spill registers to memory.\n\ +@end itemize\n\ +\n\ +If @var{existing_spills_p} is false, the cost of a save should account\n\ +for frame allocations in a way that is consistent with\n\ +@code{TARGET_FRAME_ALLOCATION_COST}'s handling of allocations for\n\ +caller-saved registers. Similarly, the cost of a restore should then\n\ +account for frame deallocations in a way that is consistent with\n\ +@code{TARGET_FRAME_ALLOCATION_COST}'s handling of deallocations.\n\ +\n\ +Note that this hook should not attempt to apply a frequency scale\n\ +to the cost: it is the caller's responsibility to do that where\n\ +appropriate.\n\ +\n\ +The default implementation returns @var{mem_cost}, plus the allocation\n\ +or deallocation cost returned by @code{TARGET_FRAME_ALLOCATION_COST},\n\ +where appropriate.", + int, (spill_cost_type cost_type, unsigned int hard_regno, + machine_mode mode, unsigned int nregs, int mem_cost, + const HARD_REG_SET &allocated_callee_regs, bool existing_spills_p), + default_callee_save_cost) + +DEFHOOK +(frame_allocation_cost, + "Return the cost of allocating or deallocating a frame for the sake of\n\ +using caller-saved registers; @var{cost_type} chooses between allocation\n\ +and deallocation.\n\ +\n\ +This hook is only called if the register allocator has not so far\n\ +decided to spill any pseudo registers to memory or to use caller-saved\n\ +registers for values that are live across a call. It may have\n\ +decided to use callee-saved registers; if so, @var{allocated_callee_regs}\n\ +is the set of callee-saved registers it has used. There might also be\n\ +other reasons why a stack frame is already needed; for example,\n\ +@samp{get_frame_size ()} may be nonzero, or the target might already\n\ +require a frame for target-specific reasons.\n\ +\n\ +When the register allocator uses this hook to cost caller-saved registers,\n\ +it also uses @code{TARGET_CALLEE_SAVE_COST} to cost new callee-saved\n\ +registers, passing @samp{false} as the @var{existing_spills_p} argument.\n\ +The intention is to allow the target to apply an apples-for-apples\n\ +comparison between callee-saved and caller-saved registers when the\n\ +allocator has not yet committed to using registers of both types.\n\ +\n\ +The default implementation returns 0.", + int, (frame_cost_type cost_type, const HARD_REG_SET &allocated_callee_regs), + default_frame_allocation_cost) + DEFHOOK (use_by_pieces_infrastructure_p, "GCC will attempt several strategies when asked to copy between\n\ diff --git a/gcc/target.h b/gcc/target.h index 3e1ee68a341..2bf35e2d0ee 100644 --- a/gcc/target.h +++ b/gcc/target.h @@ -284,6 +284,18 @@ enum poly_value_estimate_kind POLY_VALUE_LIKELY }; +enum class spill_cost_type +{ + SAVE, + RESTORE +}; + +enum class frame_cost_type +{ + ALLOCATION, + DEALLOCATION +}; + typedef void (*emit_support_tinfos_callback) (tree); extern bool verify_type_context (location_t, type_context_kind, const_tree, diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc index f80dc8b2e7e..a8331ba5576 100644 --- a/gcc/targhooks.cc +++ b/gcc/targhooks.cc @@ -2075,6 +2075,29 @@ default_register_move_cost (machine_mode mode ATTRIBUTE_UNUSED, #endif } +int +default_callee_save_cost (spill_cost_type spill_type, unsigned int, + machine_mode, unsigned int, int mem_cost, + const HARD_REG_SET &callee_saved_regs, + bool existing_spills_p) +{ + if (!existing_spills_p) + { + auto frame_type = (spill_type == spill_cost_type::SAVE + ? frame_cost_type::ALLOCATION + : frame_cost_type::DEALLOCATION); + mem_cost += targetm.frame_allocation_cost (frame_type, + callee_saved_regs); + } + return mem_cost; +} + +int +default_frame_allocation_cost (frame_cost_type, const HARD_REG_SET &) +{ + return 0; +} + /* The default implementation of TARGET_SLOW_UNALIGNED_ACCESS. */ bool diff --git a/gcc/targhooks.h b/gcc/targhooks.h index 7d15f632c23..0c4c27e257e 100644 --- a/gcc/targhooks.h +++ b/gcc/targhooks.h @@ -234,6 +234,11 @@ extern tree default_builtin_tm_load_store (tree); extern int default_memory_move_cost (machine_mode, reg_class_t, bool); extern int default_register_move_cost (machine_mode, reg_class_t, reg_class_t); +extern int default_callee_save_cost (spill_cost_type, unsigned int, + machine_mode, unsigned int, int, + const HARD_REG_SET &, bool); +extern int default_frame_allocation_cost (frame_cost_type, + const HARD_REG_SET &); extern bool default_slow_unaligned_access (machine_mode, unsigned int); extern HOST_WIDE_INT default_estimated_poly_value (poly_int64, poly_value_estimate_kind); -- 2.25.1