On Fri, Mar 7, 2025 at 11:30 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > On Tue, Mar 4, 2025 at 6:18 PM Richard Sandiford > <richard.sandif...@arm.com> wrote: > > > > Richard Sandiford <richard.sandif...@arm.com> writes: > > > Jan Hubicka <hubi...@ucw.cz> writes: > > >>> > > >>> Thanks for running these. I saw poor results for perlbench with my > > >>> initial aarch64 hooks because the hooks reduced the cost to zero for > > >>> the entry case: > > >>> > > >>> auto entry_cost = targetm.callee_save_cost > > >>> (spill_cost_type::SAVE, hard_regno, mode, saved_nregs, > > >>> ira_memory_move_cost[mode][rclass][0] * saved_nregs / nregs, > > >>> allocated_callee_save_regs, existing_spills_p); > > >>> /* In the event of a tie between caller-save and callee-save, > > >>> prefer callee-save. We apply this to the entry cost rather > > >>> than the exit cost since the entry frequency must be at > > >>> least as high as the exit frequency. */ > > >>> if (entry_cost > 0) > > >>> entry_cost -= 1; > > >>> > > >>> I "fixed" that by bumping the cost to a minimum of 2, but I was > > >>> wondering whether the "entry_cost > 0" should instead be "entry_cost > > > >>> 1", > > >>> so that the cost is always greater than not using a callee save for > > >>> registers that don't cross a call. WDYT? > > >> > > >> For x86 perfomance costs, the push cost should be memory_move_cost which > > >> is 6, -2 for adjustment in the target hook and -1 for this. So cost > > >> should not be 0 I think. > > >> > > >> For size cost, I currently return 1, so we indeed get 0 after > > >> adjustment. > > >> > > >> I think cost of 0 will make us to pick callee save even if caller save > > >> is available and there are no function calls, so I guess we do not want > > >> that.... > > > > > > OK, here's an updated patch that makes that change. The x86 parts > > > should be replaced by your patch. > > > > > > Tested on aarch64-linux-gnu. I also tried to test on > > > pwoerpc64el-linux-gnu > > > (on gcc112), but I keep getting broken pipes during the test runs, > > > so I'm struggling to get good before/after comparisons. It does at > > > least bootstrap though... > > > > Here's the patch with Honza's x86 changes. Boostrapped & regresiion-tested > > on aarch64-linux-gnu and powerpc64le-linux-gnu (gcc120). The powerpc64le > > results regressed: > > > > FAIL: gcc.dg/guality/vla-1.c -Os -DPREVENT_OPTIMIZATION line 24 i == 5 > > FAIL: gcc.dg/guality/vla-1.c -Os -DPREVENT_OPTIMIZATION line 24 sizeof > > (a) == 17 * sizeof (short) > > > > but the same test already failed for -O2 and -O3. > > > > OK to install now? Or, given the lateness in the release cycle, > > would it be better to wait for GCC 16? > > > > Thanks, > > Richard > > > > > > Following on from the discussion in: > > > > https://gcc.gnu.org/pipermail/gcc-patches/2025-February/675256.html > > > > this patch removes TARGET_IRA_CALLEE_SAVED_REGISTER_COST_SCALE and > > replaces it with two hooks: one that controls the cost of using an > > extra callee-saved register and one that controls the cost of allocating > > a frame for the first spill. > > > > (The patch does not attempt to address the shrink-wrapping part of > > the thread above.) > > > > On AArch64, this is enough to fix PR117477, as verified by the new tests. > > The patch does not change the SPEC2017 scores significantly. (I saw a > > slight improvement in fotonik3d and roms, but I'm not convinced that > > the improvements are real.) > > > > The patch makes IRA use caller saves for gcc.target/aarch64/pr103350-1.c, > > which is a scan-dump correctness test that relies on not using > > caller saves. The decision to use caller saves looks appropriate, > > and saves an instruction, so I've just added -fno-caller-saves > > to the test options. > > > > The x86 parts were written by Honza. > > > > gcc/ > > PR rtl-optimization/117477 > > * config/aarch64/aarch64.cc (aarch64_count_saves): New function. > > (aarch64_count_above_hard_fp_saves, aarch64_callee_save_cost) > > (aarch64_frame_allocation_cost): Likewise. > > (TARGET_CALLEE_SAVE_COST): Define. > > (TARGET_FRAME_ALLOCATION_COST): Likewise. > > * config/i386/i386.cc (ix86_ira_callee_saved_register_cost_scale): > > Replace with... > > (ix86_callee_save_cost): ...this new hook. > > (TARGET_IRA_CALLEE_SAVED_REGISTER_COST_SCALE): Delete. > > (TARGET_CALLEE_SAVE_COST): Define. > > * target.h (spill_cost_type, frame_cost_type): New enums. > > * target.def (callee_save_cost, frame_allocation_cost): New hooks. > > (ira_callee_saved_register_cost_scale): Delete. > > * doc/tm.texi.in (TARGET_IRA_CALLEE_SAVED_REGISTER_COST_SCALE): > > Delete. > > (TARGET_CALLEE_SAVE_COST, TARGET_FRAME_ALLOCATION_COST): New hooks. > > * doc/tm.texi: Regenerate. > > * hard-reg-set.h (hard_reg_set_popcount): New function. > > * ira-color.cc (allocated_memory_p): New variable. > > (allocated_callee_save_regs): Likewise. > > (record_allocation): New function. > > (assign_hard_reg): Use targetm.frame_allocation_cost to model > > the cost of the first spill or first caller save. Use > > targetm.callee_save_cost to model the cost of using new callee-saved > > registers. Apply the exit rather than entry frequency to the cost > > of restoring a register or deallocating the frame. Update the > > new variables above. > > (improve_allocation): Use record_allocation. > > (color): Initialize allocated_callee_save_regs. > > (ira_color): Initialize allocated_memory_p. > > * targhooks.h (default_callee_save_cost): Declare. > > (default_frame_allocation_cost): Likewise. > > * targhooks.cc (default_callee_save_cost): New function. > > (default_frame_allocation_cost): Likewise. > > > > gcc/testsuite/ > > PR rtl-optimization/117477 > > * gcc.target/aarch64/callee_save_1.c: New test. > > * gcc.target/aarch64/callee_save_2.c: Likewise. > > * gcc.target/aarch64/callee_save_3.c: Likewise. > > * gcc.target/aarch64/pr103350-1.c: Add -fno-caller-saves. > > > > Co-authored-by: Jan Hubicka <hubi...@ucw.cz> > > Here is the v2 patch. The only change is > > if (GENERAL_REGNO_P (hard_regno)) > { > /* push is 1 byte while typical spill is 4-5 bytes. > ??? We probably should adjust size costs accordingly. > Costs are relative to reg-reg move that has 2 bytes for 32bit > and 3 bytes otherwise. Be sure that no cost table sets cost > to 2, so we end up with 0. */ > if (mem_cost <= 2 || optimize_function_for_size_p (cfun)) > return 1; > return mem_cost - 2; > } > > in ix86_callee_save_cost. > > Tested on x86-64 with > > $ make check RUNTESTFLAGS="--target_board='unix{-m32,}'" > > OK for master?
Please give x86 maintainers a chance to chime in - when they are happy the previous approval still holds for the rest. Richard. > Thanks. > > H.J.