On 8/11/23 17:32, Vineet Gupta wrote:
On 8/1/23 12:17, Vineet Gupta wrote:
Hi Jeff,
As discussed this morning, I'm sending over dumps for the optimization of the DF constant -0.0 (PR 110748) [1].
For an rv64gc_zbs build, IRA undoes the split, which eventually leads to an ICE in the final pass.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748#c15
void znd(double *d) { *d = -0.0; }
*split1*
(insn 10 3 11 2 (set (reg:DI 136)
(const_int -9223372036854775808 [0x8000000000000000])) "neg.c":4:5 -1
(insn 11 10 0 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
(subreg:DF (reg:DI 136) 0)) "neg.c":4:5 -1
*ira*
(insn 11 9 12 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
(const_double:DF -0.0 [-0x0.0p+0])) "neg.c":4:5 190
{*movdf_hardfloat_rv64}
(expr_list:REG_DEAD (reg:DI 135)
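To make the connection between the two dumps concrete, here is a hedged sketch (the helper name is mine, not from GCC): the integer that insn 10 materializes into reg 136 is exactly the IEEE-754 bit pattern of DF -0.0, i.e. only the sign bit set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (name is mine): return the IEEE-754 bit pattern of
   a double -- the same integer that insn 10 above loads into reg 136. */
static uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* well-defined way to type-pun */
    return bits;
}
```

Since only bit 63 is set, the split lets a Zbs target build the constant with a single bseti of x0 rather than loading it from the constant pool, which is presumably why the split exists.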
In the working case, the large constant is not involved, so it is not subject to IRA's foul play.
I investigated this some more. IRA's update_equiv_regs () has code identifying potential replacements: a reg that is referenced exactly twice, set once and used once.
if (REG_N_REFS (regno) == 2
&& (rtx_equal_p (replacement, src)
|| ! equiv_init_varies_p (src))
&& NONJUMP_INSN_P (insn)
&& equiv_init_movable_p (PATTERN (insn), regno))
reg_equiv[regno].replace = 1;
}
And combine_and_move_insns () does the replacement, undoing the split1
above.
Right. This is as expected. There was actually similar code that goes
back even before the introduction of IRA -- like to the 80s and 90s.
Conceptually the idea is a value with an equivalence that has a single
set and single use isn't a good use of a hard register. Better to
narrow the live range to a single pair of instructions.
It's not always a good tradeoff. Consider if the equivalence was also a
loop invariant and hoisted out of the loop and register pressure is low.
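A minimal C sketch of that tradeoff (a hypothetical test case, not from the PR): the bit pattern of -0.0 is loop-invariant here, so if the split constant stays in its own pseudo it can be hoisted and built once before the loop; substituting the equivalence back into the store forces it to be re-materialized on every iteration.

```c
/* Hypothetical loop variant of the znd() test case: the -0.0 store is
   loop-invariant, so keeping insn 10's constant in its own pseudo lets
   it be hoisted out of the loop and built just once. */
void znd_loop(double *d, int n)
{
    for (int i = 0; i < n; i++)
        d[i] = -0.0;
}
```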
In fact this is the reason for many more split1 being undone. See the
suboptimal codegen for large const for Andrew Pinski's test case [1]
No doubt. I think it's also a problem with some of Jivan's work.
I'm wondering (naively) if there is some way to tune this for a given backend. In general it would make sense to do the replacement, but not if the cost changes (e.g. constants can be embedded in x86 insns freely, but not on RISC-V, where this is costly; if something was split, it might have been intentional).
I'm not immediately aware of a way to tune.
When it comes to tuning, the top-level question is whether we have any of the info we need at the point where the transformation occurs. The two most obvious pieces here would be loop info and register pressure.
I.e., do we have enough loop structure to know if the def is at a shallower loop nest than the use? There's a reasonable chance we have this information, as my recollection is that this analysis is done fairly early in IRA.
But that means we likely don't have any sense of register pressure at
the points between the def and use. So the most useful metric for
tuning isn't really available.
The one thing that stands out is we don't do this transformation at all
when register pressure sensitive scheduling is enabled. And we really
should be turning that on by default. Our data shows register pressure
sensitive scheduling is about a 6-7% cycle improvement on x264 as it
avoids spilling in those key satd loops.
/* Don't move insns if live range shrinkage or register
pressure-sensitive scheduling were done because it will not
improve allocation but likely worsen insn scheduling. */
if (optimize
&& !flag_live_range_shrinkage
&& !(flag_sched_pressure && flag_schedule_insns))
combine_and_move_insns ();
So you might want to look at register-pressure-sensitive scheduling first. If you go into x264_r from SPECint and look at x264_pixel_satd_8x4, first verify the loops are fully unrolled. If they are, then look for 32-bit loads/stores into the stack. If you have them, then you're spilling and getting crappy performance. Using register-pressure-sensitive scheduling should help significantly.
We've certainly seen that internally. The plan was to submit a patch to
make register pressure sensitive scheduling the default when the
scheduler is enabled. We just haven't pushed on it. If you can verify
that you're seeing spilling as well, then it'd certainly bolster the
argument that register-pressure-sensitive scheduling is desirable.
Jeff