On 8/14/23 18:35, Vineet Gupta wrote:
On 8/11/23 17:04, Jeff Law wrote:
I'm wondering (naively) if there is some way to tune this for a
given backend. In general it would make sense to do the replacement,
but not if the cost changes (e.g. constants can be embedded in x86
insns freely, but not for RISC-V where this is costly, and if something
was split, it might have been intentional).
I'm not immediately aware of a way to tune.
When it comes to tuning, the top-level question is whether we have any of
the info we need to tune at the point where the transformation occurs.
The two most obvious pieces here would be loop info and register pressure.
I.e., do we have enough loop structure to know if the def is at a
shallower loop nest than the use? There's a reasonable chance we have
this information, as my recollection is this analysis is done fairly
early in IRA.
But that means we likely don't have any sense of register pressure at
the points between the def and use. So the most useful metric for
tuning isn't really available.
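Just to make the shape of such a check concrete, here is a hypothetical
sketch (the names and structure are made up for illustration; nothing
like this exists in IRA today) of the two inputs such tuning would need:
the relative loop depth of the def and the use, and pressure information
we simply don't have at this point.

  /* Hypothetical sketch only -- none of these names exist in GCC.  */
  struct point_info
  {
    int loop_depth;   /* loop nesting depth at this insn */
    int pressure;     /* register pressure here, or -1 if unknown */
  };

  static int
  move_def_to_use_p (const struct point_info *def,
                     const struct point_info *use, int pressure_limit)
  {
    /* Moving the def into a deeper loop nest re-executes it more often;
       the backend may have split things precisely to avoid that.  */
    if (use->loop_depth > def->loop_depth)
      return 0;

    /* Without pressure data between the def and the use we can't judge
       the effect on allocation, so stay conservative.  */
    if (def->pressure < 0 || use->pressure < 0)
      return 0;

    return use->pressure < pressure_limit;
  }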
I'd argue that even if the register pressure were high, in some cases
there's just no way around it and RA needs to honor what the backend did
a priori (split, in this case); otherwise we end up with something which
literally doesn't compute and leads to an ICE. I'm puzzled that in this
case an intentional implementation choice is getting in the way. So while
I don't care about the -0.0 case in itself, it seems that with the current
framework we can't achieve the result other than via the roundabout
peephole2 approach you alluded to.
I think you'll run into a lot of resistance with that approach. The
fact is we're being a bit sneaky and telling a bit of a fib in the
backend (claiming support for certain capabilities that don't actually
exist).
As many have said, lie to GCC and ultimately it will get its revenge. This
is but one example.
When we lie to some parts of gcc, we may well trigger undesirable
behavior later in the pipeline. It's a tradeoff and sometimes we have
to back out those little lies.
The one thing that stands out is we don't do this transformation at
all when register pressure sensitive scheduling is enabled. And we
really should be turning that on by default. Our data shows register
pressure sensitive scheduling is about a 6-7% cycle improvement on
x264 as it avoids spilling in those key satd loops.
  /* Don't move insns if live range shrinkage or register
     pressure-sensitive scheduling were done because it will not
     improve allocation but likely worsen insn scheduling.  */
  if (optimize
      && !flag_live_range_shrinkage
      && !(flag_sched_pressure && flag_schedule_insns))
    combine_and_move_insns ();
So you might want to look at register pressure sensitive scheduling
first. If you go into x264_r from SPECint and look at
x264_pixel_satd_8x4, first verify the loops are fully unrolled. If
they are, then look for 32-bit loads/stores into the stack. If you
have them, then you're spilling and getting crappy performance. Using
register pressure sensitive scheduling should help significantly.
Is that -fira-loop-pressure?
-fsched-pressure I think.
We've certainly seen that internally. The plan was to submit a patch
to make register pressure sensitive scheduling the default when the
scheduler is enabled. We just haven't pushed on it. If you can
verify that you're seeing spilling as well, then it'd certainly
bolster the argument that register-pressure-sensitive scheduling is
desirable.
I can confirm that the loop is fully unrolled and there's a zillion
stack spills there for intermediate computes (-Ofast
-march=rv64gc_zba_zbb_zbs, no V in that build).
Yea, you'll take a big hit from those spills. Good to get a
confirmation that you're seeing it too.
The fix should be pretty simple. We just turn on -fsched-pressure in
the RV backend.
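Roughly, that would just be a couple of lines in the RISC-V
option-override hook (riscv_option_override in riscv.cc) -- shown here
only as a sketch of the usual pattern other ports use, not as a tested
patch:

  /* Sketch only: default to register-pressure-sensitive scheduling when
     the scheduler runs and the user hasn't explicitly chosen otherwise.  */
  if (flag_schedule_insns && !global_options_set.x_flag_sched_pressure)
    flag_sched_pressure = 1;

That would still leave -fno-sched-pressure available for anyone who wants
the old behavior back.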
Jeff