On 6/10/24 1:55 AM, Manolis Tsamis wrote:
There was an older submission of a load-pair specific pass but this is
a complete reimplementation and indeed significantly more general.
Apart from being target independant, it addresses a number of
important restrictions and can handle multiple store forwardings per
load.
It should be noted that it cannot handle the load-pair cases as these
need special handling, but that's something we're planning to do in
the future by reusing this infrastructure.
ACK. Thanks for the additional background.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 4e8967fd8ab..c769744d178 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12657,6 +12657,15 @@ loop unrolling.
This option is enabled by default at optimization levels @option{-O1},
@option{-O2}, @option{-O3}, @option{-Os}.
+@opindex favoid-store-forwarding
+@item -favoid-store-forwarding
+@itemx -fno-avoid-store-forwarding
+Many CPUs will stall for many cycles when a load partially depends on previous
+smaller stores. This pass tries to detect such cases and avoid the penalty by
+changing the order of the load and store and then fixing up the loaded value.
+
+Disabled by default.
Is there any particular reason why this would be off by default at -O1
or higher? It would seem to me that on modern cores that this
transformation should easily be a win. Even on an old in-order core,
avoiding the load with the bit insert is likely profitable, just not as
much so.
I don't have a strong opinion for that but I believe Richard's
suggestion to decide this on a per-target basis also makes a lot of
sense.
Deciding whether the transformation is profitable is tightly tied to
the architecture in question (i.e. how large the stall is and what
sort of bit-insert instructions are available).
In order to make this more widely applicable, I think we'll need a
target hook that decides in which case the forwarded stores incur a
penalty and thus the transformation makes sense.
You and Richi are probably right. I'm not a big fan of passes being
enabled/disabled on particular targets, but it may make sense here.
Afaik, for each CPU there may be cases that store forwarding is
handled efficiently.
Absolutely. But forwarding from a smaller store to a wider load is
painful from a hardware standpoint and if we can avoid it from a codegen
standpoint, we should.
Did y'all look at spec2017 at all for this patch? I've got our hardware
guys to expose a signal for this case so that we can (in a month or so)
get some hard data on how often it's happening in spec2017 and evaluate
how this patch helps the most affected workloads. But if y'all already
have some data we can use it as a starting point.
jeff