On 5/24/22 18:32, Palmer Dabbelt wrote:
Ping, IMO this needs to be (re)considered for trunk.
This goes really nicely with riscv_slow_unaligned_access_p==false, to
elide the unrolled tail copies for trailer word/sword/byte accesses.
@Kito, @Palmer? Just from a codegen POV this seems to be a no-brainer.
Has anything changed since this was posted?
IIRC the discussion essentially boiled down to that overlapping store
likely being a hard case on in-order machines (like the C906), but
there weren't any benchmarks or documentation so we could figure that
out. I don't see how this is an obvious win: sure it's fewer ops (and
assuming a uniform distribution fewer misaligned accesses, though I
don't know how reasonable uniform distributions are here), but it's
only a small upside so that hard case would have to be fast in order
for this to be better code.
If someone has benchmarks showing these are actually faster on the
C906 (or even some documentation describing how these accesses are
handled) then I'm happy to take the code (with the -Os bit fixed). It
shouldn't be all that hard of a benchmark to run...
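
The "fewer misaligned accesses under a uniform distribution" point can be checked by hand for the 15-byte clear discussed in this thread (sd/sw/sh/sb versus two overlapping sd). A minimal sketch in C, assuming "uniform" means the base address is equally likely to be any residue modulo 8, and that a store counts as misaligned when its address is not a multiple of its size. The helper names are hypothetical and this says nothing about how any particular core handles such stores:

```c
#include <assert.h>

/* Hypothetical helper: count the stores in one sequence that are not
   naturally aligned when the base address is base_mod8 modulo 8.
   offs[i] is the byte offset and sizes[i] the width of store i. */
static int misaligned_stores(int base_mod8, const int *offs,
                             const int *sizes, int n)
{
    int count = 0;
    assert(base_mod8 >= 0 && base_mod8 < 8);
    for (int i = 0; i < n; i++)
        if ((base_mod8 + offs[i]) % sizes[i] != 0)
            count++;
    return count;
}

/* Sum the misaligned-store count over all eight base residues.
   Dividing the result by 8 gives the expected count per call under
   the uniform assumption stated above. */
static int total_misaligned(const int *offs, const int *sizes, int n)
{
    int total = 0;
    for (int b = 0; b < 8; b++)
        total += misaligned_stores(b, offs, sizes, n);
    return total;
}
```

With offsets {0, 8, 12, 14} and sizes {8, 4, 2, 1} the total over the eight residues is 17, and with offsets {0, 7} and sizes {8, 8} it is 14, so under this assumption the overlapping form averages 1.75 misaligned stores per call against 2.125. That is a count only; whether it is a win still depends on how expensive each misaligned store is on the target.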
Would this be acceptable if it were a per-CPU knob, then? There seem to
be existing OoO RV cores too!
foo:
	sd	zero,0(a0)
	sw	zero,8(a0)
	sh	zero,12(a0)
	sb	zero,14(a0)

vs.

	sd	zero,0(a0)
	sd	zero,7(a0)
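
For reference, the two sequences above correspond roughly to the following C. This is only a sketch of the idea, not the patch itself: the function names are hypothetical, and memcpy of a fixed size stands in for a single store that the compiler is free to emit as sd/sw/sh when unaligned access is allowed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: clear 15 bytes with one store per power-of-two chunk,
   mirroring the sd/sw/sh/sb sequence. */
static void clear_15_split(unsigned char *p)
{
    const uint64_t z8 = 0;
    const uint32_t z4 = 0;
    const uint16_t z2 = 0;
    memcpy(p, &z8, 8);       /* sd zero,0(a0)  */
    memcpy(p + 8, &z4, 4);   /* sw zero,8(a0)  */
    memcpy(p + 12, &z2, 2);  /* sh zero,12(a0) */
    p[14] = 0;               /* sb zero,14(a0) */
}

/* Hypothetical: clear n bytes (8 <= n <= 16) with two 8-byte stores,
   the second one overlapping the first so that it ends exactly at
   p + n. For n == 15 the second store lands at offset 7. */
static void clear_overlapping(unsigned char *p, size_t n)
{
    const uint64_t zero = 0;
    assert(n >= 8 && n <= 16);
    memcpy(p, &zero, 8);          /* sd zero,0(a0)   */
    memcpy(p + n - 8, &zero, 8);  /* sd zero,n-8(a0) */
}
```

Both functions leave bytes 0..14 zeroed and touch nothing past byte 14 when called with n == 15; the overlapping form simply writes byte 7 twice. Which one is faster is exactly the open question for in-order cores like the C906.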