On 5/24/22 18:32, Palmer Dabbelt wrote:

Ping, IMO this needs to be (re)considered for trunk.
This goes really nicely with riscv_slow_unaligned_access_p==false, to
elide the unrolled tail copies for trailer word/sword/byte accesses.

@Kito, @Palmer? Just from a codegen POV this seems to be a no-brainer.

Has anything changed since this was posted?

IIRC the discussion essentially boiled down to that overlapping store likely being a hard case on in-order machines (like the C906), but there weren't any benchmarks or documentation so we could figure that out.  I don't see how this is an obvious win: sure it's fewer ops (and assuming a uniform distribution fewer misaligned accesses, though I don't know how reasonable uniform distributions are here), but it's only a small upside so that hard case would have to be fast in order for this to be better code.

If someone has benchmarks showing these are actually faster on the C906 (or even some documentation describing how these accesses are handled) then I'm happy to take the code (with the -Os bit fixed).  It shouldn't be all that hard of a benchmark to run...

Would this be acceptable if it were a per-CPU knob, then? There seem to be existing OoO RV cores too!


foo:
     sd    zero,0(a0)
     sw    zero,8(a0)
     sh    zero,12(a0)
     sb    zero,14(a0)

vs.

     sd    zero,0(a0)
     sd    zero,7(a0)
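
For anyone who wants to sanity-check the equivalence outside the compiler, here is a minimal C sketch of the two sequences above. The function names are made up for illustration, and the memcpy calls only stand in for the sd/sw/sh/sb stores. Whether a given compiler lowers the second memcpy to a single misaligned sd depends on the target and tuning, so treat that as an assumption. This shows that both forms clear the same 15 bytes, nothing about which is faster.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: clear exactly 15 bytes with four non-overlapping
   stores of 8, 4, 2 and 1 bytes (the sd/sw/sh/sb sequence). */
static void clear15_tail(unsigned char *p)
{
    uint64_t z8 = 0;
    uint32_t z4 = 0;
    uint16_t z2 = 0;

    memcpy(p, &z8, 8);       /* bytes 0..7   */
    memcpy(p + 8, &z4, 4);   /* bytes 8..11  */
    memcpy(p + 12, &z2, 2);  /* bytes 12..13 */
    p[14] = 0;               /* byte 14      */
}

/* Hypothetical helper: clear the same 15 bytes with two 8-byte stores.
   The second store starts at offset 7, so it rewrites byte 7 and is
   misaligned whenever p itself is 8-byte aligned. */
static void clear15_overlap(unsigned char *p)
{
    uint64_t z8 = 0;

    memcpy(p, &z8, 8);       /* bytes 0..7  */
    memcpy(p + 7, &z8, 8);   /* bytes 7..14 */
}
```

Both helpers leave byte 15 alone, so the overlapping form is only valid because the total length is known at compile time to be at least 8 bytes.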


