On Tue, 24 May 2022 18:36:27 PDT (-0700), Vineet Gupta wrote:


On 5/24/22 18:32, Palmer Dabbelt wrote:

Ping, IMO this needs to be (re)considered for trunk.
This goes really nicely with riscv_slow_unaligned_access_p==false, to
elide the unrolled tail copies for the trailing word/halfword/byte accesses.

@Kito, @Palmer? Just from a codegen POV this seems to be a no-brainer.

Has anything changed since this was posted?

IIRC the discussion essentially boiled down to that overlapping store
likely being a hard case on in-order machines (like the C906), but
there weren't any benchmarks or documentation so we could figure that
out.  I don't see how this is an obvious win: sure it's fewer ops (and
assuming a uniform distribution fewer misaligned accesses, though I
don't know how reasonable uniform distributions are here), but it's
only a small upside so that hard case would have to be fast in order
for this to be better code.
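
To make the "fewer misaligned accesses" remark concrete, here is a small C sketch that counts misaligned stores for the two sequences from the example in this thread (sd/sw/sh/sb at offsets 0/8/12/14 versus sd/sd at offsets 0/7), summed over all eight possible base alignments. It assumes the base address is uniformly distributed modulo 8 and that a store counts as misaligned when its address is not a multiple of its size. The struct and function names are made up for illustration and are not GCC code, and the count says nothing about how expensive a misaligned store is on any particular core.

```c
#include <assert.h>
#include <stddef.h>

/* One store: byte offset from the base pointer and access size in bytes. */
struct store { unsigned offset; unsigned size; };

/* Count stores whose address is not naturally aligned, summed over
   every base alignment 0..7 (hypothetical helper, for illustration). */
static unsigned count_misaligned(const struct store *seq, size_t n)
{
    assert(seq != NULL);
    unsigned total = 0;
    for (unsigned base = 0; base < 8; base++)
        for (size_t i = 0; i < n; i++)
            if ((base + seq[i].offset) % seq[i].size != 0)
                total++;
    return total;
}

/* sd 0(a0); sw 8(a0); sh 12(a0); sb 14(a0) */
static const struct store split_seq[] = { {0, 8}, {8, 4}, {12, 2}, {14, 1} };

/* sd 0(a0); sd 7(a0) */
static const struct store overlap_seq[] = { {0, 8}, {7, 8} };
```

Under these assumptions, count_misaligned(split_seq, 4) gives 17 misaligned stores out of 32, and count_misaligned(overlap_seq, 2) gives 14 out of 16: fewer in absolute terms, but a higher fraction of the stores issued.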

If someone has benchmarks showing these are actually faster on the
C906 (or even some documentation describing how these accesses are
handled) then I'm happy to take the code (with the -Os bit fixed).  It
shouldn't be all that hard of a benchmark to run...

Would this be acceptable if it were a per-CPU knob, then? There seem to
be existing OoO RV cores too!

It's being added as a per-CPU knob; it's just only being turned on for the C906 and -Os tunings, where it's not obviously a win.

I'm certainly not saying nobody builds this flavor of machine (Intel does, as it's on for their machines), just that there's no solid evidence the C906 behaves this way. Given that, during the -Os discussions, this flag was explicitly discussed as not covering generating misaligned accesses on purpose, I don't want to just flip it over on a vendor and risk a performance regression.

The only other pipeline models are for in-order SiFive processors that trap into M-mode for unaligned accesses, so this sort of thing doesn't apply (though it's part of the reason -Os doesn't do this, as they're still pretty common).



foo:
     sd    zero,0(a0)
     sw    zero,8(a0)
     sh    zero,12(a0)
     sb    zero,14(a0)

vs.

     sd    zero,0(a0)
     sd    zero,7(a0)

