On 7/3/25 3:50 PM, Palmer Dabbelt wrote:
This is really Jim's code, but it's been sitting around in Bugzilla for a while
so I've picked it up.  All I really did here is add a target hook and mangle
some comments, but I think I understand enough about what's going on to try and
get things moving forward.  So I'm writing up a pretty big cover letter to try
and summarize what I think is going on here, as it's definitely not something I
fully understand yet.

We've got a quirk in the RISC-V ABI where DF arguments on rv32 get split into
an X register and a 32-bit aligned stack slot.  The middle-end prologue code
just stores out the X register and treats the argument as if it was entirely
passed on the stack.  This can result in a misaligned load, and those are still
slow on a bunch of RISC-V systems.
Weird. I'd been constructing a response over the last week or so, a little bit at a time, but I can't find it or any evidence of it.

Anyway, this isn't that weird of a quirk; other targets have partial-in-regs arguments. The PA SOM ABI had this property IIRC, but after playing with it a bit it's not useful to review for insight, as it's a strict-alignment target.

I also checked another old favorite, the V8, but it doesn't have DF mode moves, so it's not particularly interesting for that case.
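
The example under discussion isn't quoted in this reply; a hypothetical reconstruction of the kind of signature that triggers the split (the argument names and counts are illustrative, assuming the ilp32d ABI where eight doubles exhaust fa0-fa7 and seven ints consume a0-a6) would be:

```c
/* Hypothetical reconstruction of the split-argument case: with fa0-fa7
   and a0-a6 already consumed, "split" is passed with its low half in
   a7 and its high half in a 32-bit-aligned caller stack slot. */
double foo(double f0, double f1, double f2, double f3,
           double f4, double f5, double f6, double f7,
           int x0, int x1, int x2, int x3, int x4, int x5, int x6,
           double split)
{
    return split;
}
```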



Without this patch (and with this patch on targets with fast misaligned
accesses) that generates

         sw      a7,12(sp)
         fld     fa0,12(sp)

and with this patch (on a subtarget with slow misaligned access) ends up as

         lw      a5,16(sp)
         sw      a7,8(sp)
         sw      a5,12(sp)
         fld     fa0,8(sp)

That looks a little odd, but I think it's actually good code -- the only way to
get a double into a register on rv32 is to load it from memory, so without
misaligned loads we're sort of just stuck there.
Right. Both sequences likely trigger store-to-load forwarding issues. The second is probably better when we don't have fast unaligned access.


While playing around writing this cover letter I came up with another case
that's essentially

     long long foo(..., long long split) { return split; }

that used to generate

         sw      a7,12(sp)
         lw      a0,12(sp)
         lw      a1,16(sp)

and now generates

         lw      a1,0(sp)
         mv      a0,a7
Yea, but I think this would "just work" if we killed movdi for rv32. That's a hole I'm not going to dive into (I can't justify the amount of time it would likely take to chase down regressions), but it's a hole with a bit of cheese in it for someone who really cares about rv32.

Essentially for the long long case the existence of movdi means the component moves aren't seen and optimized away by CSE.
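
For reference, a compilable version of the long long case above (the leading arguments are elided in the original; the seven ints here are a hypothetical expansion that fills a0-a6 on rv32/ilp32) would be:

```c
/* Hypothetical expansion of the elided arguments: seven ints fill
   a0-a6, so "split" is passed with its low half in a7 and its high
   half in a caller stack slot. */
long long foo(int x0, int x1, int x2, int x3, int x4, int x5, int x6,
              long long split)
{
    return split;
}
```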

The target hook will need some adjustment, but ultimately I'm not even sure if
a target hook is the way to go here.  It was just an easy way to flip the
behavior so I could play around with some of Jim's code.  It kind of feels like
the load/subword merge version would result in better code in general, but I'm
not sure on that one.

That said, I figured I'd just send it out so others could see this.  It's very
much out of my wheel house, so I'd be shocked if this doesn't cause any
failures...
Not sure if it's your patch or some other instability, but pre-commit testing certainly wasn't happy; it failed to build glibc.

Conceptually I don't see any other viable approach for DF and slow unaligned access. There's no direct way to transfer the value from the GPRs into the FPRs; you have to go through memory, which is a time-honored tradition. Folks have spent a ton of time on these problems in the past.

Given that the value has to be in memory at some point, and it can't be in the original slot because that slot is unaligned, the only viable path is to allocate a new slot and initialize it efficiently. That can either happen during initial RTL generation, as Jim and you have done, or we could make it possible to hold DF values in GPRs, split post-reload, and hope DSE cleans things up.
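
The "through memory" round trip can be sketched at the C level (pair_to_double is an illustrative name; RISC-V is little-endian, so the low word comes first):

```c
#include <stdint.h>
#include <string.h>

/* C-level equivalent of the memory round trip: with no direct
   GPR->FPR move for a 64-bit value on rv32, the compiler stores the
   two 32-bit halves to a fresh stack slot (sw + sw) and loads the
   double back (fld). */
double pair_to_double(uint32_t lo, uint32_t hi)
{
    uint32_t slot[2] = { lo, hi };  /* the newly allocated slot */
    double d;
    memcpy(&d, slot, sizeof d);     /* the fld from that slot */
    return d;
}
```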

Anyway, ISTM that attacking this at RTL generation time is the better choice. So you'd need to determine whether your patch is causing the glibc failure and, if so, address it.

Jeff

