
peter at cordes dot ca Fri, 27 May 2016 20:49:26 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245


--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Uroš Bizjak from comment #2)
> Recently x86 linux changed the barrier to what you propose. If it is worth,
> we can change it without any problems.

I guess it costs a code byte for a disp8 in the addressing mode, but it avoids
adding a lot of latency to a critical path involving a spill/reload to (%esp),
in functions where there is something at (%esp).

If it's an object larger than 4B, the lock orl could even cause a
store-forwarding stall when the object is reloaded.  (e.g. a double or a
vector).

Ideally we could do the  lock orl  on some padding between two locals, or on
something in memory that wasn't going to be loaded soon, to avoid touching more
stack memory (which might be in the next page down).  But we still want to do
it on a cache line that's hot, so going way up above our own stack frame isn't
good either.
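To make the trade-off concrete, here is a rough sketch of the two barrier placements as GNU inline asm. The helper names are made up for illustration, the -4 offset is just one possible choice, and the non-x86 fallback is only there so the snippet builds elsewhere; this is not what gcc actually emits.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper names, for illustration only (not GCC's or the kernel's).

// Barrier on the word at the stack pointer:
//   lock orl $0, (%esp)    -> F0 83 0C 24 00     (5 bytes)
// If a spilled value lives there, its reload has to wait for the locked RMW.
static inline void full_barrier_at_sp() {
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ __volatile__("lock orl $0, (%%rsp)" ::: "memory", "cc");
#elif defined(__GNUC__) && defined(__i386__)
    __asm__ __volatile__("lock orl $0, (%%esp)" ::: "memory", "cc");
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);  // portable fallback
#endif
}

// Barrier on the word just below the stack pointer:
//   lock orl $0, -4(%esp)  -> F0 83 4C 24 FC 00  (6 bytes: one extra for the disp8)
static inline void full_barrier_below_sp() {
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ __volatile__("lock orl $0, -4(%%rsp)" ::: "memory", "cc");
#elif defined(__GNUC__) && defined(__i386__)
    __asm__ __volatile__("lock orl $0, -4(%%esp)" ::: "memory", "cc");
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);  // portable fallback
#endif
}

// ORing with 0 rewrites the same bits, so stack contents are unchanged.
int barrier_demo() {
    volatile int local = 42;
    full_barrier_at_sp();
    full_barrier_below_sp();
    assert(local == 42);
    return local;
}
```

Both forms are full barriers; the only differences are the extra code byte and which stack word the locked read-modify-write touches.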

> OTOH, we have "orl" here - should we
> change it to "addl" to be consistent with kernel?

That's the common idiom I've seen, but there's no reason I know of to favour
ADD instead of OR.  They both write all the flags, and both can run on any ALU
port on every microarchitecture.  gcc has been using OR already, and I assume
nobody has reported perf problems with it, so we should keep it.

A 32bit operand size is still a good choice.  (The obvious alternative being
8bit, but that doesn't save any code size.  From Agner Fog's insn tables, I
don't see any different entry for locked instructions with m8 vs. m32 operands,
but naturally-aligned 32bit loads/stores are probably the safest bet.)

[Bug target/71245] std::atomic load/store bounces the data to the stack using fild/fistp
