On Sun, May 29, 2016 at 11:10:15PM +0200, Uros Bizjak wrote: > As explained in PR71245, comment #3 [1], it is better to use offset -4 > to a %esp to implement a non-SSE memory fence instruction: > > -q- > > I guess it costs a code byte for a disp8 in the addressing mode, but > it avoids adding a lot of latency to a critical path involving a > spill/reload to (%esp), in functions where there is something at > (%esp). > > If it's an object larger than 4B, the lock orl could even cause a > store-forwarding stall when the object is reloaded. (e.g. a double or > a vector). > > Ideally we could do the lock orl on some padding between two locals, > or on something in memory that wasn't going to be loaded soon, to > avoid touching more stack memory (which might be in the next page > down). But we still want to do it on a cache line that's hot, so > going way up above our own stack frame isn't good either.
Unfortunately this makes valgrind unhappy about that: https://bugzilla.redhat.com/show_bug.cgi?id=1423434 I assume it will complain now on anything pre-SSE2 that contains the memory barrier in 32-bit code. Perhaps we should decrement and increment %esp around it or something similar (or push/pop)? Of course, that would mean we need to take care of async unwind info. > Attached RFC patch implements this proposal. > > 2016-05-29 Uros Bizjak <ubiz...@gmail.com> > > * config/i386/sync.md (mfence_nosse): Use "lock orl $0, -4(%esp)". > > Patch was bootstrapped and regression tested on x86_64-linux-gnu {,-m32}. > > Any other opinion on this issue? The linux kernel also implements > memory fence like the above proposal. > > [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245#c3 > > Uros. > Index: config/i386/sync.md > =================================================================== > --- config/i386/sync.md (revision 236863) > +++ config/i386/sync.md (working copy) > @@ -98,7 +98,7 @@ > (unspec:BLK [(match_dup 0)] UNSPEC_MFENCE)) > (clobber (reg:CC FLAGS_REG))] > "!(TARGET_64BIT || TARGET_SSE2)" > - "lock{%;} or{l}\t{$0, (%%esp)|DWORD PTR [esp], 0}" > + "lock{%;} or{l}\t{$0, -4(%%esp)|DWORD PTR [esp-4], 0}" > [(set_attr "memory" "unknown")]) > > (define_expand "mem_thread_fence" Jakub