On Sun, May 29, 2016 at 11:10:15PM +0200, Uros Bizjak wrote:
> As explained in PR71245, comment #3 [1], it is better to use offset -4
> to a %esp to implement a non-SSE memory fence instruction:
> 
> -q-
> 
> I guess it costs a code byte for a disp8 in the addressing mode, but
> it avoids adding a lot of latency to a critical path involving a
> spill/reload to (%esp), in functions where there is something at
> (%esp).
> 
> If it's an object larger than 4B, the lock orl could even cause a
> store-forwarding stall when the object is reloaded.  (e.g. a double or
> a vector).
> 
> Ideally we could do the  lock orl  on some padding between two locals,
> or on something in memory that wasn't going to be loaded soon, to
> avoid touching more stack memory (which might be in the next page
> down).  But we still want to do it on a cache line that's hot, so
> going way up above our own stack frame isn't good either.

Unfortunately this makes valgrind unhappy about that:
https://bugzilla.redhat.com/show_bug.cgi?id=1423434
I assume it will complain now on anything pre-SSE2 that contains the memory
barrier in 32-bit code.
Perhaps we should decrement and increment %esp around it or something
similar (or push/pop)?  Of course, that would mean we need to take care
of async unwind info.

> Attached RFC patch implements this proposal.
> 
> 2016-05-29  Uros Bizjak  <ubiz...@gmail.com>
> 
>     * config/i386/sync.md (mfence_nosse): Use "lock orl $0, -4(%esp)".
> 
> Patch was bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
> 
> Any other opinion on this issue? The linux kernel also implements
> memory fence like the above proposal.
> 
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245#c3
> 
> Uros.

> Index: config/i386/sync.md
> ===================================================================
> --- config/i386/sync.md       (revision 236863)
> +++ config/i386/sync.md       (working copy)
> @@ -98,7 +98,7 @@
>       (unspec:BLK [(match_dup 0)] UNSPEC_MFENCE))
>     (clobber (reg:CC FLAGS_REG))]
>    "!(TARGET_64BIT || TARGET_SSE2)"
> -  "lock{%;} or{l}\t{$0, (%%esp)|DWORD PTR [esp], 0}"
> +  "lock{%;} or{l}\t{$0, -4(%%esp)|DWORD PTR [esp-4], 0}"
>    [(set_attr "memory" "unknown")])
>  
>  (define_expand "mem_thread_fence"


        Jakub

Reply via email to