2013/6/3 dw <limegreenso...@yahoo.com>:
> So, a question about __faststorefence.  The current implementation in
> winnt.h is incorrect.  I have three alternatives to propose, and which one is
> "best" depends on the goals of the mingw-w64 project.  One approach is "just
> do what MSVC does."  However, there's also something to be said for
> "generate the fastest possible code."  And for completeness, there's also
> "use a built-in."  Details with pros/cons below.
>
> First, the current code:
>
>     __MINGW_INTRIN_INLINE void __faststorefence(void) {
>       __asm__ __volatile__ ("" ::: "memory");
>     }
>
> While the "memory" clobber acts as a compiler barrier (the equivalent of
> _ReadWriteBarrier()), __faststorefence must also emit some kind of fence
> instruction for the processor (sfence, mfence, or a lock-prefixed instruction).
>
> So, we can:
>
> 1) Just map this to __sync_synchronize().  This acts as a full compiler
> memory barrier and generates an mfence instruction.
> pros:
> - Uses builtin.
>
> cons:
> - Generates mfence instead of sfence (see timing #'s below).
> - Generates mfence even if compiled with -mno-sse (mfence is sse2).
> - Generates mfence instead of the "lock or DWORD PTR [rsp], 0" which MSVC
> generates.
>
> 2) Map this to the same as MSVC.  The "memory" clobber ensures the compiler
> barrier, and the "lock" provides the fence:
>
> asm ("lock orl %[zero], (%%rsp)" :: [zero] "ri" (0) : "memory", "cc")
>
> pros:
> - consistent with MSVC.
>
> cons:
> - While sfence may have been slower when first introduced, it's now faster
> than "lock or" (see #'s below).
>
> 3) Use code like:
>
>     __MINGW_INTRIN_INLINE void x__faststorefence(void) {
>
> #ifdef __SSE__ // defined by gcc when sse instructions are available
>       asm ("sfence" ::: "memory");
> #else
>       asm ("lock orl %[zero], (%%rsp)" :: [zero] "ri" (0) : "memory", "cc");
> #endif
>
>     }
>
> pros:
> - Uses the faster sfence if available.
> - Falls back to "lock or" for max compatibility.
>
> cons:
> - Not consistent with MSVC.
> - SFENCE is not necessarily the fastest on all processors.
>
> I ran some timings using x64 on my i7, and this is what I found:
>
> _mm_sfence:  3,589,817,193
> lock or   : 14,960,719,245
> _mm_mfence: 19,608,594,657
>
> Obviously these results are going to be both highly hw specific and depend
> heavily on the code surrounding them.  Still...
>
> If I were going to pick, I'd probably go with #3.  It isn't 100% identical
> to MSVC, but it effectively produces the same results, and will (at least on
> current processors) generate faster code.
>
> Opinions?
>
> dw
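
For anyone who wants to reproduce numbers like the ones quoted above, here
is a minimal timing sketch.  It is not the harness dw used (the post does
not show it).  The function names, the iteration count and the use of
clock() are all assumptions, and the inline asm assumes GCC on x86-64, with
a __sync_synchronize() fallback on other targets so the snippet still builds.

```c
#include <time.h>

/* Hypothetical helpers for illustration; none of these names come from winnt.h. */
#if defined(__x86_64__)
static void fence_sfence(void)  { __asm__ __volatile__ ("sfence" ::: "memory"); }
static void fence_mfence(void)  { __asm__ __volatile__ ("mfence" ::: "memory"); }
static void fence_lock_or(void) { __asm__ __volatile__ ("lock orl $0, (%%rsp)" ::: "memory", "cc"); }
#else
/* Assumption: on non-x86-64 targets every variant falls back to the GCC builtin. */
static void fence_sfence(void)  { __sync_synchronize(); }
static void fence_mfence(void)  { __sync_synchronize(); }
static void fence_lock_or(void) { __sync_synchronize(); }
#endif

/* Runs `iterations` store+fence pairs and returns the elapsed CPU time in seconds. */
static double time_fence(void (*fence)(void), unsigned long iterations)
{
    volatile unsigned int sink = 0;   /* gives the fence a store to order */
    unsigned long i;
    clock_t start = clock();

    for (i = 0; i < iterations; ++i) {
        sink = (unsigned int)i;
        fence();
    }
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_fence(fence_sfence, 100000000UL), and likewise for the other
two, gives a rough relative comparison.  The loop and call overhead are
included, so the absolute values mean little.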

I think option #3 is the one I prefer too.  Just one thing about the
SSE instructions.  For 64-bit we can assume that SSE is always
present.  Only for 32-bit should we check in the headers for the
__SSE__ macro, and in the intrinsic function (non-inline) we should
default to the non-SSE version.
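
A rough sketch of that split might look like the following.  The function
names are hypothetical (this is not the actual mingw-w64 code), and the
__sync_synchronize() branch for non-x86 targets is an assumption added only
so that the snippet builds anywhere.

```c
/* Header (inline) variant: use sfence where SSE is known to be available. */
static __inline__ void sketch_faststorefence_inline(void)
{
#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE__))
    /* 64-bit always has SSE; 32-bit only when gcc defines __SSE__.
       Note that sfence orders earlier stores before later stores. */
    __asm__ __volatile__ ("sfence" ::: "memory");
#elif defined(__i386__)
    /* 32-bit without SSE: locked no-op read-modify-write on the stack top. */
    __asm__ __volatile__ ("lock orl $0, (%%esp)" ::: "memory", "cc");
#else
    __sync_synchronize();   /* assumption: fallback for non-x86 targets */
#endif
}

/* Out-of-line (non-inline) variant: on 32-bit, default to the non-SSE form. */
void sketch_faststorefence_outofline(void);
void sketch_faststorefence_outofline(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__ ("sfence" ::: "memory");
#elif defined(__i386__)
    __asm__ __volatile__ ("lock orl $0, (%%esp)" ::: "memory", "cc");
#else
    __sync_synchronize();   /* assumption: fallback for non-x86 targets */
#endif
}
```

The out-of-line copy cannot know how its callers were compiled, which is
why it takes the conservative path on 32-bit.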

Kai

_______________________________________________
Mingw-w64-public mailing list
Mingw-w64-public@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
