So, a question about __faststorefence. The current implementation in winnt.h is incorrect. I have three alternatives to propose, and which one is "best" depends on the goals of the mingw-w64 project. One approach is "just do what MSVC does." However, there's also something to be said for "generate the fastest possible code." And for completeness, there's also "use a built-in." Details with pros/cons below.

First, the current code:

    __MINGW_INTRIN_INLINE void __faststorefence(void) {
      __asm__ __volatile__ ("" ::: "memory");
    }

While the "memory" clobber acts as a _ReadWriteBarrier() for the compiler, __faststorefence must also generate some type of fence instruction for the processor (sfence, mfence, or a lock-prefixed instruction); see <http://msdn.microsoft.com/en-us/library/t710k390%28v=VS.80%29.aspx>. The current code emits no instruction at all.

So, we can:

1) Just map this to __sync_synchronize <http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/_005f_005fsync-Builtins.html#_005f_005fsync-Builtins>(). This does the full memory compiler barrier, and generates an mfence instruction.
pros:
- Uses builtin.

cons:
- Generates mfence instead of sfence (see timing #'s below).
- Generates mfence even if compiled with -mno-sse (mfence is sse2).
- Generates mfence instead of the "lock or DWORD PTR [rsp], 0" which MSVC generates.
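
For reference, option 1 could look roughly like the sketch below. The names my_faststorefence, publish, payload and ready are hypothetical and only there for illustration (the first avoids clashing with the real intrinsic); only __sync_synchronize is the actual GCC builtin.

```c
/* Sketch of option 1: forward to the GCC builtin.
   my_faststorefence is a hypothetical name, not the winnt.h one. */
static inline void my_faststorefence(void) {
  __sync_synchronize();  /* full barrier for both compiler and processor */
}

/* Hypothetical usage: write a payload, fence, then raise a flag. */
static int payload;
static volatile int ready;

static void publish(int value) {
  payload = value;
  my_faststorefence();   /* payload store is ordered before the flag store */
  ready = 1;
}
```

On x86-64 the builtin is where the mfence mentioned in the cons comes from.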

2) Map this to the same as MSVC. The "memory" clobber ensures the compiler barrier, and the "lock" provides the fence:

asm ("lock or %[zero], (%%rsp)" :: [zero] "ri" (0) : "memory", "cc")

pros:
- consistent with MSVC.

cons:
- While sfence may have been slower when first introduced, it's faster than "or" now (see #'s below).

3) Use code like:

    __MINGW_INTRIN_INLINE void x__faststorefence(void) {

#ifdef __SSE__ // defined by gcc when sse instructions are available
      asm ("sfence" ::: "memory");
#else
      asm ("lock or %[zero], (%%rsp)" :: [zero] "ri" (0) : "memory", "cc");
#endif

    }

pros:
- Uses the faster sfence if available.
- Falls back to "or" for max compatibility.

cons:
- Not consistent with MSVC.
- SFENCE is not necessarily the fastest on all processors.

I ran some timings using x64 on my i7, and this is what I found:

_mm_sfence:  3,589,817,193
lock or   : 14,960,719,245
_mm_mfence: 19,608,594,657

Obviously these results are going to be both highly hw specific and depend heavily on the code surrounding them. Still...
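
If anyone wants to try this on their own hardware, a minimal harness might look like the sketch below. This is not the code that produced the numbers above; the function names, the use of clock(), and the __sync_synchronize fallback for non-x86-64 targets are all assumptions made for illustration. Calling through a function pointer adds overhead to every variant, so only the relative differences mean anything.

```c
#include <time.h>

#if defined(__x86_64__)
/* The three candidate fences, each also acting as a compiler barrier. */
static void fence_sfence(void)  { __asm__ __volatile__ ("sfence" ::: "memory"); }
static void fence_mfence(void)  { __asm__ __volatile__ ("mfence" ::: "memory"); }
static void fence_lock_or(void) {
  /* Same idea as MSVC's "lock or DWORD PTR [rsp], 0". */
  __asm__ __volatile__ ("lock orl $0, (%%rsp)" ::: "memory", "cc");
}
#else
/* Assumed fallback so the sketch still builds on other targets. */
static void fence_sfence(void)  { __sync_synchronize(); }
static void fence_mfence(void)  { __sync_synchronize(); }
static void fence_lock_or(void) { __sync_synchronize(); }
#endif

typedef void (*fence_fn)(void);

/* Returns the CPU time in seconds spent calling fn iterations times. */
static double time_fence(fence_fn fn, long iterations) {
  clock_t start = clock();
  for (long i = 0; i < iterations; i++)
    fn();
  return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_fence(fence_sfence, n), time_fence(fence_lock_or, n) and time_fence(fence_mfence, n) with the same large n gives a rough comparison. A more realistic test would also issue stores between the fences.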

If I were going to pick, I'd probably go with #3. It isn't 100% identical to MSVC, but it effectively produces the same results, and will (at least on current processors) generate faster code.

Opinions?

dw