I have a question about __faststorefence. The current implementation in
winnt.h is incorrect. I have three alternatives to propose, and which
one is "best" depends on the goals of the mingw-w64 project. One
approach is "just do what MSVC does." However, there's also something to
be said for "generate the fastest possible code." And for completeness,
there's also "use a built-in." Details with pros/cons below.
First, the current code:
__MINGW_INTRIN_INLINE void __faststorefence(void) {
    __asm__ __volatile__ ("" ::: "memory");
}
While the "memory" clobber gives the compiler the equivalent of
_ReadWriteBarrier(), __faststorefence must also generate some type of
fence instruction for the processor (sfence, mfence, or a lock-prefixed
instruction):
<http://msdn.microsoft.com/en-us/library/t710k390%28v=VS.80%29.aspx>
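To make the gap concrete, here is a minimal sketch of a store hand-off. The names payload, ready, compiler_barrier_only and publish are made up for illustration, and __sync_synchronize() is only a stand-in for whichever processor fence ends up being chosen:

```c
static int payload;          /* data being handed off */
static volatile int ready;   /* flag another thread would poll */

/* What the current winnt.h body amounts to: it stops the compiler from
   reordering memory accesses across this point, but emits no instruction. */
static inline void compiler_barrier_only(void) {
    __asm__ __volatile__ ("" ::: "memory");
}

static inline void publish(int value) {
    payload = value;
    compiler_barrier_only();  /* compiler ordering only */
    __sync_synchronize();     /* stand-in for the processor fence the
                                 current code never emits */
    ready = 1;
}
```

With only the first call in place, the two stores stay in program order in the generated code, but nothing is emitted for the processor.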
So, we can:
1) Just map this to __sync_synchronize():
<http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/_005f_005fsync-Builtins.html#_005f_005fsync-Builtins>
This acts as a full compiler memory barrier and also generates an mfence
instruction.
pros:
- Uses builtin.
cons:
- Generates mfence instead of sfence (see timing #'s below).
- Generates mfence even if compiled with -mno-sse (mfence is sse2).
- Generates mfence instead of the "lock or DWORD PTR [rsp], 0" which
MSVC generates.
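For comparison with the other two options, option 1 would look roughly like this. This is a sketch only: the function name is hypothetical (chosen so it doesn't clash with the real declaration), and plain "static inline" stands in for __MINGW_INTRIN_INLINE so it compiles outside the mingw-w64 headers:

```c
/* Option 1 sketch: let the GCC builtin provide both the compiler
   barrier and the processor fence. */
static inline void faststorefence_via_builtin(void) {
    __sync_synchronize();
}
```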
2) Map this to the same as MSVC. The "memory" clobber ensures the
compiler barrier, and the "lock" provides the fence:
    /* "orl" rather than "or": gas needs an explicit size for an immediate */
    __asm__ __volatile__ ("lock orl %[zero], (%%rsp)" :: [zero] "ri" (0) : "memory", "cc");
pros:
- consistent with MSVC.
cons:
- While sfence may have been slower when first introduced, it's faster
than "or" now (see #'s below).
3) Use code like:
__MINGW_INTRIN_INLINE void x__faststorefence(void) {
#ifdef __SSE__ // defined by gcc when sse instructions are available
    __asm__ __volatile__ ("sfence" ::: "memory");
#else
    // "orl" rather than "or": gas needs an explicit size for an immediate
    __asm__ __volatile__ ("lock orl %[zero], (%%rsp)" :: [zero] "ri" (0) : "memory", "cc");
#endif
}
pros:
- Uses the faster sfence if available.
- Falls back to "or" for max compatibility.
cons:
- Not consistent with MSVC.
- SFENCE is not necessarily the fastest on all processors.
I ran some timings using x64 on my i7, and this is what I found:
_mm_sfence: 3,589,817,193
lock or : 14,960,719,245
_mm_mfence: 19,608,594,657
Obviously these results are going to be both highly hw specific and
depend heavily on the code surrounding them. Still...
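A harness along the following lines could be used to repeat the comparison on other hardware. It is a sketch, not the code behind the numbers above: all the names are hypothetical, it measures CPU seconds with clock() rather than whatever unit the figures above use, and every variant pays the same function-pointer call overhead.

```c
#include <stdio.h>
#include <time.h>

typedef void (*fence_fn)(void);

/* Option 1: the GCC builtin. */
static void fence_builtin(void) { __sync_synchronize(); }

#if defined(__x86_64__)
/* The store fence used by option 3. */
static void fence_sfence(void) {
    __asm__ __volatile__ ("sfence" ::: "memory");
}

/* The MSVC-style sequence from option 2. */
static void fence_lock_or(void) {
    __asm__ __volatile__ ("lock orl $0, (%%rsp)" ::: "memory", "cc");
}
#endif

/* Calls fence() `iterations` times and returns the elapsed CPU seconds. */
static double time_fence(fence_fn fence, unsigned long iterations) {
    unsigned long i;
    clock_t start = clock();
    for (i = 0; i < iterations; i++)
        fence();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Prints one line per variant available on this target. */
void report_all(unsigned long iterations) {
    printf("builtin : %.3f s\n", time_fence(fence_builtin, iterations));
#if defined(__x86_64__)
    printf("sfence  : %.3f s\n", time_fence(fence_sfence, iterations));
    printf("lock or : %.3f s\n", time_fence(fence_lock_or, iterations));
#endif
}
```

Calling report_all() with a large iteration count from main() prints the three timings side by side. A tight loop of back-to-back fences says little about real code, so treat the output as a rough ordering only.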
If I were going to pick, I'd probably go with #3. It isn't 100%
identical to MSVC, but it effectively produces the same results, and
will (at least on current processors) generate faster code.
Opinions?
dw
_______________________________________________
Mingw-w64-public mailing list
Mingw-w64-public@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public