2013/6/3 dw <limegreenso...@yahoo.com>:
> So, a question about __faststorefence. The current implementation in
> winnt.h is incorrect. I have 3 alternates to propose, and which one is
> "best" depends on the goals of the mingw-w64 project. One approach is "just
> do what MSVC does." However, there's also something to be said for
> "generate the fastest possible code." And for completeness, there's also
> "use a built-in." Details with pros/cons below.
>
> First, the current code:
>
> __MINGW_INTRIN_INLINE void __faststorefence(void) {
>     __asm__ __volatile__ ("" ::: "memory");
> }
>
> While the "memory" clobber generates a readwritebarrier() for the compiler,
> __faststorefence must also generate some type of fence instruction for the
> processor (sfence, mfence, lock).
>
> So, we can:
>
> 1) Just map this to __sync_synchronize(). This does the full memory
> compiler barrier, and generates an mfence instruction.
> pros:
> - Uses builtin.
>
> cons:
> - Generates mfence instead of sfence (see timing #'s below).
> - Generates mfence even if compiled with -mno-sse (mfence is sse2).
> - Generates mfence instead of the "lock or DWORD PTR [rsp], 0" which MSVC
>   generates.
>
> 2) Map this to the same as MSVC. The "memory" clobber ensures the compiler
> barrier, and the "lock" provides the fence:
>
> asm ("lock or %[zero], (%%rsp)" :: [zero] "ri" (0) : "memory", "cc")
>
> pros:
> - consistent with MSVC.
>
> cons:
> - While sfence may have been slower when first introduced, it's faster than
>   "or" now (see #'s below).
>
> 3) Use code like:
>
> __MINGW_INTRIN_INLINE void x__faststorefence(void) {
>
> #ifdef __SSE__ // defined by gcc when sse instructions are available
>     asm ("sfence" ::: "memory");
> #else
>     asm ("lock or %[zero], (%%rsp)" :: [zero] "ri" (0) : "memory", "cc");
> #endif
>
> }
>
> Pros:
> - Uses the faster sfence if available.
> - Falls back to "or" for max compatibility.
>
> cons:
> - Not consistent with MSVC.
> - SFENCE is not necessarily the fastest on all processors.
>
> I ran some timings using x64 on my i7, and this is what I find:
>
> _mm_sfence:  3,589,817,193
> lock or   : 14,960,719,245
> _mm_mfence: 19,608,594,657
>
> Obviously these results are going to be both highly hw specific and depend
> heavily on the code surrounding them. Still...
>
> If I were going to pick, I'd probably go with #3. It isn't 100% identical
> to MSVC, but it effectively produces the same results, and will (at least on
> current processors) generate faster code.
>
> Opinions?
>
> dw

I think option #3 is the one I prefer too. Just one thing about the SSE
instructions: for 64-bit we can assume that SSE is present in any case.
Only for 32-bit should we check in the headers for the __SSE__ macro, and
in the intrinsic function (non-inline) we should default to the non-SSE
version.

Kai

_______________________________________________
Mingw-w64-public mailing list
Mingw-w64-public@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
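
For illustration, option #3 combined with the 64-bit/32-bit distinction from the reply might look roughly like the sketch below. This is not the code that went into winnt.h. The name sketch_faststorefence is hypothetical, chosen so it cannot clash with the real intrinsic. Writing the fallback as "lock; orl $0, (%esp)" with an explicit operand size, and using __sync_synchronize() (option #1) on non-x86 targets, are assumptions made so the sketch assembles on GCC and Clang.

```c
/* Sketch only: sketch_faststorefence is a hypothetical name, not the
 * winnt.h intrinsic. Assumes GCC or Clang. */
static inline void sketch_faststorefence(void)
{
#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE__))
    /* x86-64 always has SSE; on 32-bit x86, GCC defines __SSE__ only
     * when SSE instructions are enabled for the translation unit.
     * The "memory" clobber supplies the compiler barrier. */
    __asm__ __volatile__ ("sfence" ::: "memory");
#elif defined(__i386__)
    /* Non-SSE fallback: a locked read-modify-write on the top of the
     * stack. OR-ing with 0 leaves the stored value unchanged.
     * The explicit "l" suffix fixes the operand size (assumption:
     * AT&T syntax, the GCC default). */
    __asm__ __volatile__ ("lock; orl $0, (%%esp)" ::: "memory", "cc");
#else
    /* Not x86 (assumption for portability of this sketch): fall back
     * to the builtin full barrier, as in option #1. */
    __sync_synchronize();
#endif
}
```

A caller would place the fence between the stores that must become visible first and whatever follows, for example writing a buffer, calling sketch_faststorefence(), and then setting a "ready" flag.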