As Stefan suggested, I have posted a PR with a better fix for the issue where MinGW, for some reason, emits a call to the symbol __sync_fetch_and_add_4 instead of generating atomic instructions for the __sync_fetch_and_add builtin.

The PR is here:
https://github.com/cython/cython/pull/185

The discussion probably belongs on this list rather than on cython-users:

The problem this addresses is that GCC sometimes does not use the atomic builtins and instead emits calls to __sync_fetch_and_add_4 and __sync_fetch_and_sub_4 when Cython internally refcounts memoryview buffers. For some reason this can happen even on x86 and amd64.

My PR undoes Mark's quick fix, which always uses PyThread_acquire_lock on MinGW. On Windows, PyThread_acquire_lock uses a kernel object (a semaphore) and is not very efficient. I want slicing memoryviews to be fast, which means PyThread_acquire_lock must go. Instead of a Python lock, my PR uses the Windows API atomic function InterlockedAdd to implement the semantics of __sync_fetch_and_add_4 and __sync_fetch_and_sub_4.
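
A minimal sketch of the idea, assuming a MinGW (GCC on Windows) target. It is not the code from the PR: the names pyx_atomic_int, pyx_fetch_and_add_4 and pyx_fetch_and_sub_4 are hypothetical, and it uses InterlockedExchangeAdd rather than InterlockedAdd. InterlockedExchangeAdd returns the value *before* the addition, which is exactly what __sync_fetch_and_add returns; InterlockedAdd returns the value *after* it, so with InterlockedAdd the old value would be InterlockedAdd(p, v) - v. The non-Windows branch is only there so the sketch compiles elsewhere.

```c
#include <stdint.h>

#if defined(_WIN32) && defined(__GNUC__)
#include <windows.h>
typedef LONG pyx_atomic_int;    /* hypothetical name; LONG is 32 bits on Windows */

static __inline__ __attribute__((always_inline))
pyx_atomic_int pyx_fetch_and_add_4(volatile pyx_atomic_int *p, pyx_atomic_int v)
{
    /* Full-barrier atomic add that returns the value *p had before the
       addition, i.e. fetch-and-add semantics. */
    return InterlockedExchangeAdd(p, v);
}
#else
typedef int32_t pyx_atomic_int; /* hypothetical name */

static __inline__ __attribute__((always_inline))
pyx_atomic_int pyx_fetch_and_add_4(volatile pyx_atomic_int *p, pyx_atomic_int v)
{
    /* Stand-in for non-Windows builds (GCC >= 4.7 or clang) so the
       sketch still compiles; not part of the MinGW fallback itself. */
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}
#endif

static __inline__ __attribute__((always_inline))
pyx_atomic_int pyx_fetch_and_sub_4(volatile pyx_atomic_int *p, pyx_atomic_int v)
{
    /* Fetch-and-sub is fetch-and-add of the negated amount. */
    return pyx_fetch_and_add_4(p, -v);
}
```

A refcount increment is then pyx_fetch_and_add_4(&count, 1) and a decrement is pyx_fetch_and_sub_4(&count, 1); both return the count as it was before the update.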

Usually MinGW is configured to compile the GNU atomic builtins correctly, and I have yet to see a case where it is not, but one user (JF Gallant) has obviously encountered it. I don't think it is a MinGW-specific problem, but so far it has only been seen on MinGW, and the fix is MinGW-specific (well, it should work on Cygwin too). Whenever MinGW does compile the atomic builtins, they are used directly, so the fix incurs no speed penalty on well-behaved MinGW builds.

I took the liberty of using the GNU extensions __inline__ and __attribute__((always_inline)), which make sure the functions are always inlined and so behave like macros. The rationale is that this is GCC-specific code, so we can assume GNU extensions are available. Without them the code should still work, but there is no guarantee the functions will be inlined. I did not use macros because the call to __sync_fetch_and_add is emitted by the preprocessor, and GCC presumably emits the reference to __sync_fetch_and_add_4 or __sync_fetch_and_sub_4 only after the preprocessing step, when it compiles that call. A macro with that name would never see it, so __sync_fetch_and_sub_4 probably has to be a real function rather than another macro. (I have no way of finding out, since I cannot test it.)
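
One way to keep the code working if the GNU extensions are ever taken away is to hide them behind a macro. This is a hypothetical sketch (the macro PYX_FORCE_INLINE and the function pyx_add_int are made-up names): with GCC it requests forced inlining, and elsewhere it degrades to an ordinary static inline function, which still works but may not be inlined.

```c
/* Hypothetical macro: force inlining where GNU extensions exist,
   otherwise fall back to a plain C99 static inline function. */
#if defined(__GNUC__)
#  define PYX_FORCE_INLINE static __inline__ __attribute__((always_inline))
#else
#  define PYX_FORCE_INLINE static inline
#endif

/* A real function (so a call to it can still be resolved after
   preprocessing) that the compiler is asked to inline at every call site. */
PYX_FORCE_INLINE int pyx_add_int(int a, int b)
{
    return a + b;
}
```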



Regarding Linux and OSX:

Failure of GCC to use the atomic builtins could also happen with other GCC builds, though; I don't think it is a MinGW-only issue. It is probably due to how the GCC build was configured, so as a safeguard we should have this for other OSes too.

http://developer.apple.com/library/ios/#DOCUMENTATION/System/Conceptual/ManPages_iPhoneOS/man3/OSAtomicAdd32.3.html

We probably just need code similar to what I wrote for MinGW. I can write the code, but I don't have a Mac on which to test it.

We should also use OSAtomic* with clang/LLVM, which is now the platform C compiler on OSX. This will keep PyThread_acquire_lock from being the usual synchronization mechanism for refcounting memoryview buffers on OSX.
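
An untested sketch of what the OSX variant might look like, assuming an Apple toolchain where <libkern/OSAtomic.h> is available (Apple has since deprecated OSAtomic in favour of C11 atomics, so expect deprecation warnings). The function name pyx_fetch_and_add_32 is hypothetical. OSAtomicAdd32Barrier returns the *new* value, so the amount is subtracted again to get the old value that __sync_fetch_and_add returns. The non-Apple branch only exists so the sketch compiles elsewhere.

```c
#include <stdint.h>

#if defined(__APPLE__)
#include <libkern/OSAtomic.h>

static __inline__ int32_t pyx_fetch_and_add_32(volatile int32_t *p, int32_t v)
{
    /* OSAtomicAdd32Barrier(amount, ptr) returns the value after the add;
       subtracting v recovers the value before it (fetch-and-add). */
    return OSAtomicAdd32Barrier(v, p) - v;
}
#else
static __inline__ int32_t pyx_fetch_and_add_32(volatile int32_t *p, int32_t v)
{
    /* Non-Apple stand-in (GCC >= 4.7 or clang) so the sketch compiles. */
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}
#endif
```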

On Linux I am not sure what to suggest if GCC fails to use the atomic builtins. I could hand-code inline assembly for x86/amd64, or use pthreads or pth thread locks. But we could also assume that it never happens and just let the linker fail on __sync_fetch_and_add_4.
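
A sketch of the first two options combined, assuming GCC or clang: hand-written LOCK XADD on x86/amd64, and a single process-wide pthread mutex on other architectures. The name pyx_fetch_and_add_int is hypothetical. The mutex path is correct but serializes every refcount update through one global lock, which is the kind of cost the PR is trying to avoid on Windows.

```c
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
static __inline__ int pyx_fetch_and_add_int(volatile int *p, int v)
{
    /* LOCK XADD atomically stores *p + v into *p and leaves the
       previous value of *p in the register holding v. */
    __asm__ __volatile__("lock; xaddl %0, %1"
                         : "+r"(v), "+m"(*p)
                         :
                         : "memory", "cc");
    return v;
}
#else
#include <pthread.h>

/* Other architectures: serialize all updates with one mutex. */
static pthread_mutex_t pyx_atomic_lock = PTHREAD_MUTEX_INITIALIZER;

static int pyx_fetch_and_add_int(volatile int *p, int v)
{
    int old;
    pthread_mutex_lock(&pyx_atomic_lock);
    old = *p;
    *p = old + v;
    pthread_mutex_unlock(&pyx_atomic_lock);
    return old;
}
#endif
```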



Sturla
_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel
