Problem in a nutshell: pthread_getspecific serializes execution of threads using it. Without a better implementation my program doesn't work on OpenBSD.
I am trying to port Cilk+ to OpenBSD (5.3). Cilk+ is a multithreaded extension to C/C++. It adds some bookkeeping operations in the prologue of some functions. In many Cilk programs most function calls do this bookkeeping (dynamic count, not static).

The per-call bookkeeping calls pthread_getspecific. pthread_getspecific takes out a spinlock. The lock is apparently needed in case of a race with pthread_key_delete. This is unlikely to happen, but I suppose it is possible.

Every function call in this multithreaded program serializes waiting on the lock. Also, the cache line with the lock is constantly moving between processors. This is worse than useless for Cilk. You're much better off with a single threaded program.

An older version of Cilk used a thread local storage class (__thread). If memory serves, the switch to pthread_getspecific was driven by a few considerations:

1. Thread local variables don't get along well with shared libraries.

2. Thread local variables are less portable. OpenBSD doesn't really support them, for example. They are emulated with pthread_getspecific.

3. On Linux/x86 pthread_getspecific is very fast, essentially a move instruction with a segment override.

It seems to me the implementation of pthread_getspecific doesn't need to be as slow as it is. It ought to be possible to have multiple readers be always nonblocking as long as the key list doesn't change, and possibly even if it does change. pthread_getspecific only needs a read lock rather than a mutex. The rwlock in librthread starts with a spinlock, so it's not the answer.

Any thoughts?
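
To make the nonblocking-reader idea concrete, here is a rough sketch in C11 of the kind of read path I have in mind. It is not librthread code: every sketch_* name and SKETCH_KEYS_MAX is made up for illustration, destructors are left out, and the per-thread structure is passed in explicitly, where a real implementation would presumably reach it through the thread's own descriptor. The assumption is that each key carries a sequence counter (odd means in use) and each thread remembers the sequence it saw when it stored a value, so a deleted or recycled key simply reads back as NULL without any lock.

```c
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_KEYS_MAX 64              /* hypothetical limit */

/* Global key table. Odd sequence = key in use, even = key free.
 * Written only by key create/delete, so in the steady state the
 * cache line is read-only and can stay shared between CPUs. */
static atomic_uint sketch_key_seq[SKETCH_KEYS_MAX];

/* Per-thread slots. Only the owning thread reads or writes them. */
struct sketch_slot {
    unsigned seq;                       /* key sequence seen at store time */
    void *value;
};

struct sketch_thread {
    struct sketch_slot slots[SKETCH_KEYS_MAX];
};

/* Claim the first free key by bumping its sequence from even to odd. */
int sketch_key_create(unsigned *key)
{
    for (unsigned k = 0; k < SKETCH_KEYS_MAX; k++) {
        unsigned seq = atomic_load(&sketch_key_seq[k]);
        if ((seq & 1u) == 0 &&
            atomic_compare_exchange_strong(&sketch_key_seq[k], &seq, seq + 1)) {
            *key = k;
            return 0;
        }
    }
    return -1;                          /* no free key */
}

/* Release a key by bumping its sequence from odd to even. Values left
 * in per-thread slots become stale and are ignored by later reads. */
int sketch_key_delete(unsigned key)
{
    if (key >= SKETCH_KEYS_MAX)
        return -1;
    unsigned seq = atomic_load(&sketch_key_seq[key]);
    if ((seq & 1u) != 0 &&
        atomic_compare_exchange_strong(&sketch_key_seq[key], &seq, seq + 1))
        return 0;
    return -1;                          /* key was not in use */
}

/* Store a value in the calling thread's slot, tagged with the key's
 * current sequence. */
int sketch_setspecific(struct sketch_thread *self, unsigned key, void *value)
{
    if (key >= SKETCH_KEYS_MAX)
        return -1;
    unsigned seq = atomic_load_explicit(&sketch_key_seq[key],
                                        memory_order_acquire);
    if ((seq & 1u) == 0)
        return -1;                      /* key is not in use */
    self->slots[key].seq = seq;
    self->slots[key].value = value;
    return 0;
}

/* Read path: one atomic load of shared data, one compare, no lock and
 * no store to any shared cache line. */
void *sketch_getspecific(const struct sketch_thread *self, unsigned key)
{
    if (key >= SKETCH_KEYS_MAX)
        return NULL;
    unsigned seq = atomic_load_explicit(&sketch_key_seq[key],
                                        memory_order_acquire);
    const struct sketch_slot *slot = &self->slots[key];
    if ((seq & 1u) == 0 || slot->seq != seq)
        return NULL;                    /* key free, or value is stale */
    return slot->value;
}
```

A read that races with sketch_key_delete returns either the old value or NULL. As far as I can tell that is acceptable, since POSIX leaves the behavior of pthread_getspecific undefined for a key that has already been deleted. The point of the sketch is only that the hot path never writes to shared memory, so function prologues calling it from many threads would not serialize on one cache line.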