On 11/14/2014 04:12 AM, Bernd Schmidt wrote:
>>>> - we'll need some synchronization primitives, I see atomic support is
>>>> there, we need mutexes and semaphores I think, is that implementable
>>>> using bar instruction?
>>>
>>> It's probably membar you need.
>>
>> That is a memory barrier, I need threads to wait on each other, wake up
>> one another etc.
>
> Hmm. It's worthwhile to keep in mind that GPU threads really behave
> somewhat differently from CPUs (they don't really execute
> independently); the OMP model may just be a poor match for the
> architecture in general.
> One could busywait on a spinlock, but AFAIK there isn't really a way to
> put a thread to sleep. By not executing independently, I mean this: I
> believe if one thread in a warp is waiting on the spinlock, all the
> other ones are also busywaiting. There may be other effects that seem
> odd if one approaches it from a CPU perspective - for example you
> probably want only one thread in a warp to try to take the spinlock.

Thread synchronization in CUDA is different from conventional CPUs.
Using the gang/thread terminology, there's no way to synchronize two
threads in two different gangs in PTX without invoking separate kernels.
Basically, after a kernel is invoked, the host or the accelerator (the
latter when using dynamic parallelism) waits for the kernel to finish,
and that effectively creates a barrier.

PTX does have an intra-gang synchronization primitive (the bar
instruction), which is helpful if the control flow diverges within a
gang. Also, unless I'm mistaken, the PTX atomic operations only work
within a gang.

Also, keep in mind that PTX doesn't have a global TID. The user needs to
calculate it from ctaid, tid and friends.

Cesar
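
To illustrate the global TID remark above, here is a minimal sketch of
the usual calculation, written as plain host-side C so it can be checked
without a GPU. The function name global_tid and its parameter names are
made up for illustration; on the device the three inputs would come from
the PTX special registers %ctaid.x, %ntid.x and %tid.x. Only the x
dimension is handled, which assumes a one-dimensional launch.

```c
#include <assert.h>

/*
 * Flatten a (gang index, thread-in-gang index) pair into one grid-wide
 * thread index, for a one-dimensional launch.
 *
 *   ctaid : index of the gang (CTA) within the grid   (%ctaid.x)
 *   ntid  : number of threads per gang                (%ntid.x)
 *   tid   : index of the thread within its gang       (%tid.x)
 */
unsigned global_tid(unsigned ctaid, unsigned ntid, unsigned tid)
{
    /* A thread index is always smaller than the gang size. */
    assert(tid < ntid);

    /* Every earlier gang contributes ntid threads. */
    return ctaid * ntid + tid;
}
```

For example, with 128 threads per gang, thread 5 of gang 2 gets index
2 * 128 + 5 = 261. A multi-dimensional launch would need the same
treatment for the y and z components.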