On 11/14/2014 04:12 AM, Bernd Schmidt wrote:

>>>>     - we'll need some synchronization primitives, I see atomic support is
>>>>       there, we need mutexes and semaphores I think, is that implementable
>>>>       using bar instruction?
>>>
>>> It's probably membar you need.
>>
>> That is a memory barrier, I need threads to wait on each other, wake up one
>> another etc.
> 
> Hmm. It's worthwhile to keep in mind that GPU threads really behave
> somewhat differently from CPUs (they don't really execute
> independently); the OMP model may just be a poor match for the
> architecture in general.
> One could busywait on a spinlock, but AFAIK there isn't really a way to
> put a thread to sleep. By not executing independently, I mean this: I
> believe if one thread in a warp is waiting on the spinlock, all the
> other ones are also busywaiting. There may be other effects that seem
> odd if one approaches it from a CPU perspective - for example you
> probably want only one thread in a warp to try to take the spinlock.

Thread synchronization in CUDA is different from conventional CPUs.
Using the gang/thread terminology, there's no way to synchronize two
threads in two different gangs in PTX without invoking separate kernels.
Basically, after a kernel is invoked, the host/accelerator (the latter
using dynamic parallelism) waits for the kernel to finish, and that
effectively creates a barrier.

PTX does have an intra-gang synchronization primitive, which is helpful
if the control flow diverges within a gang. Also, unless I'm mistaken,
the PTX atomic operations only work within a gang.

Also, keep in mind that PTX doesn't have a global TID. The user needs to
calculate it using ctaid/tid and friends.
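
As a rough sketch of that calculation for the one-dimensional case,
written as plain C so the arithmetic can be checked on the host. The
function name global_tid_1d and its parameter names are made up for
illustration; they stand in for the x components of the PTX special
registers %ctaid, %ntid and %tid, and this is not code from the port:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: flatten (ctaid, tid) into one grid-wide index.
   ctaid_x: index of this CTA (gang) within the grid  (%ctaid.x)
   ntid_x:  number of threads per CTA                 (%ntid.x)
   tid_x:   index of this thread within its CTA       (%tid.x)   */
static uint64_t global_tid_1d(uint32_t ctaid_x, uint32_t ntid_x,
                              uint32_t tid_x)
{
    /* A thread index is always smaller than the CTA size. */
    assert(tid_x < ntid_x);

    /* Widen before multiplying so large grids do not overflow 32 bits. */
    return (uint64_t)ctaid_x * ntid_x + tid_x;
}
```

For example, thread 5 of gang 2 with 128 threads per gang would get
2 * 128 + 5 = 261. Two- and three-dimensional launches would need the
y and z components folded in the same way.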

Cesar
