I wonder whether it is possible for TVM to support CUDA warp-level sync operations. For example, if I want to use shuffle intrinsics, what should I do? If that is not possible, I have to use shared memory instead, but then TVM will generate `__syncthreads()`, which is overkill. If I load and consume shared memory only within a warp, i.e. no memory is shared across warps, I don't need `__syncthreads()`, right?
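
For reference, here is the kind of warp-level CUDA I have in mind. This is a minimal hand-written sketch (the kernel name and indexing are mine, purely illustrative) that reverses 32 values within each warp using `__shfl_sync`, with no shared memory and no block-wide barrier:
```C
// Illustrative hand-written kernel (not TVM output): each warp reverses its
// own 32 elements by exchanging registers across lanes with a shuffle.
__global__ void reverse_in_warp(const int* __restrict__ in,
                                int* __restrict__ out) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int lane = threadIdx.x & 31;  // lane id within the warp
  int v = in[idx];
  // Lane x reads the register value held by lane 31 - x of the same warp;
  // __shfl_sync already synchronizes the participating lanes.
  int rev = __shfl_sync(0xffffffffu, v, 31 - lane);
  out[idx] = rev;
}
```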

Let's look at the following naive example:
```python
import tvm
from tvm.te import hybrid

@hybrid.script
def demo_sync(indices):
    out = output_tensor((indices.shape[0],), 'int32')
    sm_i = allocate((128,), 'int32', 'shared')
    for b in bind('blockIdx.x', indices.shape[0] // 128):
        for y in range(4):
            for x in bind('threadIdx.x', 32):
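                # each block uses a single warp (32 threads), so sm_i is
                # staged and read back only within that warp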
                sm_i[y * 4 + x] = indices[b * 128 + y * 4 + x]
                out[y * 4 + x] = sm_i[y * 4 + 31 - x]
    return out

indices = tvm.te.placeholder((1024,), 'int32', 'indices')
out = demo_sync(indices)
sched = tvm.te.create_schedule(out.op)
f = tvm.build(sched, [indices, out], target='cuda')
print(f.imported_modules[0].get_source())
```
It generates the following CUDA code:
```C
extern "C" __global__ void default_function_kernel0(int* __restrict__ indices, 
int* __restrict__ demo_sync) {
  __shared__ int sm_i[128];
  for (int y = 0; y < 4; ++y) {
    __syncthreads();
    sm_i[(((y * 4) + ((int)threadIdx.x)))] = indices[((((((int)blockIdx.x) * 
128) + (y * 4)) + ((int)threadIdx.x)))];
    __syncthreads();
    demo_sync[(((y * 4) + ((int)threadIdx.x)))] = sm_i[((((y * 4) + 31) - 
((int)threadIdx.x)))];
  }
}
```
I think these two `__syncthreads()` calls are unnecessary. Is it possible to fix this, either by using shuffle intrinsics instead of shared memory, by simply not generating the sync operations, or by emitting `__syncwarp()`?
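
For instance, since `threadIdx.x` only ranges over 32 threads here (one warp per block), a warp-level barrier would already be enough to order the shared-memory write before the reversed read. Here is a hand-edited sketch of what I'd hope TVM could emit (same body as the generated kernel above, only the barriers swapped):
```C
// Hand-edited variant of the generated kernel (not actual TVM output):
// __syncwarp() replaces __syncthreads(), which is safe here because each
// block has a single warp and sm_i is never shared across warps.
extern "C" __global__ void default_function_kernel0(int* __restrict__ indices, int* __restrict__ demo_sync) {
  __shared__ int sm_i[128];
  for (int y = 0; y < 4; ++y) {
    __syncwarp();
    sm_i[(y * 4) + ((int)threadIdx.x)] = indices[((((int)blockIdx.x) * 128) + (y * 4)) + ((int)threadIdx.x)];
    __syncwarp();
    demo_sync[(y * 4) + ((int)threadIdx.x)] = sm_i[((y * 4) + 31) - ((int)threadIdx.x)];
  }
}
```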




