I wonder whether it is possible for TVM to support CUDA warp-level sync operations. For example, if I want to use shuffle intrinsics, what should I do? If that is not possible, then I have to use shared memory, but TVM will then generate `__syncthreads()`, which is overkill. If I load and consume shared memory only within a warp, i.e. no memory is shared across warps, I don't need `__syncthreads()`, right?
Let's look at the following naive example:

```python
import tvm
from tvm.te import hybrid

@hybrid.script
def demo_sync(indices):
    out = output_tensor((indices.shape[0],), 'int32')
    sm_i = allocate((128,), 'int32', 'shared')
    for b in bind('blockIdx.x', indices.shape[0] // 128):
        for y in range(4):
            for x in bind('threadIdx.x', 32):
                sm_i[y * 4 + x] = indices[b * 128 + y * 4 + x]
                out[y * 4 + x] = sm_i[y * 4 + 31 - x]  # for i in range(32):
    return out

indices = tvm.te.placeholder((1024,), 'int32', 'indices')
out = demo_sync(indices)
sched = tvm.te.create_schedule(out.op)
f = tvm.build(sched, [indices, out], target='cuda')
print(f.imported_modules[0].get_source())
```

It will generate the following CUDA code:

```C
extern "C" __global__ void default_function_kernel0(int* __restrict__ indices, int* __restrict__ demo_sync) {
  __shared__ int sm_i[128];
  for (int y = 0; y < 4; ++y) {
    __syncthreads();
    sm_i[(((y * 4) + ((int)threadIdx.x)))] = indices[((((((int)blockIdx.x) * 128) + (y * 4)) + ((int)threadIdx.x)))];
    __syncthreads();
    demo_sync[(((y * 4) + ((int)threadIdx.x)))] = sm_i[((((y * 4) + 31) - ((int)threadIdx.x)))];
  }
}
```

I think the two `__syncthreads()` calls are not necessary. Is it possible to fix this, either by using shuffle instead, by not generating the sync operations at all, or by using `__syncwarp()`?
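For reference, the kind of kernel I would hope to get with shuffle intrinsics looks roughly like this. This is a hand-written sketch, not TVM output, and it assumes `blockDim.x == 32`, so each block is a single warp and the full mask `0xffffffff` is active:

```C
// Hand-written sketch (not TVM output): same computation as the generated
// kernel above, but the intra-warp reversal is done with __shfl_sync instead
// of shared memory, so no shared memory or __syncthreads() is needed.
extern "C" __global__ void demo_sync_shuffle(int* __restrict__ indices, int* __restrict__ demo_sync) {
  for (int y = 0; y < 4; ++y) {
    int v = indices[(((int)blockIdx.x) * 128) + (y * 4) + ((int)threadIdx.x)];
    // Lane x reads the value that lane (31 - x) loaded.
    demo_sync[(y * 4) + ((int)threadIdx.x)] = __shfl_sync(0xffffffff, v, 31 - ((int)threadIdx.x));
  }
}
```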
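Alternatively, if the shared-memory staging is kept, `__syncwarp()` should be enough here, since with `blockDim.x == 32` the buffer `sm_i` is never shared across warps. Again a hand-written sketch of what I have in mind, not something TVM currently generates:

```C
// Hand-written sketch (not TVM output): same as the generated kernel, with
// __syncthreads() replaced by __syncwarp(); valid only because the whole
// block is a single warp, so sm_i is not shared across warps.
extern "C" __global__ void demo_sync_warp(int* __restrict__ indices, int* __restrict__ demo_sync) {
  __shared__ int sm_i[128];
  for (int y = 0; y < 4; ++y) {
    __syncwarp();
    sm_i[(y * 4) + ((int)threadIdx.x)] = indices[(((int)blockIdx.x) * 128) + (y * 4) + ((int)threadIdx.x)];
    __syncwarp();
    demo_sync[(y * 4) + ((int)threadIdx.x)] = sm_i[(y * 4) + 31 - ((int)threadIdx.x)];
  }
}
```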