TVM has a warp memory abstraction. If you use `allocate((128,), 'int32', 'warp')`, TVM will put the data in thread-local registers and then use shuffle operations to make the data available to other threads in the warp. You can also use the shuffles directly if you want. I'm not sure exactly how to use warp shuffles in hybrid script, but you can grep the codebase for `tvm_warp_shuffle`.
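To make the idea concrete, here is a rough pure-Python model of what "registers plus shuffles" means. It is not TVM code and calls no TVM API. The names `WarpBuffer` and `shfl` are made up for illustration, and the layout (element `i` lives in lane `i % 32`, register slot `i // 32`) is an assumption; the layout TVM actually picks may differ.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def shfl(per_lane_values, src_lane):
    """Model of a warp shuffle: every lane receives the value held by src_lane."""
    return [per_lane_values[src_lane] for _ in range(WARP_SIZE)]


class WarpBuffer:
    """Hypothetical model of a 'warp'-scope buffer spread across lane registers."""

    def __init__(self, n):
        assert n % WARP_SIZE == 0, "assumed: size is a multiple of the warp size"
        # Each lane privately holds n // WARP_SIZE registers.
        self.regs = [[0] * (n // WARP_SIZE) for _ in range(WARP_SIZE)]

    def store(self, lane, index, value):
        # Assumed layout: a lane can only write the elements it owns.
        assert index % WARP_SIZE == lane, "lane does not own this element"
        self.regs[lane][index // WARP_SIZE] = value

    def load(self, index):
        # Every lane reads its own register slot, then a shuffle
        # broadcasts the owner's copy to the whole warp.
        owner, slot = index % WARP_SIZE, index // WARP_SIZE
        per_lane = [self.regs[lane][slot] for lane in range(WARP_SIZE)]
        return shfl(per_lane, owner)


# Usage: 128 int32-like values, 4 registers per lane.
buf = WarpBuffer(128)
for i in range(128):
    buf.store(i % WARP_SIZE, i, i * i)
values_seen_by_lanes = buf.load(37)  # one entry per lane
```

The point of the model is that no shared memory is involved: a load of an element owned by another lane becomes a register read followed by a shuffle.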