We recently released a transformer model for long documents that is powered by a custom CUDA kernel implemented in TVM ([here's](https://twitter.com/ApacheTVM/status/1249883784410873856) the TVM account tweeting about it).
Would anyone be interested in implementing a faster schedule for the kernel? I think it would be a great showcase for the usability and efficiency of TVM, and it could have a big impact on the NLP community. In case anyone is interested, here is some background:

- The kernel is a form of banded matrix multiplication where we only compute certain diagonals of the output matrix (see figures 2b and 2c in the [paper](https://arxiv.org/pdf/2004.05150.pdf)); a rough sketch of the compute definition is at the end of this post.
- Our schedule [here](https://github.com/allenai/longformer/blob/master/longformer/diagonaled_mm_tvm.py#L85) is 16x slower than it should be.
- The `batched_matmul` schedule [here](https://github.com/facebookexperimental/tvm/blob/master/topi/python/topi/cuda/batch_matmul.py#L71) is 2x faster than ours for the setting in figure 2b (I will use it instead of our schedule for that case), but it is much worse than our schedule for the setting in figure 2c.

So the question is whether we can implement a schedule that is faster than both ours and `batched_matmul`. If anyone is interested in working on this, please let me know.

Thanks
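In case it helps anyone get started, here is a rough sketch of the banded-matmul compute definition in TVM's tensor expression language. This is not our actual kernel: the name `diag_mm_sketch`, the fixed one-sided window `w`, the concrete shapes, and the use of the newer `tvm.te` API are assumptions for illustration; the real compute definition is in `diagonaled_mm_tvm.py` linked above.

```python
# Rough sketch (not our actual kernel): banded matmul where, for each query
# position i, we only compute the 2*w + 1 diagonals around i. The function
# name `diag_mm_sketch`, the fixed window `w`, and the shapes below are
# illustrative assumptions; see diagonaled_mm_tvm.py for the real compute.
import tvm
from tvm import te


def diag_mm_sketch(b, n, k, w):
    """C[z, i, d] = sum_r A[z, i, r] * B[z, i + d - w, r] for d in [0, 2w]."""
    A = te.placeholder((b, n, k), name="A", dtype="float32")
    B = te.placeholder((b, n, k), name="B", dtype="float32")
    r = te.reduce_axis((0, k), name="r")
    C = te.compute(
        (b, n, 2 * w + 1),
        lambda z, i, d: te.sum(
            tvm.tir.if_then_else(
                # Zero out positions that fall off either end of the sequence.
                tvm.tir.all(i + d - w >= 0, i + d - w < n),
                A[z, i, r] * B[z, i + d - w, r],
                tvm.tir.const(0.0, "float32"),
            ),
            axis=r,
        ),
        name="C",
    )
    return A, B, C


# A naive default schedule just to make the compute buildable; the whole
# point of this thread is to replace it with something competitive on CUDA.
A, B, C = diag_mm_sketch(b=2, n=4096, k=64, w=256)
s = te.create_schedule(C.op)
print(tvm.lower(s, [A, B, C], simple_mode=True))
```

The `if_then_else` guard only zeroes out the out-of-range positions at the two ends of the sequence; a good schedule would need to tile and bind the batch, position, and diagonal axes to CUDA blocks/threads and cache the reduction efficiently.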