We recently released a transformer model for long documents that is powered by 
a custom CUDA kernel implemented in TVM 
([here's](https://twitter.com/ApacheTVM/status/1249883784410873856) the TVM 
account tweeting about it).

Would anyone be interested in implementing a faster schedule for the kernel? I 
think it would be a great showcase for the usability and efficiency of TVM, and 
it could have a big impact on the NLP community.

In case anyone is interested, here is some background: 

- The kernel is a form of banded matrix multiplication where we only compute 
certain diagonals of the output matrix (see figures 2.b and 2.c in the 
[paper](https://arxiv.org/pdf/2004.05150.pdf)); a rough compute sketch follows this list.

- Our schedule 
[here](https://github.com/allenai/longformer/blob/master/longformer/diagonaled_mm_tvm.py#L85)
 is 16x slower than it should be.

- The `batched_matmul` schedule 
[here](https://github.com/facebookexperimental/tvm/blob/master/topi/python/topi/cuda/batch_matmul.py#L71)
 is 2x faster than ours for the setting in figure 2.b (I will use it instead of 
our schedule for that case), but it is much worse than our schedule for the 
setting in figure 2.c.
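
To make the compute pattern concrete, here is a rough TVM sketch of the banded multiplication. The shapes, names, and window size `w` are illustrative only; the real kernel in `diagonaled_mm_tvm.py` also handles multiple heads, dilation, and the transposed variant.

```python
import tvm
from tvm import te

# Illustrative sizes only: batch, sequence length, head dim, one-sided window.
b, n, d, w = 2, 4096, 64, 256

X = te.placeholder((b, n, d), name="X")
Y = te.placeholder((b, n, d), name="Y")
k = te.reduce_axis((0, d), name="k")

# Z[bb, i, j] = <row i of X, row (i + j - w) of Y>, i.e. only the 2*w + 1
# diagonals around the main diagonal of X @ Y^T are computed; positions that
# fall outside the sequence contribute zero.
Z = te.compute(
    (b, n, 2 * w + 1),
    lambda bb, i, j: te.sum(
        tvm.tir.if_then_else(
            tvm.tir.all(i + j - w >= 0, i + j - w < n),
            X[bb, i, k] * Y[bb, i + j - w, k],
            0.0,
        ),
        axis=k,
    ),
    name="Z",
)
```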

So the question is whether we can implement a schedule that is faster than both 
ours and `batched_matmul`. If anyone is interested in working on this, please 
let me know.
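
For anyone who wants to poke at it, a starting point (continuing from the compute sketch above; the split factor and thread bindings are placeholders, not a tuned configuration) could look like this:

```python
# A deliberately naive CUDA schedule for the Z defined above, just to have
# something that builds; the interesting work is choosing better tiling,
# shared-memory caching, and thread bindings than this.
s = te.create_schedule(Z.op)
bb, i, j = s[Z].op.axis

i_outer, i_inner = s[Z].split(i, factor=32)
s[Z].bind(bb, te.thread_axis("blockIdx.z"))
s[Z].bind(i_outer, te.thread_axis("blockIdx.x"))
s[Z].bind(i_inner, te.thread_axis("threadIdx.x"))

print(tvm.lower(s, [X, Y, Z], simple_mode=True))  # inspect the generated IR
func = tvm.build(s, [X, Y, Z], target="cuda")     # requires a CUDA-enabled TVM build
```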

Thanks
