I got very close to matching PyTorch's bmm on Vega 20 (Radeon VII) and to about 1.5x on the 1080Ti for the 1024 example (with fixed dims).
One of the limiting factors on the path ahead is, of course, the "-1" issue in the output configurations.

Best regards

Thomas