I have a proposal that minimizes invasive changes to TVM while still supporting TensorCore at a fundamental level. It sits between the approach of #4052 and that of this RFC. I think the current pain point in supporting TensorCore is the data structure provided by NVIDIA (the fragment), which introduces non-standard buffer allocation. I previously wrote a microbenchmark to inspect the generated PTX assembly, and it turned out that the fragment no longer exists after codegen: the tensorize intrinsic is just several assembly instructions with 16 operands. So my proposal is: why don't we just extend the intrinsic and generate the code as inline assembly?
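To make the idea concrete, here is a minimal sketch of what "generate the code as inline assembly" could look like on the codegen side. It is plain Python string generation, not TVM code: `emit_inline_ptx` and `regs` are hypothetical helpers, and the PTX mnemonic and the per-operand register counts (8 each here) are assumptions that would need to be checked against the PTX ISA for the target shape and data types.

```python
def emit_inline_ptx(mnemonic, groups):
    """Build a CUDA `asm volatile` statement for a single PTX instruction.

    groups: list of (constraint, c_expressions) in PTX operand order.
    The first group is treated as the output (destination) operand.
    Hypothetical helper for illustration only; not part of TVM.
    """
    placeholders, index = [], 0
    for _, exprs in groups:
        # Each operand is a brace-enclosed vector of numbered placeholders.
        slots = ", ".join(f"%{index + i}" for i in range(len(exprs)))
        placeholders.append("{" + slots + "}")
        index += len(exprs)
    template = f"{mnemonic} {', '.join(placeholders)};"

    (out_c, out_exprs), inputs = groups[0], groups[1:]
    outs = ", ".join(f'"={out_c}"({e})' for e in out_exprs)
    ins = ", ".join(f'"{c}"({e})' for c, exprs in inputs for e in exprs)
    return f'asm volatile("{template}" : {outs} : {ins});'


def regs(name, n):
    """Name n plain register variables, e.g. a[0] .. a[n-1]."""
    return [f"{name}[{i}]" for i in range(n)]


# Assumed instruction and register counts; verify against the PTX ISA.
stmt = emit_inline_ptx(
    "wmma.mma.sync.aligned.row.col.m16n16k16.f32.f32",
    [("f", regs("d", 8)), ("r", regs("a", 8)),
     ("r", regs("b", 8)), ("f", regs("c", 8))],
)
print(stmt)
```

The point of the sketch is that the operands are ordinary registers, so the intrinsic could work on regular local buffers instead of the fragment type.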