I have a proposal that minimizes the intrusion into TVM while still supporting
TensorCore in a fundamental way. It sits midway between the approach of #4052
and that of this RFC.
I think the current pain point in supporting TensorCore is the data structure
provided by NVIDIA, which introduces non-standard buffer allocation.
I previously wrote a microbenchmark to inspect the generated PTX assembly. It
turned out that the fragment no longer exists after codegen, and the tensorize
intrinsic is just several assembly instructions with 16 operands.
My proposal is: why don't we just extend the intrinsic and generate the code
as embedded (inline) assembly?
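
To make the idea concrete, here is a rough sketch of what "generate the code as embedded assembly" could look like on the codegen side. It is only an illustration under assumptions: `emit_inline_ptx` is a hypothetical helper, not an existing TVM function, and the array names `d`, `a`, `b`, `c` are placeholders for whatever registers the lowered intrinsic would use. The opcode and the eight-registers-per-operand layout follow my reading of the PTX ISA for this one shape and type combination, so the counts should be checked against the PTX documentation for the instruction actually targeted.

```python
def emit_inline_ptx(opcode, operand_groups):
    """Build a CUDA `asm volatile` statement for a single PTX instruction.

    Hypothetical helper for illustration only (not part of TVM).
    operand_groups is a list of (c_array_name, count, constraint) tuples.
    The first group is treated as the output vector, the rest as inputs.
    """
    placeholders, outputs, inputs = [], [], []
    index = 0
    for pos, (name, count, constraint) in enumerate(operand_groups):
        # Each operand vector becomes a brace-enclosed list: {%0, %1, ...}
        regs = ", ".join(f"%{i}" for i in range(index, index + count))
        placeholders.append("{" + regs + "}")
        index += count
        # Output constraints carry the "=" modifier, inputs do not.
        target = outputs if pos == 0 else inputs
        prefix = "=" if pos == 0 else ""
        target.extend(
            f'"{prefix}{constraint}"({name}[{k}])' for k in range(count)
        )
    template = f"{opcode} {', '.join(placeholders)};"
    return (
        f'asm volatile("{template}\\n"\n'
        f'    : {", ".join(outputs)}\n'
        f'    : {", ".join(inputs)});'
    )


# Assumed layout: f32 accumulators in float registers ("f"), packed f16
# inputs in 32-bit registers ("r"). Verify against the PTX ISA before use.
stmt = emit_inline_ptx(
    "wmma.mma.sync.aligned.row.col.m16n16k16.f32.f32",
    [("d", 8, "f"), ("a", 8, "r"), ("b", 8, "r"), ("c", 8, "f")],
)
print(stmt)
```

The point of the sketch is that the fragment type never appears: the intrinsic would only see plain register arrays, so no special buffer allocation would be needed on the TVM side.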
