> I have a proposal to minimize the invasiveness of changes to TVM while still 
> fundamentally supporting TensorCore. It sits between the methodologies of 
> #4052 and this RFC.
> I suppose the current pain point of supporting TensorCore is the data 
> structure provided by NVIDIA, which introduces non-standard buffer allocation.
> I wrote a microbenchmark earlier to inspect the generated PTX assembly, and it 
> turned out that the fragment no longer exists after codegen: the tensorize 
> intrinsic is just several assembly instructions with 16 operands.
> My proposal is: why don't we just extend the intrinsic and generate the code 
> as embedded assembly?
> @tqchen
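
For reference, a minimal sketch of what such an embedded-assembly intrinsic could look like, assuming an sm_80 `m16n8k16` `mma.sync` shape (this particular variant takes 14 operands; the exact instruction and operand count depend on the fragment shape, so it need not match the 16 observed in the quoted microbenchmark):

```cuda
// Hypothetical warp-level helper: one Tensor Core MMA issued as inline PTX.
// At this level there is no wmma::fragment type left, only raw registers:
// a/b hold packed f16x2 values in 32-bit registers, c/d are f32 accumulators.
__device__ void mma_m16n8k16_f16f32(float d[4], const unsigned a[4],
                                    const unsigned b[2], const float c[4]) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
      : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]),
        "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```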

Sorry for the late reply. We were occupied with refactoring our implementation 
to combine it with #4052.
Generating PTX or even SASS assembly directly is an interesting topic. We may 
investigate and discuss it later. As to the TensorCore CodeGen, I think the 
data structure is not the only pain point. The root is the programming model 
of TensorCore, in which the threads inside a warp are no longer individual 
threads, and high-level information such as matrix_a/b, row/col_major, and the 
strides of a buffer is required in low-level operations. So I suspect that 
generating PTX directly may not relieve these pains. @Hzfengsy what do you 
think about this?
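
To make the programming-model point concrete, here is a minimal WMMA sketch in plain CUDA (not code TVM generates today; `wmma_tile` and its parameters are illustrative). Note how matrix_a/matrix_b and row/col_major are baked into the fragment types, and the buffer stride must be passed to every load and store:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Illustrative single-warp 16x16x16 GEMM tile (requires sm_70+). The whole
// warp executes these calls cooperatively; no single thread owns an element.
__global__ void wmma_tile(const half *A, const half *B, float *C,
                          int lda, int ldb, int ldc) {
  // matrix_a/matrix_b and row/col_major are encoded in the fragment types.
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);
  wmma::load_matrix_sync(a_frag, A, lda);  // stride is required at load time
  wmma::load_matrix_sync(b_frag, B, ldb);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);
}
```

Emitting PTX instead of this API would still need the same metadata to choose the instruction variant and register layout, so the information has to be carried through codegen either way.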
