> I have a proposal to minimize the invasion in TVM and also fundamentally support TensorCore in TVM. It sits in the middle of the methodologies of #4052 and this RFC.
>
> I suppose the current pain point of supporting TensorCore is the data structure provided by NVIDIA, which introduces non-standard buffer allocation. I wrote a microbenchmark before to inspect the generated PTX assembly, and it turned out that the fragment no longer exists after codegen; the tensorize intrinsic is just several assembly instructions with 16 operands.
>
> My proposal: why don't we just extend the intrin and generate the code as embedded assembly?
>
> @tqchen
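For concreteness, here is a minimal sketch of what the quoted "embedded assembly" idea could look like: a warp-level MMA emitted as inline PTX from CUDA C++. The `m16n8k8` shape and the sm_75 target are illustrative assumptions on my part, not necessarily the instruction the microbenchmark observed:

```cuda
// Sketch only: one warp-synchronous Tensor Core MMA via inline PTX.
// The shape (m16n8k8, f16 inputs, f32 accumulate) and the sm_75 target
// are illustrative assumptions, not taken from the original microbenchmark.
__device__ void mma_m16n8k8_f16_f32(float d[4], const unsigned a[2],
                                    const unsigned b[1], const float c[4]) {
  asm volatile(
      "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
      : "r"(a[0]), "r"(a[1]),   // A tile: 2 b32 regs of packed f16
        "r"(b[0]),              // B tile: 1 b32 reg of packed f16
        "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));  // accumulator in
}
```

Note that once lowered to this form, the per-thread register operands replace any notion of a fragment, which matches the observation above that fragments disappear after codegen.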
Sorry for the late reply. We were occupied with refactoring our implementation to combine with #4052.

Generating PTX or even SASS assembly is an interesting topic; we may investigate and discuss it later. As for the TensorCore codegen, I think the data structure is not the only pain point. The root is the programming model of TensorCore, in which the threads inside a warp are no longer individual threads, and high-level information such as matrix_a/matrix_b, row/col_major, and the strides of a buffer is required in low-level operations. So I guess generating PTX directly may not relieve these pains.

@Hzfengsy what do you think about this?
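To make the programming-model point concrete, here is a minimal sketch using NVIDIA's WMMA C++ API (`nvcuda::wmma`): the matrix_a/matrix_b roles and row/col-major layouts are baked into the fragment *types*, and the buffer strides must be passed to every load/store. This is the kind of high-level information a codegen has to carry down, whether it emits WMMA calls or raw PTX. (The 16x16x16 shape and the kernel name are illustrative assumptions.)

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Sketch only: a single 16x16x16 tile computed by one warp.
// Shape, layouts, and names here are illustrative assumptions.
__global__ void wmma_tile(const half *a, const half *b, float *c,
                          int lda, int ldb, int ldc) {
  // Roles (matrix_a / matrix_b) and layouts (row/col major) live in the types.
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);
  // Strides (leading dimensions) must be supplied at every load/store.
  wmma::load_matrix_sync(a_frag, a, lda);
  wmma::load_matrix_sync(b_frag, b, ldb);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  wmma::store_matrix_sync(c, c_frag, ldc, wmma::mem_row_major);
}
```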