Thanks @tqchen and @Hzfengsy for your valuable feedback. We are trying out
some of your suggestions and will follow up with you after we have made some
evaluations and trials.
> As we know, using TensorCores will decrease precision, so NVIDIA set up a
> switch to turn TensorCores on and off …
> * It shocks me that your solution is even faster than cuBLAS and cuDNN. I
> tried to reproduce the result but failed. Did you use BatchMatMul and
> BatchConv? And which GPU did you test on? Could you share the details of the
> performance?
>
Our fp16 TensorCore kernels are tuned on a V100 with …
This is really impressive work, congrats!
Please join me in welcoming @soiferj as a new reviewer of the Apache TVM
project. Jon has extended TOPI with new operators, extended coverage of the
ONNX and TensorFlow Relay frontends, and added IR passes to combine parallel
dense ops, among other contributions.
Hi @jianyuh, I am getting the following error when I try to run my benchmark:
~~~
LLVM ERROR: Cannot select: 0x23809ef0: v16i32 = X86ISD::VPDPBUSD 0x210a09a8, 0x210a02c0, 0x19eb81b0
  0x210a09a8: v16i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Cons…
~~~
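For what it's worth, `X86ISD::VPDPBUSD` is the AVX-512 VNNI `vpdpbusd` instruction, so a likely cause is building for a CPU target that does not advertise the VNNI feature. A minimal sketch of the fix, assuming the current `tvm.target` API (the target strings are illustrative):

```python
# Sketch: vpdpbusd can only be instruction-selected when the LLVM target
# enables AVX-512 VNNI, so pick an -mcpu that has it.
import tvm

# Cascade Lake advertises AVX-512 VNNI, so VPDPBUSD selects fine:
target = tvm.target.Target("llvm -mcpu=cascadelake")

# Skylake-X has AVX-512 but no VNNI; an int8 schedule that emits vpdpbusd
# would hit "Cannot select" there:
# target = tvm.target.Target("llvm -mcpu=skylake-avx512")
```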
Thank you for the RFC. It is a complete TensorCore support solution, and it is
nice that you can support different types and different data layouts, which my
solution currently does not.
## Lower Passes vs Intrinsic
An intrinsic is a tool for describing what instructions can be done on
specific hardware …
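As a concrete illustration, this is roughly what a tensor intrinsic declaration looks like; a minimal sketch assuming TVM's `te` tensorize API, with `gemv_update` standing in as a hypothetical hardware/library routine:

```python
import tvm
from tvm import te

def intrin_gemv(m, l):
    # Compute rule: the unit of work that one hardware instruction performs.
    a = te.placeholder((l,), name="a")
    b = te.placeholder((m, l), name="b")
    k = te.reduce_axis((0, l), name="k")
    c = te.compute((m,), lambda i: te.sum(a[k] * b[i, k], axis=k), name="c")

    # Buffer declarations so tensorize can match strides at the call site.
    Ab = tvm.tir.decl_buffer(a.shape, a.dtype, name="A", offset_factor=1,
                             strides=[1])
    Bb = tvm.tir.decl_buffer(b.shape, b.dtype, name="B", offset_factor=1,
                             strides=[te.var("s1"), 1])
    Cb = tvm.tir.decl_buffer(c.shape, c.dtype, name="C", offset_factor=1,
                             strides=[1])

    def intrin_func(ins, outs):
        # Lowering rule: replace the matched loop nest with one extern call.
        ib = tvm.tir.ir_builder.create()
        aa, bb = ins
        cc = outs[0]
        ib.emit(tvm.tir.call_extern("int32", "gemv_update",  # hypothetical
                                    cc.access_ptr("w"), aa.access_ptr("r"),
                                    bb.access_ptr("r"), m, l, bb.strides[0]))
        return ib.get()

    return te.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb})
```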
Thanks for the RFC; also cross-linking to https://github.com/dmlc/tvm/issues/4052.
## Non-standard buffer allocation
We are moving toward using special memory scopes to annotate special
memory (e.g., mma). The use of ```new_expr``` was convenient, but nevertheless
a bit too close to the low level …
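A minimal sketch of the scope-based alternative, assuming the `wmma.*` scope names from the TensorCore discussion (shapes are illustrative):

```python
import tvm
from tvm import te

n = 16
A = te.placeholder((n, n), dtype="float16", name="A")
B = te.placeholder((n, n), dtype="float16", name="B")
k = te.reduce_axis((0, n), name="k")
C = te.compute((n, n),
               lambda i, j: te.sum(A[i, k].astype("float32") *
                                   B[k, j].astype("float32"), axis=k),
               name="C")

s = te.create_schedule(C.op)
# Special memory scopes mark which buffers live in wmma fragments,
# rather than allocating them through new_expr:
AF = s.cache_read(A, "wmma.matrix_a", [C])
BF = s.cache_read(B, "wmma.matrix_b", [C])
CF = s.cache_write(C, "wmma.accumulator")
```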
> Awesome solution! Just curious: for shapes that perform worse than
> cuDNN/cuBLAS, what kind of tuning is used?
Good point! We have had some internal discussions about whether we need to
automatically search the schedule space based on the measured performance of
TensorCore versus non-TensorCore kernels, since …
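A rough sketch of that selection idea, assuming hypothetical `build_tensorcore`/`build_plain` builder callables (only `time_evaluator` is a real TVM API here):

```python
import tvm

def pick_faster(build_tensorcore, build_plain, args, dev):
    # Benchmark both kernels on the same workload and keep the faster one.
    # build_tensorcore/build_plain are hypothetical callables returning a
    # built tvm.runtime.Module for the same workload.
    best, best_ms = None, float("inf")
    for build in (build_tensorcore, build_plain):
        mod = build()
        timer = mod.time_evaluator(mod.entry_name, dev, number=10)
        ms = timer(*args).mean * 1e3
        if ms < best_ms:
            best, best_ms = mod, ms
    return best
```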
> Awesome solution! Just curious: for shapes that perform worse than
> cuDNN/cuBLAS, what kind of tuning is used?
We haven’t spent much effort on performance tuning yet. For cases with bad
performance, we plan to profile first to figure out the causes. One possible
optimization is to m…
Awesome solution! Just curious: for shapes that perform worse than
cuDNN/cuBLAS, what kind of tuning is used?
#4052 @Hzfengsy
We propose a solution for TensorCore CodeGen with significant transparency,
flexibility, and usability. In this solution, the algorithm description and
schedule for TensorCore CodeGen are no different from those of normal CUDA
CodeGen. All the information needed by the wmma API, such as
matrix_a/matrix_b …
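A small sketch of what that claim looks like in practice: the schedule below is written exactly like an ordinary CUDA schedule, with no TensorCore-specific constructs in the user-facing code (shapes and tiling factors are illustrative):

```python
import tvm
from tvm import te

n = 1024
A = te.placeholder((n, n), dtype="float16", name="A")
B = te.placeholder((n, n), dtype="float16", name="B")
k = te.reduce_axis((0, n), name="k")
C = te.compute((n, n),
               lambda i, j: te.sum(A[i, k].astype("float32") *
                                   B[k, j].astype("float32"), axis=k),
               name="C")

s = te.create_schedule(C.op)
i, j = s[C].op.axis
bi, ti = s[C].split(i, factor=16)
bj, tj = s[C].split(j, factor=16)
s[C].reorder(bi, bj, ti, tj)
s[C].bind(bi, te.thread_axis("blockIdx.x"))
s[C].bind(bj, te.thread_axis("blockIdx.y"))
s[C].bind(ti, te.thread_axis("threadIdx.y"))
s[C].bind(tj, te.thread_axis("threadIdx.x"))
```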