Hi xiaocenxiaocen,
Thanks. I will follow up on this paper.
Best wishes,
Shawn Wu
---
Hi @Novice,
Yes, I agree that TVM on Tensor Core GPUs still has a lot of room for optimization. Currently we are optimizing the data path between global memory and registers, which we think is a major bottleneck, and we are experimenting with different layouts for both the feature maps and the weights (see the sketch below).
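A minimal sketch of what staging that data path looks like, using the TVM `te` schedule API; the matmul stand-in, the shapes, and the stage names are illustrative assumptions, not the schedule from this work:

```python
import tvm
from tvm import te

# Toy matmul standing in for the conv2d compute (illustrative shapes).
n = 1024
A = te.placeholder((n, n), name="A")
B = te.placeholder((n, n), name="B")
k = te.reduce_axis((0, n), name="k")
C = te.compute((n, n), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
# Stage the operands in shared memory so global loads can be coalesced...
AS = s.cache_read(A, "shared", [C])
BS = s.cache_read(B, "shared", [C])
# ...then in registers ("local") close to the compute.
AL = s.cache_read(AS, "local", [C])
BL = s.cache_read(BS, "local", [C])
```

Which of these loads end up contiguous depends on the operand layout, which is why the feature-map and weight layouts are worth experimenting with.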
---
Updated design details
# Details on legalization
Since most hardware has no native support for bf16 arithmetic, we added a pass `BF16Legalization` that uses fp32 to compute on bf16 data. It has 3 sub-passes: `Promotion`, `Elimination`, and `Lowering`.
## BF16Promotion
It adds `cast_to_fp32()` before the inputs of each op that touches bf16 data, so the computation itself runs in fp32, and casts the result back to bf16 afterwards.
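As a minimal numpy sketch of the numerics this implies, simulating bf16 by truncating fp32 to its top 16 bits (the real sub-pass rewrites the IR; `cast_to_bf16`, `cast_to_fp32`, and `promoted_add` here are illustrative helpers, not TVM APIs):

```python
import numpy as np

def cast_to_bf16(x):
    # Simulate a bf16 value by keeping only the top 16 bits of fp32.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def cast_to_fp32(x):
    # bf16 stored as truncated fp32 is already a valid fp32 value.
    return np.asarray(x, dtype=np.float32)

def promoted_add(a, b):
    # What Promotion does conceptually: cast the bf16 inputs up to
    # fp32, run the op in fp32, then cast the result back to bf16.
    return cast_to_bf16(cast_to_fp32(a) + cast_to_fp32(b))
```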
---
For a Winograd implementation with large batch sizes, you may want to refer to this paper: https://dl.acm.org/doi/pdf/10.1145/3332466.3374520. They implement an assembler for the Volta/Turing architectures and use the CHWN layout for their large-batch Winograd algorithm; a small illustration of that layout follows.
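Here is a small numpy sketch of the NCHW-to-CHWN axis order (the shapes are made up for illustration):

```python
import numpy as np

# NCHW: batch-major, as commonly produced by frameworks.
x_nchw = np.random.rand(32, 64, 56, 56).astype(np.float32)  # N, C, H, W

# CHWN puts the batch dimension innermost (fastest-varying), so with a
# large batch, adjacent threads reading adjacent batch elements get
# contiguous, coalesced loads.
x_chwn = x_nchw.transpose(1, 2, 3, 0)  # C, H, W, N
assert x_chwn.shape == (64, 56, 56, 32)
```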
---
Hi, @Hzfengsy @Shawn_Inspur :slightly_smiling_face:
Thanks for your efforts on supporting Tensor Cores in TVM.
I have tuned Tensor Core schedules on classical networks such as resnet50 and vgg16 (batch size 32), and the tensor_precision_fu_utilization metric reported by nvprof shows that I got a Mid/Low utilization of the Tensor Cores.
[quote="hcho3, post:11, topic:4341"]
Your post saved lots of time for m
[/quote]
@hcho3
I compiled LLVM from GitHub and lld-link.exe was produced, but I still get this error when running the code:
```
RuntimeError: Can not find cl.exe,please run this in Vistual Studio Command Prompt
```
I also us