All we need is **a target backend which can emit and optimize intrinsic IR**.

Let's take a look at what we've done in AKG, which is a tensor compiler for the Davinci core based on TVM.

![image|690x317](upload://jTGrmmQAzngMTBGlRI2CgNBry9x.png) 

**Why do we do this?**

1) NPUs have more SIMD intrinsics than GPU/ARM, but we cannot count on LLVM for auto-vectorization/tensorization.
2) The low-level LLVM compiler provides a C/C++ & intrinsics language for users,
3) but C/C++ with intrinsics is very unfriendly to program. **First**, users need to learn a lot about the ISA and the target machine. **Second**, LLVM always treats intrinsics as black boxes, which means users have to optimize the code manually.
4) NPU SIMD is more complicated and flexible than traditional SIMD. NPU SIMD can move/compute data with multiple strides, so a single instruction may move whole blocks. For the same loop nest, we may have different configurations when mapping it onto intrinsics, and different configurations mean different performance on the NPU. This is a big burden for users when they use C/C++ & intrinsics directly.
5) We can also do a lot of target-related optimization here; see the graph above.
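To make point 4 concrete, here is a minimal sketch in plain NumPy (not real NPU intrinsics; `strided_move` and all of its parameters are hypothetical) of how one multi-stride "move" instruction can cover a whole loop nest, and how two different configurations of the same instruction produce the same result:

```python
import numpy as np

def strided_move(dst, src, repeat, block_len, src_stride, dst_stride):
    """Hypothetical NPU-style move intrinsic: one call copies `repeat`
    blocks of `block_len` elements, stepping `src_stride` / `dst_stride`
    elements between consecutive blocks."""
    for r in range(repeat):
        dst[r * dst_stride : r * dst_stride + block_len] = \
            src[r * src_stride : r * src_stride + block_len]

src = np.arange(64, dtype=np.int32)   # a 4x16 row-major buffer, flattened

# One "instruction" pulls the 4x8 left tile out of the 4x16 buffer by
# striding 16 elements between source blocks (one block per row).
tile = np.zeros(32, dtype=np.int32)
strided_move(tile, src, repeat=4, block_len=8, src_stride=16, dst_stride=8)

# The same contiguous 32-element copy under two different configurations;
# both are correct, but on a real NPU each may perform differently.
a = np.zeros(32, dtype=np.int32)
b = np.zeros(32, dtype=np.int32)
strided_move(a, src, repeat=4, block_len=8, src_stride=8, dst_stride=8)
strided_move(b, src, repeat=8, block_len=4, src_stride=4, dst_stride=4)
```

Choosing among such equivalent configurations is exactly the mapping decision a user would otherwise have to make by hand in C/C++ & intrinsics.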

For @jcf94's issue, it is basically the same as ours, except that ARM/RISC-V intrinsics are much simpler than the NPU's (just one-dimensional SIMD). If we want to control more details, we should support emitting and optimizing intrinsics in TIR, which means we may have target backends in TIR. If we only need to support normal CPU/GPU targets, the current design is enough.





---
[Visit Topic](https://discuss.tvm.apache.org/t/do-we-have-any-way-to-process-codegen-with-more-fine-grade-control/9908/8) to respond.
