All we need is **a target backend that can emit and optimize intrinsic IR**.
Let's take a look at what we've done in AKG, a tensor compiler for the Davinci core based on TVM. **Why do we do this?**

1) NPUs have more SIMD intrinsics than GPU/ARM, and we cannot count on LLVM for auto-vectorization/tensorization.
2) The low-level LLVM compiler exposes a C/C++ & intrinsics language to users,
3) but C/C++ & intrinsics are very unfriendly to program with. **First**, users need to learn a lot about the ISA and the target machine. **Second**, LLVM always treats intrinsics as black boxes, which means users have to optimize the code manually.
4) NPU SIMD is more complicated and flexible than traditional SIMD. NPU SIMD instructions can move/compute data with multiple strides, so a single instruction may move whole blocks. For the same loop nest, we may have different configurations when mapping it to intrinsics, and different configurations mean different performance on the NPU. This is a big burden for users who write C/C++ & intrinsics directly.
5) We can also do lots of target-related optimization here; see the graph above.

For @jcf94's issue, it's basically the same as ours, except that the intrinsics of ARM/RISC-V are much simpler than those of an NPU (just one-dimensional SIMD). If we want to control more details, we should support emitting and optimizing intrinsics in TIR, which means we may have target backends in TIR. If we only need to support normal CPU/GPU targets, the current design is enough.
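To make point 4 concrete, here is a minimal Python sketch of the configuration problem. It models a *hypothetical* strided-copy intrinsic (not any real NPU ISA): each call moves `nburst` blocks of `burst_len` elements, stepping `src_stride`/`dst_stride` between blocks. The same tile copy can be lowered either as one intrinsic call per row, or as a single call that folds the row loop into the intrinsic's burst dimension; both produce identical data, but the instruction counts differ, which is exactly the kind of choice that affects NPU performance:

```python
# Hypothetical model of an NPU-style strided data-move intrinsic.
# One call moves `nburst` blocks of `burst_len` elements each,
# advancing `src_stride`/`dst_stride` elements between blocks.
def strided_copy(dst, dst_off, src, src_off,
                 nburst, burst_len, src_stride, dst_stride):
    for b in range(nburst):
        s = src_off + b * src_stride
        d = dst_off + b * dst_stride
        dst[d:d + burst_len] = src[s:s + burst_len]

def copy_tile_naive(dst, src, rows, row_len, src_pitch):
    """Lowering 1: one intrinsic call per row -> `rows` calls."""
    calls = 0
    for r in range(rows):
        strided_copy(dst, r * row_len, src, r * src_pitch,
                     nburst=1, burst_len=row_len, src_stride=0, dst_stride=0)
        calls += 1
    return calls

def copy_tile_batched(dst, src, rows, row_len, src_pitch):
    """Lowering 2: fold the row loop into the burst dimension -> 1 call."""
    strided_copy(dst, 0, src, 0,
                 nburst=rows, burst_len=row_len,
                 src_stride=src_pitch, dst_stride=row_len)
    return 1

# Copy a 4x8 tile out of a 4x16 source buffer.
src = list(range(64))
dst1, dst2 = [0] * 32, [0] * 32
calls_naive = copy_tile_naive(dst1, src, rows=4, row_len=8, src_pitch=16)
calls_batched = copy_tile_batched(dst2, src, rows=4, row_len=8, src_pitch=16)
assert dst1 == dst2          # same result...
print(calls_naive, calls_batched)  # ...different instruction counts: 4 vs 1
```

A compiler backend that understands the intrinsic's semantics can pick the batched lowering automatically; a user writing raw C/C++ & intrinsics has to discover and tune such configurations by hand.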