@tqchen The problem arises because LLVM codegen is unable to select the suitable instructions. A fixed-point multiply at the Relay level has to upcast the input tensors to int64, whereas the ARM instructions that @giuseros shared take int32 inputs and perform the widening internally in hardware (please correct me if I am wrong - @giuseros). As a result, QNN/Relay graphs today do not use the best possible ARM instructions.
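For reference, here is a minimal C sketch (gemmlowp-style rounding-doubling high multiply, not TVM's actual lowering) of why the generic path needs int64 intermediates; an instruction like ARM's SQRDMULH performs this whole operation on int32 lanes in a single instruction:

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Generic fixed-point multiply step: the int32 operands must be widened to
 * int64 before the (doubled, rounded) high half of the product is taken. */
static int32_t rounding_doubling_high_mul(int32_t a, int32_t b)
{
    bool overflow = (a == b) && (a == INT32_MIN);   /* only case that saturates */
    int64_t ab = (int64_t)a * (int64_t)b;           /* explicit int64 upcast    */
    int64_t nudge = ab >= 0 ? (1LL << 30) : (1 - (1LL << 30));
    int32_t high = (int32_t)((ab + nudge) / (1LL << 31));
    return overflow ? INT32_MAX : high;
}
```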
At the same time, I share the concern that this may be overkill. I had missed this earlier, but introducing a new op prevents operator fusion, which reduces the speedup from 3% to 1.5%.