@anijain2305 Generally good. Regarding hardware performance, say on an ARM CPU: for depthwise convolution we can get good performance even without tensorize. After optimizing int8 with pure TVM schedules (no tensorize), we can also beat QNNPACK (on some workloads we tested, by more than 50%).
However, for normal convolution it is hard to reach the best performance without tensorize. When we use tensorize, one thing we do is fold `bias_add` into `qnn.conv2d` to avoid an extra pass over memory. As @jackwish's previous investigation found, this matters a lot for performance on ARM CPU. So if we implement it as in the diagram, this is my only concern.
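To make the memory-access point concrete, here is a minimal NumPy sketch (not the actual TVM schedule or QNN lowering; all names are illustrative) contrasting a separate `bias_add` pass with adding the bias inside the convolution epilogue, where the accumulator is still live:

```python
import numpy as np

def conv2d_then_bias(data, weight, bias):
    """Unfused: conv2d writes the full int32 output, then bias_add
    re-reads and re-writes it -- an extra pass over the output tensor."""
    N, C, H, W = data.shape
    K, _, R, S = weight.shape
    out = np.zeros((N, K, H - R + 1, W - S + 1), dtype=np.int32)
    for n in range(N):
        for k in range(K):
            for y in range(out.shape[2]):
                for x in range(out.shape[3]):
                    out[n, k, y, x] = np.sum(
                        data[n, :, y:y + R, x:x + S].astype(np.int32)
                        * weight[k].astype(np.int32))
    out += bias[None, :, None, None]  # second pass over `out`
    return out

def conv2d_fused_bias(data, weight, bias):
    """Fused: the bias is added while the accumulator is still
    "hot" (in registers in a real kernel), so each output element
    is written exactly once."""
    N, C, H, W = data.shape
    K, _, R, S = weight.shape
    out = np.empty((N, K, H - R + 1, W - S + 1), dtype=np.int32)
    for n in range(N):
        for k in range(K):
            for y in range(out.shape[2]):
                for x in range(out.shape[3]):
                    acc = np.sum(
                        data[n, :, y:y + R, x:x + S].astype(np.int32)
                        * weight[k].astype(np.int32))
                    out[n, k, y, x] = acc + bias[k]
    return out
```

Both variants produce identical results; the difference is only how many times the (often cache-resident or larger-than-cache) output tensor is touched, which is what the tensorized `qnn.conv2d` + bias fusion avoids.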