@anijain2305 Generally looks good. Regarding hardware performance, take the ARM CPU as an example. For depthwise convolution, we can optimize well even without tensorize: after some int8 optimization work using pure TVM schedules (no tensorize), we can also beat QNNPACK, by more than 50% on some workloads we tested.

However, for normal convolution it is hard to reach the best performance without tensorize. When we use tensorize, one thing we do is fuse `bias_add` into `qnn.conv2d` to avoid an extra pass over memory. From @jackwish's earlier investigation, we found this fusion matters a lot for performance on ARM CPUs. So if we implement it as shown in the diagram, this is my only concern.
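To make the memory-access point concrete, here is a plain NumPy sketch (not the actual TVM schedule; the function names and the tiny direct-convolution loop are mine, for illustration only). Unfused, the convolution output is fully written out and then a second pass re-reads the whole tensor just to add the bias; fused, the bias is folded in while each output element is produced, so the output is touched once:

```python
import numpy as np

def conv2d(data, weight):
    # Naive direct conv: data (C_in, H, W), weight (C_out, C_in, KH, KW),
    # stride 1, no padding, int8 inputs accumulated in int32.
    c_in, h, w = data.shape
    c_out, _, kh, kw = weight.shape
    out = np.zeros((c_out, h - kh + 1, w - kw + 1), dtype=np.int32)
    for co in range(c_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[co, i, j] = np.sum(
                    data[:, i:i + kh, j:j + kw].astype(np.int32)
                    * weight[co].astype(np.int32))
    return out

def conv2d_then_bias(data, weight, bias):
    # Unfused: a second full pass over the output tensor only to add bias.
    out = conv2d(data, weight)
    return out + bias[:, None, None]

def conv2d_fused_bias(data, weight, bias):
    # Fused: bias is added as each output element is computed,
    # so the output tensor is written exactly once.
    c_in, h, w = data.shape
    c_out, _, kh, kw = weight.shape
    out = np.empty((c_out, h - kh + 1, w - kw + 1), dtype=np.int32)
    for co in range(c_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                acc = np.sum(
                    data[:, i:i + kh, j:j + kw].astype(np.int32)
                    * weight[co].astype(np.int32))
                out[co, i, j] = acc + bias[co]
    return out
```

Both produce identical results; the difference is purely in how many times the output tensor travels through the memory hierarchy, which is what matters on bandwidth-limited ARM cores.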

https://github.com/dmlc/tvm/issues/2351#issuecomment-508657963