Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

Zhao Wu Wed, 10 Jun 2020 19:56:44 -0700

Thanks for the great work! I have some quick question:
1. Have you tested various models arm cpu? (like A53, A72, A55, A75 and so on). 
According to fb qnnpack blog, it is not always could get best performance using 
`umul` compared with `smlal` instruction (used by now). 
(https://engineering.fb.com/ml-applications/qnnpack/). So just change 
legalization and give it up `smlal` instruction in aarch64 maybe doesn't make 
sense to me. One proof:  our coming feature `Ansor` (auto scheduler) doesn't 
support tensorize (at least till now), however, it could get nice performance 
using `smlal` instruction and beyond TFLite 1.2X on mobilenet v2 quantized 
model (cortex-a53) 
(https://discuss.tvm.ai/t/tflite-and-tvm-comparison-for-quantized-models/6577/4).
 I mean here:
```python
@qnn_conv2d_legalize.register('arm_cpu')
def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
    # ARM prefers the dtypes to be same.
    if is_aarch64_arm():
        return helper_change_dtypes_to_be_same(attrs, inputs, types, 
relay.qnn.op.conv2d)
    return helper_no_fast_int8_hw_legalization(attrs, inputs, types, 
relay.qnn.op.conv2d)
```
It disables us using `SMLAL` instruction.


2. I suggest we keep two schedules (tensorize and default spatial pack). Not 
just check `aarch64` and only use tensorize template. I mean here:
```python
is_aarch64 = "aarch64" in str(isa.target)
if is_aarch64 and data.dtype in ["int8", "uint8"]:
    strategy.add_implementation(
        wrap_compute_conv2d(topi.arm_cpu.compute_conv2d_NHWC_quantized),
        wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_NHWC_quantized),
        name="compute_conv2d_NHWC_quantized.arm_cpu")
else:
    strategy.add_implementation(
        wrap_compute_conv2d(topi.arm_cpu.conv2d_nhwc_spatial_pack),
        wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_nhwc_spatial_pack),
        name="conv2d_nhwc_spatial_pack.arm_cpu")
```

This is our design purpose of strategy. I suspect there is some workload our 
spatial pack could perform better. This situation is the same as Winograd, we 
could perform winograd and default template and choose better. 

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642374967

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

Reply via email to