# Introduction and motivation

Mathematically, the fixed point multiplication (FPM) can be described as:

`fpm(x,m,s) = round(x*m*2^(s-31))`

In this expression:

* `x` is the quantized value to multiply, and `m` and `s` [are an integer 
multiplier and a shift](https://arxiv.org/pdf/1712.05877.pdf). 

* The function `round` can be any of the rounding rules described 
[here](https://arxiv.org/pdf/1712.05877.pdf).

FPM is at the heart of the requantization process in quantized neural networks 
(QNNs), where the 32-bit integers resulting from a convolution or GEMM need to 
be requantized to a narrower data type (usually int8 or uint8). 
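
For context, the pair `(m, s)` is usually derived from the real-valued requantization scale. Below is a minimal sketch of that decomposition, following the scheme in the paper linked above (the helper name is illustrative and not part of this RFC):

```
import math

def quantize_scale(scale):
    # Decompose a positive real scale into (m, s) such that
    # m * 2^(s - 31) ~= scale, with m a 31-bit integer multiplier
    mantissa, exponent = math.frexp(scale)  # scale = mantissa * 2^exponent, mantissa in [0.5, 1)
    m = int(round(mantissa * (1 << 31)))    # Q31 representation of the mantissa
    if m == (1 << 31):                      # rounding can push the mantissa up to 1.0
        m //= 2
        exponent += 1
    return m, exponent                      # s = exponent (negative for scales below 0.5)
```

For example, `quantize_scale(0.25)` returns `(1 << 30, -1)`, and indeed `2^30 * 2^(-1-31) = 0.25`.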

Our analysis shows that we can achieve up to a 3% improvement on Arm targets by 
speeding up FPM. Even though this might not seem like much, in a previous 
[RFC](https://discuss.tvm.ai/t/rfc-improve-quantized-convolution-performance-for-armv8-architectures/6920)
 we showed that we are now only 4% away from frameworks like TFLite, so even a 
tiny 3% improvement is appealing for us. 

# Background

In its current state, TVM implements FPM as a [sequence of Relay 
operators](https://github.com/apache/incubator-tvm/blob/master/src/relay/qnn/util.cc#L78). 
The pseudo-code is shown below:

```
def fixed_point_multiply(x, fixed_point_multiplier, right_shift):
    x = cast(x, int64) * fixed_point_multiplier
    total_right_shift = right_shift + 31
    pos_rounding_value = 1 << (total_right_shift - 1)
    x = x + pos_rounding_value
    x = x >> total_right_shift
    return cast(x, int32)
```

* All the operators (shift, sum, multiplication) are Relay operators
* The computation is mostly carried out in 64 bits, converting to 32 bits only 
at the very end of the FPM operator, so it stays very close to the mathematical 
expression described above
* TVM picks a `to-nearest` rounding rule and breaks ties upward (i.e., `x.5` 
becomes `x+1`). 
* The Relay implementation also considers the case of a negative right shift 
(not shown in the pseudo-code; see the sketch below)
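
As a sketch of how the negative case can be folded in (illustrative only, the exact upstream code may differ): a negative right shift is just a multiplication by a power of two, which is exact, so it can be applied up front and the rest of the computation proceeds with a non-negative shift.

```
def fixed_point_multiply_with_left_shift(x, fixed_point_multiplier, right_shift):
    # A negative right shift is an exact left shift; apply it first
    # (in the real implementation this happens after the widening cast
    # to 64 bits, so it cannot overflow)
    if right_shift < 0:
        x = x * (1 << (-right_shift))
        right_shift = 0
    return fixed_point_multiply(x, fixed_point_multiplier, right_shift)
```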

However, architectures like Armv8-A provide interesting instructions to execute 
this operation directly in 32 bits.  In particular, [it can be 
shown](https://arxiv.org/pdf/1712.05877.pdf) that this operation can be 
achieved (on Armv8-A targets) as a combination of 
[sqrdmulh](https://developer.arm.com/docs/dui0802/a/a64-advanced-simd-scalar-instructions/sqrdmulh-scalar)
 and 
[srshl](https://developer.arm.com/docs/dui0802/a/a64-advanced-simd-scalar-instructions/srshl-scalar)
 instructions (which operate on vectors of four 32-bit values). Specifically:

* `sqrdmulh(a,b)`: executes `((a*b*2)+round_const) * 2^(-31)`. Note that the 
`round_const` is used to round to nearest, breaking ties upward
* `srshl(a,n)`: executes `a*2^(-n)`, always rounding upward (this means we 
need to nudge the result to round to nearest).
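
To make the mapping concrete, here is a scalar Python model of the two instructions (simplified: saturation is only hinted at and the fixup discussed later is ignored). It is only meant to show how their composition reproduces FPM; it is not the actual lowering.

```
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def sqrdmulh(a, b):
    # ((a * b * 2) + 2^30) >> 31, clamped to the int32 range
    res = (2 * a * b + (1 << 30)) >> 31
    return max(INT32_MIN, min(INT32_MAX, res))

def srshl(a, n):
    # rounding shift left; a negative n acts as a rounding right shift
    if n >= 0:
        return a << n
    return (a + (1 << (-n - 1))) >> (-n)

# For a (non-negative) right_shift, fpm(x, m, right_shift) is then
# approximately srshl(sqrdmulh(x, m), -right_shift).
```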

# Design and implementation

We propose to create a TVM intrinsic `fixed_point_multiply` written in TVM IR 
(TIR). In this way:

* The intrinsic can be overloaded by different targets using 
`tvm.target.intrin.register_intrin_rule`
* Each hardware vendor can provide a hardware-specific implementation of the 
operation

In the sections below, we describe the main code changes of this RFC. 

**Relay changes**

We created a new Relay operator `fixed_point_multiplication` and registered a 
compute and an `injective_schedule` for it.

* The Relay operator has two attributes, the multiplier (`m`) and the right 
shift (`s`)
* The compute is a simple loop over the array (i.e., mostly like a unary 
operation); see the sketch after this list
* The injective schedule takes care of vectorizing the loop.
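
A rough sketch of what the compute could look like, assuming the TIR-level `call_intrin` described in the next section (the exact signatures in the final PR may differ):

```
import tvm
from tvm import te

def fixed_point_multiply_compute(x, multiplier, shift):
    # Elementwise (unary-like) application of the fixed_point_multiply
    # intrinsic over the input tensor
    return te.compute(
        x.shape,
        lambda *indices: tvm.tir.call_intrin(
            x.dtype, "fixed_point_multiply", x(*indices), multiplier, shift
        ),
        name="fixed_point_multiply",
    )
```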

**TIR changes**

The main TIR changes are the following:

* We registered a `tvm.intrin.rule.default.fixed_point_multiply` TVM intrinsic 
that executes the same operations as the Relay implementation (but using TIR 
operators). 
* We created a TIR operator `fixed_point_multiply(x,m,s)` which executes the 
call:

`call_intrin(x.dtype, "fixed_point_multiply", x, m, s)`
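
As for the default rule, the following scalar sketch shows what the registered lowering could look like, mirroring the 64-bit pseudo-code above (the exact TIR expressions in the PR may differ, and the vector case additionally needs matching lanes):

```
import tvm

def _fixed_point_multiply_default(op):
    # Same arithmetic as the Relay implementation, expressed with TIR
    # operators: widen to 64 bits, multiply, round to nearest (ties up),
    # shift right, narrow back to 32 bits
    x, multiplier, shift = op.args[0], op.args[1], op.args[2]
    total_right_shift = (shift + 31).astype("int64")
    val = x.astype("int64") * multiplier.astype("int64")
    val = val + (tvm.tir.const(1, "int64") << (total_right_shift - 1))
    val = val >> total_right_shift
    return val.astype(op.dtype)

tvm.target.intrin.register_intrin_rule(
    "default", "fixed_point_multiply", _fixed_point_multiply_default, override=True
)
```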

**Intrinsic overload**

In order to overload the intrinsic for Armv8-A we need to make use of 
`tvm.target.intrin.register_intrin_rule`. However, the intrinsics are 
overloaded by `target_name`, which in the case of Armv8-A is simply `llvm`.

This means that, in order to specialize for `llvm.aarch64`, we had to hack into 
[lower_intrin.cc](https://github.com/apache/incubator-tvm/blob/eafb2aa13d6cd223629f17d5f6aab5a8d4fce7f5/src/tir/transforms/lower_intrin.cc#L43-L51)
 and register a new `llvm.intrin.rule.aarch64.` pattern. 

Given the above tweak, we could easily exploit the 
`tvm.target.intrin.register_intrin_rule` method in order to register a version 
of `fixed_point_multiply` tailored to the Armv8-A ISA. The result is similar to 
the following:

```
def _fixed_point_multiply_arm(op):
    """Implementation of fixed point multiplication
    through arm intrinsics sqrdmulh and srshl"""
    x = op.args[0]
    multiplier = op.args[1]
    shift = op.args[2]

    # Don't use this intrinsic if we don't have a int32x4 vector
    if x.dtype != "int32x4":
        return op

    # Case 1, shift is negative
    sqrdmulh = tvm.tir.call_llvm_intrin(op.dtype, 'llvm.aarch64.neon.sqrdmulh',
                                        tvm.tir.const(2, 'uint32'), x, multiplier)

    # Nudge the result so that the final rounding shift rounds to nearest
    fixup = (sqrdmulh & (-shift)) >> 31
    fixed_up_x = (sqrdmulh + fixup)

    out = tvm.tir.call_llvm_intrin(op.dtype, 'llvm.aarch64.neon.srshl',
                                   tvm.tir.const(2, 'uint32'), fixed_up_x, shift)
    return out

tvm.target.intrin.register_intrin_rule("llvm.aarch64", "fixed_point_multiply",
                                       _fixed_point_multiply_arm, override=True)
```

A few notes on the above implementation:

* Please note that we also consider the case of a negative right shift (not 
shown in the code)
* The fixup is needed to round to nearest (instead of rounding upward as 
`srshl` does)
* We decided to use the default implementation when the data (`x`) is not a 
vector

# Final notes on performance and precision

**Performance**

As previously mentioned, the best performance gain from using these intrinsics 
seems to be around 3%, but the improvement we actually measured is only around 
1.5%:

* The 3% improvement is for a case in which the requantization operation is 
fused within the main computation loop (e.g., GEMM or spatial convolution). 
* In TVM, [a quantized convolution is lowered as a sequence of a qnn.conv2d 
followed by a requantize 
operator](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651/17).
 This makes it impossible to fuse requantization into the main compute, which 
explains why we cannot fully achieve the 3% improvement.

**Precision**

There are corner cases in which the intrinsic implementation will have a +1/-1 
error compared to the default TVM implementation. This is because all the 
computation is done in 32 bits (as opposed to the default 64-bit 
implementation), which introduces a rounding error for some edge cases.

# PR

The PR for this RFC is being cleaned up and will be submitted on GitHub soon.




