# Introduction and motivation
Mathematically, the fixed point multiplication (FPM) can be described as:

`fpm(x,m,s) = round(x*m*2^(s-31))`

In this expression:

* `x` is the quantized value to multiply, and `m` and `s` [are an integer multiplier and a shift](https://arxiv.org/pdf/1712.05877.pdf).
* The function `round` can be any of the rounding rules described [here](https://arxiv.org/pdf/1712.05877.pdf).

FPM is at the heart of the requantization process in quantized neural networks (QNNs), where the 32-bit integers resulting from a convolution or GEMM need to be requantized to a narrower data type (usually int8 or uint8).

Our analysis shows that we can achieve up to a 3% improvement on Arm targets by speeding up FPM. Even though this might not seem like a lot, in a previous [RFC](https://discuss.tvm.ai/t/rfc-improve-quantized-convolution-performance-for-armv8-architectures/6920) we showed that we are now 4% away from frameworks like TFLite, so even a tiny 3% improvement is appealing to us.

# Background

In its current state, TVM implements FPM as a [sequence of Relay operators](https://github.com/apache/incubator-tvm/blob/master/src/relay/qnn/util.cc#L78). The pseudo-code is shown below:

```
def fixed_point_multiply(x, fixed_point_multiplier, right_shift):
    x = cast(x, int64) * fixed_point_multiplier
    total_right_shift = right_shift + 31
    pos_rounding_value = 1 << (total_right_shift - 1)
    x = x + pos_rounding_value
    x = x >> total_right_shift
    return cast(x, int32)
```

* All the operators (shift, sum, multiplication) are Relay operators.
* Most of the computation is carried out in 64 bits, converting back to 32 bits only at the very end of the FPM operator; this closely follows the mathematical expression described above.
* TVM picks a `to-nearest` rounding rule and breaks ties upward (i.e., `x.5` becomes `x+1`).
* The Relay implementation also handles the case of a negative right shift (not shown in the pseudo-code).

However, architectures like Armv8-A provide interesting instructions to execute this operation directly in 32 bits. In particular, [it can be shown](https://arxiv.org/pdf/1712.05877.pdf) that this operation can be achieved (on Armv8-A targets) as a combination of the [sqrdmulh](https://developer.arm.com/docs/dui0802/a/a64-advanced-simd-scalar-instructions/sqrdmulh-scalar) and [srshl](https://developer.arm.com/docs/dui0802/a/a64-advanced-simd-scalar-instructions/srshl-scalar) instructions (which indeed operate on 32-bit quads). In particular:

* `sqrdmulh(a,b)`: executes `((a*b*2)+round_const) * 2^(-31)`. Note that `round_const` is used to round to nearest, breaking ties upward.
* `srshl(a,n)`: executes `a*2^(-n)`, always rounding upward (this means we need to nudge the result to round to-nearest).

# Design and implementation

We propose to create a TVM intrinsic `fixed_point_multiply` written in TVM IR (TIR). In this way:

* The intrinsic can be overloaded by different targets using `tvm.target.intrin.register_intrin_rule`.
* Each hardware vendor can provide a hardware-specific implementation of the operation.

In the sections below, we describe the main code changes of this RFC.

**Relay changes**

We created a new Relay operator `fixed_point_multiplication` and registered a compute and an `injective_schedule` for it:

* The Relay operator has two attributes, the multiplier (`m`) and the right shift (`s`).
* The compute is a simple loop over the array (i.e., mostly like a unary operation).
* The injective schedule has the task of vectorizing the loop (a rough sketch is given below).
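For illustration only, the sketch below shows roughly what such a compute and injective schedule could look like when expressed directly in TE. The helper names, the `call_intrin` wiring and the fixed vectorization factor of 4 are assumptions made for this example, not the actual code of the RFC.

```
import tvm
from tvm import te


def fixed_point_multiply_compute(x, multiplier, shift):
    # Element-wise call into the proposed TIR intrinsic: every lane of the
    # input tensor is multiplied and rounded independently (unary-like op).
    return te.compute(
        x.shape,
        lambda *idx: tvm.tir.call_intrin(
            "int32", "fixed_point_multiply", x(*idx), multiplier, shift
        ),
        name="fixed_point_multiply",
    )


def fixed_point_multiply_schedule(out):
    # Injective-style schedule: split the innermost axis and vectorize it so
    # that the Arm overload (which expects int32x4 vectors) can kick in.
    s = te.create_schedule(out.op)
    _, inner = s[out].split(out.op.axis[-1], factor=4)
    s[out].vectorize(inner)
    return s


# Usage sketch (values are arbitrary):
# x = te.placeholder((1024,), dtype="int32", name="x")
# out = fixed_point_multiply_compute(x, multiplier=1395864371, shift=1)
# sch = fixed_point_multiply_schedule(out)
```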
**TIR changes**

The main TIR changes are the following:

* We registered a `tvm.intrin.rule.default.fixed_point_multiply` TVM intrinsic that executes the same operations as the Relay implementation (but using TIR operators).
* We created a TIR operator `fixed_point_multiply(x,m,s)` which executes the call: `call_intrin(x.dtype, "fixed_point_multiply", x, m, s)`

**Intrinsic overload**

In order to overload the intrinsic for Armv8-A we need to make use of `tvm.target.intrin.register_intrin_rule`. However, the intrinsics are overloaded by `target_name`, which in the case of Armv8-A is only `llvm`. This means that, in order to specialize for `llvm.aarch64`, we had to hack into [lower_intrin.cc](https://github.com/apache/incubator-tvm/blob/eafb2aa13d6cd223629f17d5f6aab5a8d4fce7f5/src/tir/transforms/lower_intrin.cc#L43-L51) and register a new `llvm.intrin.rule.aarch64.` pattern.

Given the above tweak, we could easily exploit the `tvm.target.intrin.register_intrin_rule` method to register a version of `fixed_point_multiply` tailored to the Armv8-A ISA. The result is similar to the following:

```
def _fixed_point_multiply_arm(op):
    """
    Implementation of fixed point multiplication through arm
    intrinsics sqrdmulh and srshl
    """
    x = op.args[0]
    multiplier = op.args[1]
    shift = op.args[2]

    # Don't use this intrinsic if we don't have a int32x4 vector
    if x.dtype != "int32x4":
        return op

    # Case 1, shift is negative
    sqrdmulh = tvm.tir.call_llvm_intrin(op.dtype,
                                        'llvm.aarch64.neon.sqrdmulh',
                                        tvm.tir.const(2, 'uint32'),
                                        x,
                                        multiplier)
    # Nudge the result so that the final rounding shift rounds to-nearest
    fixup = (sqrdmulh & (-shift)) >> 31
    fixed_up_x = (sqrdmulh + fixup)
    out = tvm.tir.call_llvm_intrin(op.dtype,
                                   'llvm.aarch64.neon.srshl',
                                   tvm.tir.const(2, 'uint32'),
                                   fixed_up_x,
                                   shift)
    return out


tvm.target.intrin.register_intrin_rule("llvm.aarch64",
                                       "fixed_point_multiply",
                                       _fixed_point_multiply_arm,
                                       override=True)
```

A few notes on the above implementation:

* Please note that we also handle the case of a negative right shift (not shown in the code).
* The fixup is needed to round to nearest (instead of always rounding upward as `srshl` does).
* We decided to fall back to the default implementation when the data (`x`) is not a vector.

# Final notes on performance and precision

**Performance**

As previously mentioned, the best performance gain from using those intrinsics seems to sit around 3%, but the improvement we measured is only around 1.5%:

* The 3% improvement is for the case in which the requantization operation is fused within the main computation loop (e.g., GEMM or spatial convolution).
* In TVM, [a quantized convolution is lowered as a sequence of a qnn.conv2d followed by a requantize operator](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651/17). This makes it impossible to fuse requantization into the main compute, which explains why we cannot fully achieve the 3% improvement.

**Precision**

There are corner cases in which the intrinsic implementation will have a +1/-1 error compared to the default TVM implementation. This is because all the computation is done in 32 bits (as opposed to the default 64-bit implementation), introducing a rounding error for some edge cases (a small Python model of the 64-bit reference is given at the end of this post).

# PR

The PR for this RFC is being cleaned up and will be submitted on GitHub soon.
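For reference, the 64-bit rounding behaviour that the 32-bit Armv8-A path is compared against (and that gives rise to the +1/-1 corner cases mentioned above) can be modelled in a few lines of plain Python. This is only an illustrative sketch: the function name and scalar signature are ours, and it assumes a non-negative right shift (the negative-shift case is handled separately in the Relay lowering).

```
def fixed_point_multiply_ref(x, multiplier, right_shift):
    """64-bit reference model of fpm(x, m, s): multiply in wide precision,
    then round to nearest with ties broken upward (assumes right_shift >= 0)."""
    total_right_shift = right_shift + 31
    prod = x * multiplier                       # Python ints are unbounded, mirroring the int64 path
    nudge = 1 << (total_right_shift - 1)        # +0.5 in fixed point, so the
    return (prod + nudge) >> total_right_shift  # floor shift rounds to nearest, ties up


# Example: with multiplier = 2**30 (i.e. 0.5 in Q31) and right_shift = 0,
# the op computes round(x * 0.5) with ties broken upward:
#   fixed_point_multiply_ref( 3, 2**30, 0) ==  2   # round( 1.5) ->  2
#   fixed_point_multiply_ref(-3, 2**30, 0) == -1   # round(-1.5) -> -1
```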