# Motivation

In its current state, TVM float32 performance on armv8 architectures is comparable to frameworks like TFlite (which we will use as a reference throughout this RFC). However, our analysis shows that pre-quantized networks (i.e., networks whose data and/or weights have been transformed from float32 to int8 or uint8) are significantly slower than TFlite. The table below shows a summary for inception_v3:

|Tuning|Threads|TFlite/TVM (time ratio)|TVM/TFlite (time ratio)|
| --- | --- | --- | --- |
|un-tuned|1|0.3663|2.77|
|tuned (x86 schedules)\*|1|0.4730|2.11|
|tuned (x86 schedules)\*|4|0.6176|1.6|

\* Note that the x86 schedules are faster than the native ARM ones

You can observe that TVM is about 2x slower than TFlite in the single-thread case, and about 60% slower when multi-threading. We found that the main bottleneck is the convolution algorithm used: indeed, analyzing the heaviest convolution in isolation, we observed the same order of performance degradation.

The main goal of this RFC is to address this issue and provide a better 
convolution algorithm for pre-quantized networks.

# Background

Let's consider the convolution strategy currently used in `topi.arm_cpu.conv2d_spatial_pack_nchw`. The x86 schedule is similar, but it also packs the channels in batches to be more cache efficient (and indeed performance is slightly better). This is an NCHW convolution, which means that in order to run it we need to convert TFlite's native NHWC layout to NCHW.

The convolution is composed of three parts:

* Transform data and weights from `int8/uint8` to `int16`.
* Transform the input tensor (assuming that the weights have already been transformed by the alter_op pass). This is basically an Im2Col transformation, i.e., the output `data_vec` is transformed from `NCHW` to `NCHW-KW-KH`, where `KH` and `KW` are `(3,3)` in this case.
* Execute the actual compute. From a high-level standpoint, this algorithm executes the convolution by accumulating the outer product of a `1x16` data tile with a `16x1` weight tile. Every element of the `1x16` data tile is replicated into a vector of 16 elements, element-wise multiplied by the `16x1` weight tile and accumulated. Since we transformed the data to `int16`, we can safely use an `smlal` instruction (signed multiply-accumulate long) to implement this multiplication/accumulation (a NumPy sketch of this pattern follows the list).
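
For concreteness, here is a minimal NumPy sketch of the accumulation pattern described in the last bullet. The tile and reduction sizes are illustrative only; this is not the actual TVM schedule:

    import numpy as np

    R = 36                    # reduction length (stands in for KH*KW*IC)
    # Illustrative int16 tiles, already widened from uint8/int8 as the current schedule does
    data = np.random.randint(0, 256, size=(R, 16)).astype(np.int16)     # one 1x16 data tile per step
    weights = np.random.randint(0, 256, size=(R, 16)).astype(np.int16)  # one 16x1 weight tile per step

    acc = np.zeros((16, 16), dtype=np.int32)   # 16x16 output block of int32 accumulators
    for r in range(R):
        # Outer product: each data element is broadcast into a 16-wide vector,
        # multiplied element-wise by the weight tile and accumulated (smlal-style)
        acc += np.outer(weights[r].astype(np.int32), data[r].astype(np.int32))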

# Proposal

The main question is: **is this the best we can do without the dot-product instruction?**

Even if we don't have the dot-product instruction in the base ISA, we still have integer-specific instructions available in the AArch64 SIMD ISA:

* SADALP/UADALP: [Signed/Unsigned add and accumulate long 
pairwise](https://developer.arm.com/docs/dui0801/j/a64-simd-vector-instructions/uadalp-vector)
* SMULL/UMULL: [Signed/Unsigned Long 
Multiply](https://developer.arm.com/docs/dui0801/j/a64-simd-vector-instructions/umull-umull2-vector)
* SMULL2/UMULL2: [Upper-half signed/unsigned Long 
Multiply](https://developer.arm.com/docs/dui0801/j/a64-simd-vector-instructions/umull-umull2-vector)

Those instructions (with the addition of ADDP - [add pair](https://developer.arm.com/docs/dui0801/j/a64-simd-vector-instructions/addp-vector)) are sufficient to emulate a dot-product (a short sketch is shown below). Why is this important?

Because it gives us the possibility to remain in 8 bits: we don't need to convert the data to int16 beforehand. In turn, this means:

* Loading less data each time
* Doing more operations per byte of data loaded (i.e., increasing the arithmetic intensity of the convolution)
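
Here is a minimal NumPy model of how `umull`/`umull2`, `uadalp` and `addp` combine into a dot-product of two 16-byte vectors (a sketch of the arithmetic only, not of the actual assembly):

    import numpy as np

    a = np.random.randint(0, 256, size=16).astype(np.uint8)
    b = np.random.randint(0, 256, size=16).astype(np.uint8)

    # umull / umull2: widening 8-bit x 8-bit -> 16-bit multiplies (lower / upper half)
    prod_lo = a[:8].astype(np.uint16) * b[:8].astype(np.uint16)
    prod_hi = a[8:].astype(np.uint16) * b[8:].astype(np.uint16)

    # uadalp: add adjacent pairs of 16-bit products and accumulate into 32-bit lanes
    acc = np.zeros(4, dtype=np.uint32)
    acc += prod_lo.reshape(4, 2).sum(axis=1, dtype=np.uint32)
    acc += prod_hi.reshape(4, 2).sum(axis=1, dtype=np.uint32)

    # addp (plus a final reduction across lanes) collapses the partial sums
    assert acc.sum(dtype=np.uint32) == np.dot(a.astype(np.uint32), b.astype(np.uint32))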

## Convolution implementation strategy

As previously pointed out, in the original schedule in `topi.arm_cpu.conv2d_spatial_pack_nchw` there is a data transformation very similar to Im2Col. The core computation, instead, is very similar to a GEMM computation, even though it is coupled with some of the transformations that interleave the data (and the weights).

That stated, we made the following design choices:

1. We decided to explicitly use Im2Col + GEMM in our implementation (instead of implicitly using them as previously done). This is because a GEMM convolution is more modular: we can worry about computation and memory layout separately.

2. We picked an NHWC layout. This is because in NHWC we don't need to col2im-transform the output (i.e., the output transform is mostly a reshape and can be in-lined). NHWC also gives us the option to avoid a `ConvertLayout` pass when importing from TFlite.

3. For now, we don't introduce any tuning knobs. The idea is to first provide a good-enough general GEMM-convolution schedule. Later on, we will make the algorithm adapt to different convolutions by adding the appropriate knobs.

4. We made the GEMM convolution structure very similar to Winograd (unlike Winograd, though, the number of operations is exactly the same as for a direct convolution):

   * Transform the input (i.e., Im2Col + padding + interleave)
   * Transform the weights (i.e., reshape + padding + interleave + block_transpose)
   * Execute GEMM
   * Transform the output back (i.e., un-interleave + reshape)

The remaining part of this section is organized as follows:

1. Core algorithm (whose implementation is done in assembly and exposed through 
tensorization)
2. Input Transform
3. Weight transform
4. Output transform

## Convolution core algorithm

Input: { `a`: `4xK` slice of 8-bit elements of A,  `b'`: `4xK` slice of 8-bit 
elements of B}

Output: `c`: a `4x4` block of 32-bit elements, such that `c = a * b`

Notes:

1. This is GEMM: we don't care about strides, dilation, layout, etc. They are addressed by the Im2Col transformation.
2. The `4xK` slice `b'` is a `Kx4` slice of `B`, transposed (more about this later)
3. We assume that `K` is a multiple of 16 (if not, we need to pad the matrices `A` and `B`)

Let's go through the pseudo-code steps to compute the first row of `c`, `c[0, 
0:4]`. Remember that `c[j,i] = sum(a[j, 0:K].*b[i,0:K])`

    for k = 0:16:K
        v0 = a[0, k:k+16] // a0
        v4 = b[0, k:k+16] // b0
        v5 = b[1, k:k+16] // b1
        v6 = b[2, k:k+16] // b2
        v7 = b[3, k:k+16] // b3
     
        // Lower-half multiplication (umull works on the lower 8 bytes)
        v8.8h = umull(v0.8b, v4.8b)   // v8 = a0[0:8].*b0[0:8]
        v9.8h = umull(v0.8b, v5.8b)   // v9 = a0[0:8].*b1[0:8]
        v10.8h = umull(v0.8b, v6.8b)  // v10 = a0[0:8].*b2[0:8]
        v11.8h = umull(v0.8b, v7.8b)  // v11 = a0[0:8].*b3[0:8]

        // Accumulate
        v16.4s = uadalp(v8.8h)  // v16[0:4] += [v8[0]+v8[1],
                                //             v8[2]+v8[3],
                                //             v8[4]+v8[5],
                                //             v8[6]+v8[7]]
        v17.4s = uadalp(v9.8h)  // same as above with v9
        v18.4s = uadalp(v10.8h) // same as above with v10
        v19.4s = uadalp(v11.8h) // same as above with v11

        // Higher-half multiplication (umull2 works on the upper 8 bytes)
        v8.8h = umull2(v0.16b, v4.16b)  // v8 = a0[8:16].*b0[8:16]
        v9.8h = umull2(v0.16b, v5.16b)  // v9 = a0[8:16].*b1[8:16]
        v10.8h = umull2(v0.16b, v6.16b) // v10 = a0[8:16].*b2[8:16]
        v11.8h = umull2(v0.16b, v7.16b) // v11 = a0[8:16].*b3[8:16]

        // Accumulate again
        v16.4s = uadalp(v8.8h)
        v17.4s = uadalp(v9.8h)
        v18.4s = uadalp(v10.8h)
        v19.4s = uadalp(v11.8h)
    end
          
    // At this point:
    // v16 contains the four partial sums of a[0, 0:K].*b[0, 0:K], let's call them (a, b, c, d)
    // v17 contains the four partial sums of a[0, 0:K].*b[1, 0:K], let's call them (e, f, g, h)
    // v18 contains the four partial sums of a[0, 0:K].*b[2, 0:K], let's call them (i, j, k, l)
    // v19 contains the four partial sums of a[0, 0:K].*b[3, 0:K], let's call them (m, n, o, p)
    // Let's try to accumulate everything in v16
     
    v16.4s = addp(v16.4s, v17.4s) // v16 = (a+b, c+d, e+f, g+h)
    v17.4s = addp(v18.4s, v19.4s) // v17 = (i+j, k+l, m+n, o+p)
    v16.4s = addp(v16.4s, v17.4s) // v16 = (a+b+c+d, e+f+g+h, i+j+k+l, m+n+o+p)
     
    // Now v16 contains the sum(a[0, 0:K].*b[i,0:K]) for i =0, 1, 2, 3
    c[0:4] = v16 

The same algorithm can be repeated for the other rows of the `c` buffer. 
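
For reference, the full `4x4` block produced by this micro-kernel is equivalent to the following NumPy expression (handy as a correctness check; names and sizes are illustrative):

    import numpy as np

    K = 64                                                          # multiple of 16 (pad otherwise)
    a = np.random.randint(0, 256, size=(4, K)).astype(np.uint8)     # 4xK slice of A
    b_t = np.random.randint(0, 256, size=(4, K)).astype(np.uint8)   # Kx4 slice of B, transposed

    # c[j, i] = sum(a[j, 0:K] .* b_t[i, 0:K]), accumulated in 32 bits
    c = a.astype(np.uint32) @ b_t.astype(np.uint32).T               # 4x4 block of 32-bit results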

Some points worth noticing:

1. The `(4, 4)` block size we chose is forced by the number of registers we have (32).
2. Optimizing the register allocations is not trivial. We first need to compute 
`c[0, 0:4]` and `c[1,0:4]` (first half) and then `c[2,0:4]`, `c[3,0:4]` (second 
half) in order to not run out of registers.
3. If we are sure that `255*255 + 255*255` never appears in the accumulation (e.g., if the weights are quantized over `0:254`), we can skip one of the two `uadalp` accumulations and use the sequence `umull` → `umlal2` → `uadalp`. TFlite uses this assumption, but converts the data to int8 beforehand (which means that `-128*-128 + (-128*-128)` never appears), thus using the sequence `smull` → `smlal2` → `sadalp` (see the sketch after this list). We might expose this as a `--fast-math` option.
4. The dot-product would simply ease the register pressure. Basically, instead 
of doing a `4x4` block, we can do a `12x4` block because we don't need 
intermediate registers to save the accumulations.
5. We implement this algorithm through inline assembly that we inject directly 
in TVM through `tensorize`.
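
As a sketch of point 3 above, here is the int8 variant of the shortened sequence (`smull` → `smlal2` → `sadalp`) in NumPy terms; it is only exact when the `-128*-128 + (-128*-128)` combination can never occur, which is the assumption mentioned above (the values below are kept away from `-128` so that the assumption holds):

    import numpy as np

    a = np.random.randint(-127, 128, size=16).astype(np.int8)
    b = np.random.randint(-127, 128, size=16).astype(np.int8)

    # smull: widening lower-half multiply into 16-bit lanes
    p = a[:8].astype(np.int16) * b[:8].astype(np.int16)
    # smlal2: accumulate the upper-half products into the same 16-bit lanes.
    # This is where the no-overflow assumption matters: two int8 products can
    # reach 2 * 16384, which does not fit in int16.
    p += a[8:].astype(np.int16) * b[8:].astype(np.int16)

    # sadalp: pairwise add the 16-bit sums into 32-bit accumulators
    acc = p.reshape(4, 2).sum(axis=1, dtype=np.int32)
    assert acc.sum(dtype=np.int32) == np.dot(a.astype(np.int32), b.astype(np.int32))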

## Input transform

Remember that the input shape is of the form `(batches, IH, IW, IC)`, the output shape is `(batches, OH, OW, OC)`, while the weight shape is `(KH, KW, KI, KO)`. The relation between the input/weight/output shapes can be found [here](https://arxiv.org/pdf/1603.07285.pdf).

Input transform means getting the data ready for GEMM. The input is transformed 
in two stages:  **Im2Col**  +  **Interleaving**

**Im2Col**

I won't delve into this, as it is a well-known transformation (for instance, see [here](https://cs231n.github.io/convolutional-networks/#conv)). It's important to note that padding, dilation and strides are all handled by this transformation. The result is `A`, an `(M, K)` matrix where `M = OH*OW` and `K = KH*KW*IC`.

**Interleaving (or array packing)**

Interleaving is the common process of placing the elements accessed by GEMM close to each other (instead of strides away). The following picture shows how we interleave the matrix `A`.

![InputTransform|690x217](upload://jCUTChnZqVq6n75cR8gcaiZLhoR.png) 

Note that we need to be sure that the dimensions `(M, K)` of `A` are multiples of 4 and 16, respectively. We achieve that by padding `A`.

The compute node that we use to achieve this sort of transformation is the 
following:

    A_interleaved = te.compute(
        (batches, M_padded // 4, K_padded // 16, 4, 16),
        lambda b, x, y, z, w: A[b, z + 4 * x, w + 16 * y],
        name='A_interleaved')

So now `A_interleaved` is a `[batches, M//4, K//16, 4, 16]` tensor where `A_interleaved[b, 0, 0, :, :]` represents the `(0, 0)` block, `A_interleaved[b, 0, 1, :, :]` represents the `(0, 1)` block, and so on. This is very similar to the array packing described in the [GEMM tutorial](https://docs.tvm.ai/tutorials/optimize/opt_gemm.html#array-packing).
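
In plain NumPy terms, this interleaving is just a reshape plus a transpose of the already padded matrix (a sketch with made-up sizes):

    import numpy as np

    batches, M_padded, K_padded = 1, 8, 32
    A = np.arange(batches * M_padded * K_padded, dtype=np.uint8).reshape(batches, M_padded, K_padded)

    # A_interleaved[b, x, y, z, w] == A[b, 4 * x + z, 16 * y + w]
    A_interleaved = (A.reshape(batches, M_padded // 4, 4, K_padded // 16, 16)
                      .transpose(0, 1, 3, 2, 4))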

## Weight transform

Weight transform means getting the weights ready for GEMM. The weights are 
transformed in multiple stages:  **Flattening**  +  **Interleave_transpose**

**Flattening**

In the case of the weights, we don't need any Im2Col. We only need to flatten a `[KH, KW, KI, KO]` tensor into a `[KH*KW*KI, KO]` matrix. This requires a very simple compute node, which can easily be in-lined.

**Interleaving and block-transposing**

The weight transform is slightly different from the Input transform (and from 
the traditional array packing) since we want to transpose the blocks in order 
to execute the pseudo-dot-product algorithm described above. 

The idea of the transformation is described in the following image:

![WeightTransform|690x215](upload://eb0FDz5vS7rPpBbKymzzKjLTTDX.png) 

As you can see, we move from a `[K, N]` matrix to a `[N/4, K/16, 4, 16]` matrix, de facto transposing each block of the weight matrix. Also in this case, if `K` or `N` are not multiples of 16 and 4, respectively, we need to pad `B`.

The compute node we use to implement this transformation is the following:

    B_interleaved_t = te.compute(
        (N_padded // 4, K_padded // 16, 4, 16),
        lambda x, y, z, w: kernel_flat[w + 16 * y, z + 4 * x],
        name='weight_block_reshape')

As in the Input Transform case, `B_interleaved_t` is a `[N/4, K/16, 4, 16]` tensor, but here the block `B_interleaved_t[0, 0, :, :]` is the first `4x16` block of `B'` (i.e., `B` transposed).
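
Again in NumPy terms, this is a transpose of the flattened weights followed by the same kind of block packing (a sketch, assuming `kernel_flat` has already been padded to `K_padded x N_padded`):

    import numpy as np

    K_padded, N_padded = 32, 8
    kernel_flat = np.arange(K_padded * N_padded, dtype=np.uint8).reshape(K_padded, N_padded)

    # B_interleaved_t[x, y, z, w] == kernel_flat[16 * y + w, 4 * x + z]
    B_interleaved_t = (kernel_flat.T
                       .reshape(N_padded // 4, 4, K_padded // 16, 16)
                       .transpose(0, 2, 1, 3))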

One last thing to note is that we can offload the weight transformation, since 
the weights are known at compile time. So all the flattening, interleaving and 
reshaping will happen before the graph is executed. 

## Output transform

Lastly, we need to transform our GEMM output. Those transformations are a bit 
simpler than the ones on the inputs and the weights.

The output from GEMM is a `[batches, M//4, N//4, 4, 4]` tensor and there are two separate transforms that need to happen: **Unpacking** + **Unflattening**

**Unpacking**

We need to reshape the output back to a plain matrix `[batches, M, N]`. This is 
achieved by the following intuitive compute node:

    C = te.compute(
        (batches, M, N),
        lambda b, x, y: C_interleaved[b, x // 4, y // 4, idxm(x, 4), idxm(y, 4)],
        name="C", tag='injective')

One nice thing to note is that we can declare the node as 'injective'. This means that other injective transformations can be attached to this one; those other transformations can be, for instance, the requantization steps (shift, add, sum, mul, etc.).
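
In NumPy terms, the unpacking is again a transpose plus a reshape of the blocked GEMM output (a sketch with illustrative sizes):

    import numpy as np

    batches, M, N = 1, 8, 8
    C_interleaved = np.arange(batches * M * N, dtype=np.int32).reshape(batches, M // 4, N // 4, 4, 4)

    # C[b, x, y] == C_interleaved[b, x // 4, y // 4, x % 4, y % 4]
    C = C_interleaved.transpose(0, 1, 3, 2, 4).reshape(batches, M, N)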

**Unflattening**

Remember that our output `C` is now a `[batches, M, N] = [batches, OH*OW, OC]` matrix. Since we are working in NHWC, we only need to unflatten the shape. This transformation will later be computed_at the unpacking transformation shown above. The simple node to do the unflattening is the following:

    out = te.compute(
        (batches, OH, OW, OC),
        lambda b, x, y, z: C[b, y + OW * x, z],
        name='out')

# Results

We added two rows to the original table to show the improvements obtained by using the GEMM schedules for inception_v3:

|Tuning|Threads|TFlite/TVM (time ratio)|TVM/TFlite (time ratio)|Improvement|
| --- | --- | --- | --- | --- |
|un-tuned|1|0.3663|2.77||
|tuned (x86 schedules)|1|0.4730|2.11||
|tuned (x86 schedules)|4|0.6176|1.6||
|un-tuned (GEMM schedules)|1|0.9541|1.05|~2x|
|un-tuned (GEMM schedules)|4|0.93|1.07|~1.55x|

As you can see from the table, by using the new GEMM schedules we are comparable to TFlite performance, gaining roughly a 2x and a 1.5x speed-up over the previous best schedules in the single-thread and multi-thread cases, respectively.

CC: @ramana-arm, @anijain2305, @janimesh




