[TVM Discuss] [Development] Loop partitioning, padding and tensorization

2020-08-28 Thread Giuseppe Rossini via TVM Discuss
Hi Animesh, The problem is that I need padding added in the middle of TIR on my (transformed) data tensor, i.e., something like:
```
A1 = im2col(A)
A2 = pad(A1)
C_padded = te.compute([M, N], lambda i, j: sum(A2[i, k] * B[k, j], k))
C = unpad(C_padded) + requantization
```
Then I tile on `C` and tensorize o
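For concreteness, here is a minimal, self-contained TE sketch of that pad → GEMM → unpad pattern; the sizes, dtypes, and names (`M`, `K`, `N`, `A1`, `A2`, `C_padded`) are illustrative, not taken from the actual schedule:

```python
import tvm
from tvm import te

# Hypothetical GEMM sizes after im2col; rows padded to a multiple of 4,
# reduction axis padded to a multiple of 16 for the tensor intrinsic.
M, K, N = 62, 30, 64
M_pad = ((M + 3) // 4) * 4
K_pad = ((K + 15) // 16) * 16

A1 = te.placeholder((M, K), name="A1", dtype="int8")   # im2col'd data
B = te.placeholder((K_pad, N), name="B", dtype="int8")

# Zero-pad A1 so every inner tile seen by the intrinsic has full size.
A2 = te.compute(
    (M_pad, K_pad),
    lambda i, k: tvm.tir.if_then_else(
        tvm.tir.all(i < M, k < K), A1[i, k], tvm.tir.const(0, "int8")
    ),
    name="A2",
)

k = te.reduce_axis((0, K_pad), name="k")
C_padded = te.compute(
    (M_pad, N),
    lambda i, j: te.sum(A2[i, k].astype("int32") * B[k, j].astype("int32"), axis=k),
    name="C_padded",
)

# "Unpad" by only computing the valid output region.
C = te.compute((M, N), lambda i, j: C_padded[i, j], name="C")
```

As the thread below explains, the difficulty is that once only `C` is computed, bound inference shrinks `A2` back to the unpadded extents.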

[TVM Discuss] [Development] Loop partitioning, padding and tensorization

2020-08-28 Thread Giuseppe Rossini via TVM Discuss
Hi all, In my effort to accelerate AArch64 through tensorization, I ran into an issue. Basically, I am padding my input tensor to let `tensorize` work (I need rows to be a multiple of 4 and cols to be a multiple of 16). However, bound inference removes the padding (since it is not used) and
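To illustrate the constraint, here is a minimal, self-contained schedule sketch with hypothetical sizes and a placeholder intrinsic name (not the actual conv2d schedule): `tensorize` needs the inner tile to be exactly 4×16, which is why the iteration space has to be rounded up first.

```python
import tvm
from tvm import te

M, N = 64, 128  # already multiples of 4 and 16 (hypothetical)
A = te.placeholder((M, N), name="A", dtype="int32")
B = te.compute((M, N), lambda i, j: A[i, j] * 2, name="B")

s = te.create_schedule(B.op)
i, j = s[B].op.axis
io, ii = s[B].split(i, factor=4)    # rows in tiles of 4
jo, ji = s[B].split(j, factor=16)   # cols in tiles of 16
s[B].reorder(io, jo, ii, ji)
# s[B].tensorize(ii, my_4x16_intrin)  # hypothetical intrinsic: it needs full 4x16
#                                     # tiles, hence the padding of the input tensor
print(tvm.lower(s, [A, B], simple_mode=True))
```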

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
Hi @anijain2305, Yes, they are fused together, but at the end. `nn.conv2d` is usually implemented as three compute nodes: `pack+core+unpack`. The requantization operator is fused after the `unpack`, while the best would be to fuse it after `core` (unpack can be hard to vectorize). However, thi
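A minimal, runnable TE sketch of that three-stage structure (the names and the 4-row interleaving factor are illustrative, not the actual topi implementation):

```python
import tvm
from tvm import te

M, K, N = 64, 64, 64
A = te.placeholder((M, K), name="A", dtype="int8")
W = te.placeholder((K, N), name="W", dtype="int8")

# pack: interleave rows of A into 4-row tiles
A_pack = te.compute((M // 4, K, 4), lambda mo, k, mi: A[mo * 4 + mi, k], name="A_pack")

# core: GEMM on the packed layout (the stage one would vectorize/tensorize)
r = te.reduce_axis((0, K), name="r")
C_pack = te.compute(
    (M // 4, N, 4),
    lambda mo, n, mi: te.sum(
        A_pack[mo, r, mi].astype("int32") * W[r, n].astype("int32"), axis=r
    ),
    name="C_pack",
)

# unpack: back to the plain (M, N) layout; today requantize is fused after this stage
C = te.compute((M, N), lambda m, n: C_pack[m // 4, n, m % 4], name="C")
```

Fusing the requantization into `C_pack` (before the unpack) would keep it on the packed, vectorization-friendly layout, which is the point being made above.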

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
Hi @kparzysz, Yes, pattern matching seems hard; we would need to mark the given set of operations from Relay (and use the group later). That is why a middle-layer solution, i.e., implementing the fpm in topi rather than tir, might be the right approach.

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
Hi @anijain2305, All correct, except that the problem with fusion is more related to the fact that `qnn.conv2d` is lowered as an `nn.conv2d` followed by a `requantize`. The best would be to fuse the requantization before the unpacking of the output tensor (i.e., after the main compute node

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
Hi @tqchen, Thanks a lot for your comments. Actually, I understand the first part of your comment, but I am afraid I don't follow the rest :slight_smile: Just to fully understand:
- About adding 0.5 (factor) to the bias, what do you mean? The bias is added before the requantization (as an

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
Hi @anijain2305, Both Arm and non-Arm machines will use the same `fixed_point_multiply` relay operator, which will have an injective schedule associated with it, calling into `tvm.tir.fixed_point_multiply()`. The only difference is how `tvm.tir.fixed_point_multiply()` is implemented. O

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
# Introduction and motivation

Mathematically, the fixed point multiplication (FPM) can be described as:

`fpm(x, m, s) = round(x * m * 2^(s-31))`

In this expression:
* `x` is the quantized value to multiply, and `m` and `s` [are an integer multiplier and a shift](https://arxiv.org/pdf/1712.05877.pd
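For reference, a small Python sketch of that formula using only integer arithmetic; the helper name and the round-half-away-from-zero convention are assumptions for illustration, not something fixed by the RFC text above.

```python
def fpm_reference(x: int, m: int, s: int) -> int:
    """Reference for fpm(x, m, s) = round(x * m * 2**(s - 31)) in pure integer math.

    Assumes 31 - s > 0, so the scaling reduces to a single right shift, and rounds
    halves away from zero.
    """
    prod = x * m                 # needs 64 bits in C; Python ints are unbounded
    shift = 31 - s               # total right shift implementing * 2**(s - 31)
    nudge = 1 << (shift - 1)     # +0.5 in fixed point, for rounding
    if prod >= 0:
        return (prod + nudge) >> shift
    return -((-prod + nudge) >> shift)

# e.g. with m = 2**30 and s = 1 the overall scale is exactly 1, so x is unchanged:
assert fpm_reference(100, 1 << 30, 1) == 100
```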

[TVM Discuss] [Development/RFC] [RFC] Improve quantized convolution performance for armv8 architectures

2020-06-09 Thread Giuseppe Rossini via TVM Discuss
# Motivation

In the current state, TVM float32 performance for armv8 architectures is comparable to frameworks like TFLite (which we will use as a reference throughout this RFC). However, our analysis shows that pre-quantized networks (i.e., when data and/or weights are transformed from float32