Hi Animesh,
The problem is that I need padding added in the middle of my TIR computation,
on my (transformed) data tensor.
I.e., something like
```
A1 = im2col(A)
A2 = pad(A1)
k = te.reduce_axis((0, K), name="k")
C_padded = te.compute([M, N], lambda i, j: te.sum(A2[i, k] * B[k, j], axis=k))
C = unpad(C_padded) + requantization
```
Then I tile on `C` and tensorize.
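For context, here is a hedged stand-alone TE sketch of that pipeline (shapes, dtypes
and the `A2`/`B_pad`/`C_padded` names are hypothetical; the pad/unpad steps are
written as plain `te.compute` stages):
```
# A minimal, hypothetical sketch: pad the im2col'd matrix so rows are a
# multiple of 4 and the reduction dim a multiple of 16, run the GEMM on the
# padded tensors, then slice ("unpad") back to the original shape.
import tvm
from tvm import te

M, K, N = 62, 200, 30          # hypothetical unpadded sizes after im2col
M_pad = ((M + 3) // 4) * 4     # rows rounded up to a multiple of 4
K_pad = ((K + 15) // 16) * 16  # reduction dim rounded up to a multiple of 16

A1 = te.placeholder((M, K), dtype="int8", name="A1")  # im2col'd data
B = te.placeholder((K, N), dtype="int8", name="B")    # weights

# Padding "in the middle" of the pipeline: zero-fill beyond the original extents.
A2 = te.compute(
    (M_pad, K_pad),
    lambda i, k: tvm.tir.if_then_else(
        tvm.tir.all(i < M, k < K), A1[i, k], tvm.tir.const(0, "int8")
    ),
    name="A2",
)
B_pad = te.compute(
    (K_pad, N),
    lambda k, j: tvm.tir.if_then_else(k < K, B[k, j], tvm.tir.const(0, "int8")),
    name="B_pad",
)

k = te.reduce_axis((0, K_pad), name="k")
C_padded = te.compute(
    (M_pad, N),
    lambda i, j: te.sum(A2[i, k].astype("int32") * B_pad[k, j].astype("int32"), axis=k),
    name="C_padded",
)

# Unpad back to the original shape; requantization would be fused here.
C = te.compute((M, N), lambda i, j: C_padded[i, j], name="C")
```
Note that once `C` only reads the first `M` rows of `C_padded`, bound inference is
free to shrink the padded stages again, which is the issue described in the next post.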
Hi all,
In my effort to accelerate AArch64 through tensorization, I ran into an
issue.
Basically, I am padding my input tensor to let `tensorize` work (I need rows
to be a multiple of 4 and cols to be a multiple of 16).
However, bound inference removes padding (since it is not used) and
Hi @anijain2305,
Yes, they are fused together, but at the end.
`nn.conv2d` is usually implemented as three compute nodes: `pack + core + unpack`.
The requantization operator is fused after the `unpack`, while the best would
be to fuse it right after `core` (the unpack can be hard to vectorize).
However, thi
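To make the `pack + core + unpack` structure above concrete, here is a hedged,
GEMM-shaped sketch (the real conv2d compute has more axes and a different packing;
all names and shapes here are illustrative):
```
import tvm
from tvm import te

# Hypothetical GEMM-shaped "core"; real conv2d computes have more axes.
M, K, N = 64, 64, 64
A = te.placeholder((M, K), dtype="int8", name="A")
W = te.placeholder((K, N), dtype="int8", name="W")

# "pack": reorder the input into the blocked layout the inner kernel expects.
A_pack = te.compute(
    (M // 4, K, 4), lambda mo, k, mi: A[mo * 4 + mi, k], name="A_pack"
)

# "core": the main, tensorizable compute node on the packed layout.
k = te.reduce_axis((0, K), name="k")
C_pack = te.compute(
    (M // 4, N, 4),
    lambda mo, n, mi: te.sum(
        A_pack[mo, k, mi].astype("int32") * W[k, n].astype("int32"), axis=k
    ),
    name="C_pack",
)

# "unpack": back to the row-major output layout.
C = te.compute((M, N), lambda m, n: C_pack[m // 4, n, m % 4], name="C_unpack")

# Today the requantization ends up fused after C (the unpack); fusing it onto
# C_pack, right after the core node, would keep the epilogue easy to vectorize.
```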
Hi @kparzysz,
Yes, pattern matching seems hard; we should mark the given set of operations from
Relay (and use the group later).
That is why a middle-layer solution, i.e., implementing the FPM in TOPI rather
than in TIR, might be the right approach.
Hi @anijain2305,
All correct, except that the fusion problem is more related to the fact
that `qnn.conv2d` is lowered as an `nn.conv2d` followed by a `requantize`.
The best would be to fuse the requantization before the unpacking of the output
tensor (i.e., after the main compute node).
Hi @tqchen,
Thanks a lot for your comments.
Actually, I understand the first part of your comment, but I am afraid I don't
follow the rest :slight_smile:
Just to fully understand:
- About adding 0.5(factor) to the bias, what do you mean? The bias is added
before the requantization (as an
Hi @anijain2305,
Both Arm and non-Arm machines will use the same `fixed_point_multiply` Relay
operator, which will have an injective schedule associated with it, calling
into `tvm.tir.fixed_point_multiply()`.
The only difference is how `tvm.tir.fixed_point_multiply()` is implemented.
O
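For context, the injective compute mentioned above might look roughly like the
sketch below; the exact signature of `tvm.tir.fixed_point_multiply()` is part of
the proposal, so treat this as an assumption rather than the final API:
```
from tvm import te, tir

def fixed_point_multiply_compute(x, multiplier, shift):
    # Element-wise call into the (proposed) TIR intrinsic. Because the op is
    # injective, each target's stock injective schedule can tile/vectorize it;
    # only the lowering of the intrinsic itself differs between Arm and non-Arm.
    return te.compute(
        x.shape,
        lambda *idx: tir.fixed_point_multiply(x(*idx), multiplier, shift),
        name="fixed_point_multiply",
    )
```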
# Introduction and motivation
Mathematically, the fixed point multiplication (FPM) can be described as:
`fpm(x,m,s) = round(x*m*2^(s-31))`
In this expression:
* `x` is the quantized value to multiply, and `m` and `s` [are an integer
multiplier and a shift](https://arxiv.org/pdf/1712.05877.pdf).
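As a sanity check of the formula (ignoring saturation and the exact rounding
convention, which the rest of this RFC deals with), a plain-Python reference
could look like:
```
def fpm_reference(x, m, s):
    # Float reference of round(x * m * 2^(s-31)); for illustration only.
    return round(x * m * 2.0 ** (s - 31))

def fpm_integer(x, m, s):
    # Same value with integer-only ops: widen to the 64-bit product, then apply
    # a rounding right shift by (31 - s). Assumes s < 31; saturation and the
    # tie/negative rounding conventions are deliberately ignored here.
    right_shift = 31 - s
    rounding = 1 << (right_shift - 1)
    return (x * m + rounding) >> right_shift
```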
# Motivation
In the current state, TVM float32 performance for armv8 architectures is
comparable to frameworks like TFLite (which we will use as a reference throughout
this RFC). However, our analysis shows that pre-quantized networks (i.e., when
data and/or weights are transformed from float32