Hi @tqchen,
I will try to comment sporadically, since this is a project I prototyped (and
enjoyed :) ) while I was at Arm.
If I understand your comment correctly, what @MeeraN7 is doing is closer to
what you are proposing. Instead of transforming a loop into a Ramp, and passing
the ramp "as i
Hi all,
Thanks for the interesting discussion! So, we all agree that there are three
points here:
* Backend API
* Calling convention
* Runtime API
As things stand today, memory allocation is part of the backend API. This will
change with global memory planning, but for now I would tend to ski
FYI: I will be out for Easter holidays until Tuesday (so I will be replying
back to any comments as soon as I come back :slight_smile: )
Also, a side comment: I will be out for Easter holidays until Tuesday (so I
will be replying back to any comments as soon as I come back :slight_smile: )
Hi all,
I just published the AOT PR upstream: https://github.com/apache/tvm/pull/7785.
It has some conflicts, probably due to the `CompileEngine` refactoring, which I
will fix soon. I just wanted to let you all start having a look.
@stoa I am wondering how much of your work can use the A
Hi all,
I was finally able to get a first version of the AOT work into a PR upstream.
## PR
You can find the PR here: https://github.com/apache/tvm/pull/7785
At this stage, I gladly accept any feedback on things that can be improved in
the PR or on issues I might have overlooked. Please, help
Hi Andrew,
> for AOT runtime I agree we do not need JSON parsing or any of the underlying
> facilities it brings. However, given it seems like you’re planning to reuse
> the C-runtime memory allocator and interfaces in include/tvm/crt/platform.h,
> I think it would be great to continue using
Hi @comaniac,
May I ask how the graph ends up as `nn.conv2d + nn.relu + nn.conv2d +
nn.relu`? Is the graph going through a BYOC kind of partitioning (sorry if the
question is naive)?
As for S1 vs S2, could we do both? Use a heuristic like "ignore the task
without any call node" and th
Hi all,
I am trying to understand the role of the LLVM auto-vectorizer in TVM. Indeed,
in `codegen_llvm.cc` we explicitly set:
```
builder.LoopVectorize = true;
builder.SLPVectorize = true;
```
And I am trying to determine to what extent TVM relies on LLVM
auto-vectorization.
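For context, here is a minimal sketch (an assumed experiment, not code from TVM itself) contrasting LLVM auto-vectorization with TVM's own explicit vectorization, which emits `Ramp` nodes in the TIR before LLVM ever sees the module:
```python
import tvm
from tvm import te

# Explicitly vectorize the inner loop on the TVM side; the lowered TIR
# contains Ramp/broadcast nodes, so LLVM receives vector IR instead of
# having to auto-vectorize a scalar loop.
A = te.placeholder((1024,), name="A")
B = te.compute((1024,), lambda i: A[i] * 2.0, name="B")
s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=8)
s[B].vectorize(xi)
print(tvm.lower(s, [A, B], simple_mode=True))
```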
### Wh
Maybe I am wrong, but are you sure that when `cfg.is_fallback`, parameters like
`cfg['tile_co']` are not defined? We usually set them to some default values (I
think). But even if we don't set them, IIUC they will get "some" value among
the possible ones. Am I missing something?
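For reference, this is the pattern I have in mind (a sketch assuming the usual TOPI idiom; the split name and factors are hypothetical):
```python
def _schedule_conv(cfg, s, output):
    # cfg: an AutoTVM config; s: the schedule; output: the op being tiled.
    co = s[output].op.axis[-1]
    cfg.define_split("tile_co", co, num_outputs=2)
    if cfg.is_fallback:
        # Even untuned entities carry *a* default value, but schedules
        # often pin a sensible one explicitly:
        cfg.fallback_split("tile_co", [-1, 8])
    oco, ico = cfg["tile_co"].apply(s, output, co)
    return oco, ico
```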
cc: @anijain2305, @FrozenGene, @matt-arm, @ramana-arm
## Introduction and motivation
This RFC is the third set of optimizations to enhance quantized convolution on
Arm architectures. To give a brief summary:
* Basic Armv8-A convolution implementation (through gemm):
https://discuss.tvm.apache.org/t/rfc-improve-quantized-convolution-performance-f
Hi @FrozenGene,
I think I see why we don't want to change the layout when there is no workload
(no workload means we don't even know the strategy, I think). What I am missing
is why we don't want to change the layout when `cfg.is_fallback`. In that case,
the strategy is defined, so we know how the weigh
Hi @FrozenGene, @anijain2305
I can confirm that this works :partying_face:! Very good! Now we can implement
algorithms like QNNPack and let the tuner try them together! Thanks to both of you!
As for the API change, I agree with @FrozenGene that maybe it would be cleaner
adding `tinfos` to the `
I got a bit confused above, sorry. It is not about the `inputs` but about the
`tinfos`.
Just to avoid any additional confusion, I printed the types of the relevant
variables:
**conv2d_alter_op(attrs, inputs, tinfos, out_type)**
```
print(type(inputs[0]))
#
print(type(tinfos[0]))
```
Thanks for the reply, @FrozenGene!
The signatures of the two functions are:
```
def _alter_conv2d_layout(attrs, inputs, types, out_type):
```
```
def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
```
While they look similar, `inputs` in `_alter_conv2d_layout` contains actual
`Tensor`s
cc @anijain2305 @ramana-arm @FrozenGene (we had this discussion before)
Hi all,
I am trying to improve quantized performance for memory-bound operators (e.g.,
depthwise or 1x1 convolutions with small shapes).
### Bottom line question
Is there any way we can know the strategy picked by the autotuner during the
legalization pass of a quantized convolution (qnn.co
From what I see, in `tvmc.compiler`, `export_library()` is called with a
`mod.so` input.
I agree we could generate the `tar` file directly, but I think this was done to
avoid storing the `.c` files (@leandron will know more about this than me).
As for storing directly in the dylib, I am not
Hi @tqchen,
`tvmc` saves the `.so`, `.params`, and `.json` files directly into the `.tar`
file it generates. This happens in `tvmc/compiler.py`. I might be wrong, but
probably this is because it doesn't want to store the `.c` files in the final
artifact (@leandron, can you confirm this?).
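For readers following along, a minimal sketch of that packaging flow (the names `lib`, `graph_json`, and `params` are assumed inputs; the real logic lives in `tvmc/compiler.py`):
```python
import tarfile
from tvm import relay

# Export the compiled module as a dylib (so no .c sources are kept),
# then bundle it with the graph and the weights into one tar artifact.
lib.export_library("mod.so")
with open("mod.json", "w") as f:
    f.write(graph_json)
with open("mod.params", "wb") as f:
    f.write(relay.save_param_dict(params))
with tarfile.open("module.tar", "w") as tar:
    for name in ("mod.so", "mod.json", "mod.params"):
        tar.add(name)
```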
Hi @aca88,
The object file produced by `tvmc` does not necessarily include the C runtime.
The `--bare-metal` flag name just refers to the fact that it is mostly useful
on a bare-metal target.
Anyway, to avoid confusion, I think `--object-file` might be a better choice
:slight_smile:
cc: @leandron, @ramana-arm
## Motivation
Currently `tvmc` will only produce a dynamic library version of the network,
i.e., a `.so` file stored alongside the other artifacts. This library is
usually dynamically linked into other applications.
With this change we want to add a flag to `tvmc` to get an object file (i.e.,
cc @anijain2305, @FrozenGene, @ramana-arm
## Motivation
In recent RFCs we successfully boosted convolution performance on native
Armv8-A architectures. When using Armv8.2-A and above ISAs, developers are
provided with a richer set of instructions, among which the dot-product
instruction `udot` (or `sdot`) can be particularly useful
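As a reference point, this is the per-lane arithmetic `udot` performs (a plain-Python model of the ISA semantics, not TVM code):
```python
def udot_lane(acc, a4, b4):
    # One 32-bit accumulator lane: the dot product of four uint8 values
    # from each source vector is accumulated into the existing value.
    assert len(a4) == len(b4) == 4
    return acc + sum(int(a) * int(b) for a, b in zip(a4, b4))

# One udot lane replaces four widening multiplies plus four adds:
print(udot_lane(0, [1, 2, 3, 4], [5, 6, 7, 8]))  # 70
```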
Hi Animesh,
The problem is that I need padding added in the middle of TIR on my
(transformed) data tensor.
I.e., something like
```
A1 = im2col(A)
A2 = pad(A1)
k = te.reduce_axis((0, K), "k")
C_padded = te.compute([M, N], lambda i, j: te.sum(A2[i, k] * B[k, j], axis=k))
C = unpad(C_padded) + requantization
```
Then I tile on `C` and tensorize o
Hi all,
In my effort to accelerate AArch64 through tensorization, I ran into an issue.
Basically, I am padding my input tensor to let `tensorize` work (I need rows
to be a multiple of 4 and cols to be a multiple of 16).
However, bound inference removes the padding (since it is not used) and
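For concreteness, a sketch of the padding step (shapes are made up, and `topi.nn.pad` is the stock padding helper, not necessarily what my branch uses):
```python
import tvm
from tvm import te, topi

M, K = 6, 20
A = te.placeholder((M, K), name="A")
pad_m = (4 - M % 4) % 4     # rows up to a multiple of 4  -> 2
pad_k = (16 - K % 16) % 16  # cols up to a multiple of 16 -> 12
A_pad = topi.nn.pad(A, [0, 0], [pad_m, pad_k], pad_value=0, name="A_pad")
```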
Hi @anijain2305,
Yes, they are fused together, but at the end.
`nn.conv2d` is usually implemented as three compute nodes: `pack+core+unpack`.
The requantization operator is fused after the `unpack`, while it would be best
to fuse it after `core` (unpack can be hard to vectorize).
However, thi
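To make the three-node structure concrete, here is a toy `te` version (shapes, dtypes, and the trivial pack/unpack are illustrative only, not the real TOPI implementation):
```python
import tvm
from tvm import te

M = N = K = 16
A = te.placeholder((M, K), name="A", dtype="int16")
B = te.placeholder((K, N), name="B", dtype="int16")
# pack: layout transform of the input (identity here, just to show the node)
packed = te.compute((M, K), lambda i, k: A[i, k], name="pack")
r = te.reduce_axis((0, K), name="k")
# core: the main compute node, the easy one to vectorize
core = te.compute(
    (M, N),
    lambda i, j: te.sum(packed[i, r].astype("int32") * B[r, j].astype("int32"), axis=r),
    name="core",
)
# unpack: layout restore; today requantize fuses after this node
out = te.compute((M, N), lambda i, j: core[i, j], name="unpack")
```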
Hi @kparzysz,
Yes, pattern matching seems hard; we should mark the given set of operations
from Relay (and use the group later).
That is why a middle-layer solution, i.e., implementing the FPM in TOPI rather
than TIR, might be the right approach.
Hi @anijain2305,
All correct, except that the problem with fusion is more related to the fact
that `qnn.conv2d` is lowered to a `nn.conv2d` followed by a `requantize`.
The best would be to fuse the requantization before the unpacking of the output
tensor (i.e., after the main compute node
Hi @tqchen,
Thanks a lot for your comments.
Actually, I understand the first part of your comment, but I am afraid I don't
follow the rest :slight_smile:
Just to fully understand:
- About adding 0.5 (the rounding factor) to the bias, what do you mean? The
bias is added before the requantization (as an
Hi @anijain2305,
Both Arm and non-Arm machines will use the same `fixed_point_multiply` relay
operator, which will have an injective schedule associated with it, calling
into `tvm.tir.fixed_point_multiply()`.
The only difference is how the `tvm.tir.fixed_point_multiply()` is implemented.
O
# Introduction and motivation
Mathematically, the fixed point multiplication (FPM) can be described as:
`fpm(x,m,s) = round(x*m*2^(s-31))`
In this expression:
* `x` is the quantized value to multiply, and `m` and `s` [are an integer
multiplier and a shift](https://arxiv.org/pdf/1712.05877.pd
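A direct NumPy reference model of that expression (a float sketch for experimentation, not the TIR implementation):
```python
import numpy as np

def fpm(x, m, s):
    # round(x * m * 2^(s - 31)): m approximates a real-valued scale in
    # Q0.31 fixed point, and s is the additional shift.
    return np.round(x * m * 2.0 ** (s - 31))

# Example: m ~= 0.5 * 2^31 and s = 0 roughly halves x.
print(fpm(100, 2**30, 0))  # 50.0
```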
Hi @FrozenGene, @anijain2305,
Any update on this review?
Also, is there a way to retrigger the tests? Or should I contact someone in
particular?
Thanks
It actually seems related to:
https://github.com/apache/incubator-tvm/issues/5827
Hi @FrozenGene,
Thanks for the review!
I applied your changes, but I get a (seemingly) unrelated test failure.
Could you please double-check and let me know whether it has anything to do
with my changes?
Thanks
@anijain2305, thanks for the review! About getting rid of the legalization, I
would not do that for now. It is in my backlog to come back to this issue and
try to retrieve the strategy from the legalization pass. This should give us
more optimization options. If that turns out not to be possible,
Hi @FrozenGene,
I gave it another go, but switching the legalization based on the strategy
seems very hard (since we would need the auto-tuner to pick the best data type
for us). So for now, we have to make do with the `_alter_conv2d_layout`
workaround and try to think a bit more about how we can infer th
Hi @FrozenGene,
I agree that different strategies should be available to the auto-tuner. See if
the proposed solution is good enough for you (at least as a temporary
workaround). For Armv7-A or NCHW, nothing changes: we follow exactly the
previous path.
For Armv8-A and NHWC we don't convert
So I mean to add a `convert_data_type` pass that is similar to
`alter_op_layout` but converts the datatype (and we can do something like `if
topi_impl == 'spatial_nhwc': convert to int16`).
This doesn't seem possible directly in `alter_op_layout`, because only the
shapes are passed to that funct
Hi @FrozenGene,
The idea of adding the algorithm name to the attributes would work if the
legalization step were run after we pick the strategy. It is instead run
before, so it is unaware of the strategy picked.
Maybe we could add a new pass that runs based on the strategy? Or we can hack
in `
Hi @FrozenGene,
Just to clarify: I am enjoying the discussion, and since the optimization space
is wild, I agree that it is worth evaluating different approaches.
* About the Raspberry Pi + mobilenet v2, good to know you are working on
Armv8-A (sorry to have assumed otherwise). However, there is still th
Hi @FrozenGene,
About the code changes:
1) It will be hard to do this. The point is that the legalization is done in
Relay before picking the strategy (thus, it is unaware of the strategy picked).
To keep both legalizations I need to somehow pass information from the strategy
(e.g., the name o
Hi @FrozenGene,
Thanks a lot for your comments. I will address general replies here, and code
comments in a separate reply.
* I did read your Discuss
[post](https://discuss.tvm.ai/t/tflite-and-tvm-comparison-for-quantized-models/6577/4),
but I thought the work was orthogonal to this one. M
CC: @u99127 @anijain2305
### RFC
This PR is based on the following RFC:
https://discuss.tvm.ai/t/rfc-improve-quantized-convolution-performance-for-armv8-architectures/6920
### High level description of the submission
The main algorithm lives in:
* topi/python/topi/arm_cpu/conv2d_gemm.py (schedule)
* topi/python/topi/arm_
# Motivation
In its current state, TVM float32 performance on Armv8 architectures is
comparable to frameworks like TFLite (which we will use as a reference
throughout this RFC). However, our analysis shows that pre-quantized networks
(i.e., when data and/or weights are transformed from float32