Re: [apache/tvm-rfcs] [RFC] Adding initial SVE implementation (#18)

2021-08-05 Thread Giuseppe Rossini
Hi @tqchen, I will try to comment sporadically, since this is a project I prototyped (and enjoyed :) ) while I was at Arm. If I understand your comment correctly, what @MeeraN7 is doing is closer to what you are proposing. Instead of transforming a loop into a Ramp, and passing the ramp "as i
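
For readers without the SVE context: TVM's existing fixed-width vectorizer rewrites the loop variable into a `tir.Ramp` node. Below is a minimal sketch of such a node (my illustration only; the proposal discussed here is precisely about not hard-coding the `lanes` value):
```
import tvm

# A Ramp node represents the vector
#   <base, base + stride, ..., base + (lanes - 1) * stride>;
# the fixed integer `lanes` is what scalable vectors would relax.
ramp = tvm.tir.Ramp(0, 1, 8)
print(ramp)
```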

[Apache TVM Discuss] [Development/RFC] Implementing AOT in TVM

2021-04-15 Thread Giuseppe Rossini via Apache TVM Discuss
Hi all, Thanks for the interesting discussion! So, we all agree that there are three points here:
* Backend API
* Calling convention
* Runtime API

As things stand today, memory allocation is part of the backend API. This will change with global memory planning, but for now I would tend to ski

[Apache TVM Discuss] [Development/RFC] Implementing AOT in TVM

2021-04-01 Thread Giuseppe Rossini via Apache TVM Discuss
FYI: I will be out for Easter holidays until Tuesday (so I will be replying to any comments as soon as I come back :slight_smile: )

[Apache TVM Discuss] [Development] [RFC] Standalone Code Generation and C Runtime for STM32 bare-metal devices

2021-04-01 Thread Giuseppe Rossini via Apache TVM Discuss
Also, a side comment: I will be out for Easter holidays until Tuesday (so I will be replying to any comments as soon as I come back :slight_smile: )

[Apache TVM Discuss] [Development] [RFC] Standalone Code Generation and C Runtime for STM32 bare-metal devices

2021-04-01 Thread Giuseppe Rossini via Apache TVM Discuss
Hi all, I just published the AOT PR upstream: https://github.com/apache/tvm/pull/7785. It has some conflicts, probably due to the `CompileEngine` refactoring, and I will fix that soon. I just wanted to let you all start having a look. @stoa I am wondering how much of your work can use the A

[Apache TVM Discuss] [Development/RFC] Implementing AOT in TVM

2021-04-01 Thread Giuseppe Rossini via Apache TVM Discuss
Hi all, I finally have a first version of the AOT work up in a PR upstream.
## PR
You can find the PR here: https://github.com/apache/tvm/pull/7785
At this stage, I gladly accept any feedback on things that can be improved in the PR or on issues I might have overlooked. Please, help

[Apache TVM Discuss] [Development/RFC] Implementing AOT in TVM

2021-03-04 Thread Giuseppe Rossini via Apache TVM Discuss
Hi Andrew,
> for AOT runtime I agree we do not need JSON parsing or any of the underlying facilities it brings. However, given it seems like you’re planning to reuse the C-runtime memory allocator and interfaces in include/tvm/crt/platform.h, I think it would be great to continue using

[Apache TVM Discuss] [Development/RFC] [RFC] A general task extraction mechanism for auto_scheduler

2020-11-12 Thread Giuseppe Rossini via Apache TVM Discuss
Hi @comaniac, May I ask how the graph ends up with `nn.conv2d + nn.relu + nn.conv2d + nn.relu`? Is the graph going through a BYOC kind of partitioning (sorry if the question is naive)? As for S1 vs S2, could we do both? Use a heuristic like "ignore the task without any call node" and th

[Apache TVM Discuss] [Development] Role of the LLVM autovectorizer in TVM

2020-11-06 Thread Giuseppe Rossini via Apache TVM Discuss
Hi all, I am trying to understand the role of the LLVM auto-vectorizer in TVM. Indeed, in `llvm_codegen.cc` we explicitly set:
```
builder.LoopVectorize = true;
builder.SLPVectorize = true;
```
And I am trying to determine to what extent TVM relies on LLVM auto-vectorization.
### Wh
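
To make the distinction concrete, here is a minimal TE sketch (using the 2020-era `te.create_schedule` API; names and shapes are mine, for illustration): with an explicit `vectorize`, TVM emits vector TIR itself; without it, the scalar loop is left for LLVM's auto-vectorizer passes.
```
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=8)
s[B].vectorize(xi)  # TVM itself emits vector TIR (Ramp/broadcast)
# Without the vectorize() call, TVM emits a plain scalar loop and
# vectorization is left entirely to LLVM's LoopVectorize/SLPVectorize.
print(tvm.lower(s, [A, B], simple_mode=True))
```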

[Apache TVM Discuss] [Development] Quantized models and legalization pass

2020-10-30 Thread Giuseppe Rossini via Apache TVM Discuss
Maybe I am wrong, but are you sure that, when `cfg.is_fallback` is set, parameters like `cfg['tile_co']` are not defined? We usually set them to some default values (I think). But even if we don't set them, IIUC they will get "some" value among the possible ones. Am I missing something?

[Apache TVM Discuss] [Development/RFC] [RFC]: Improve quantized convolution through mmla instruction

2020-10-30 Thread Giuseppe Rossini via Apache TVM Discuss
cc: @anijain2305, @FrozenGene, @matt-arm, @ramana-arm

[Apache TVM Discuss] [Development/RFC] RFC: Improve quantized convolution through mmla instructions

2020-10-30 Thread Giuseppe Rossini via Apache TVM Discuss
## Introduction and motivation
This RFC is the third set of optimizations to enhance quantized convolution on Arm architectures. To give a brief summary:
* Basic Armv8-A convolution implementation (through gemm): https://discuss.tvm.apache.org/t/rfc-improve-quantized-convolution-performance-f
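
For readers who have not met the instruction this RFC targets: the sketch below is a scalar model of the Armv8.6-A `smmla`/`ummla` semantics as I understand them (an assumption for illustration, not code from the RFC). Each 128-bit operand is read as a 2x8 matrix of 8-bit values, and a 2x2 tile of 32-bit dot products is accumulated.
```
# Scalar model (assumed, for illustration) of Armv8.6-A smmla/ummla:
# acc is a 2x2 int32 tile; a and b are 2x8 int8 tiles, with b taken
# transposed, so every output element is an 8-way dot product.
def mmla(acc, a, b):
    for i in range(2):
        for j in range(2):
            acc[i][j] += sum(a[i][k] * b[j][k] for k in range(8))
    return acc
```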

[Apache TVM Discuss] [Development] Quantized models and legalization pass

2020-10-30 Thread Giuseppe Rossini via Apache TVM Discuss
Hi @FrozenGene, I think I see why we don't want to change the layout when there is no workload (no workload means we don't even know the strategy, I think). What I am missing is why we don't want to change the layout when `cfg.is_fallback`. In that case, the strategy is defined, so we know how the weigh

[Apache TVM Discuss] [Development] Quantized models and legalization pass

2020-10-29 Thread Giuseppe Rossini via Apache TVM Discuss
Hi @FrozenGene, @anijain2305 I can confirm that this works :partying_face:! Very good! Now we can implement algorithms like QNNPack and let the tuner try them together! Thanks to both of you! As for the API change, I agree with @FrozenGene that maybe it would be cleaner to add `tinfos` to the `

[Apache TVM Discuss] [Development] Quantized models and legalization pass

2020-10-27 Thread Giuseppe Rossini via Apache TVM Discuss
I got a bit confused above, sorry. It is not about the `inputs` but about the `tinfos`. Just to avoid any additional confusion, I tried to print the types of the interesting variables in **conv2d_alter_op(attrs, inputs, tinfos, out_type)**:
```
print(type(inputs[0]))  #
print(type(tinfos[0]))
```

[Apache TVM Discuss] [Development] Quantized models and legalization pass

2020-10-26 Thread Giuseppe Rossini via Apache TVM Discuss
Thanks for the reply, @FrozenGene! The signatures of the two functions are:
```
def _alter_conv2d_layout(attrs, inputs, types, out_type):
```
```
def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
```
While they look similar, `inputs` in `_alter_conv2d_layout` contains actual `Tensor`s

[Apache TVM Discuss] [Development] Quantized models and legalization pass

2020-10-22 Thread Giuseppe Rossini via Apache TVM Discuss
cc @anijain2305 @ramana-arm @FrozenGene (we had this discussion before)

[Apache TVM Discuss] [Development] Quantized models and legalization pass

2020-10-22 Thread Giuseppe Rossini via Apache TVM Discuss
Hi all, I am trying to improve quantized performance for memory-bound operators (e.g., depthwise or 1x1 convolutions with small shapes).
### Bottom line question
Is there any way we can know the strategy picked by the autotuner during the legalization pass of a quantized convolution (qnn.co

[Apache TVM Discuss] [Development/RFC] [RFC] Optionally include object file generation in tvmc

2020-10-09 Thread Giuseppe Rossini via Apache TVM Discuss
From what I see, in `tvmc.compiler`, `export_library()` is called with a `mod.so` input. I agree we could generate the `tar` file directly, but I think this was done to avoid storing the `.c` files (@leandron will know more than me on this). As for storing directly in the dylib, I am not
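
As a point of reference, here is a minimal, self-contained sketch of the two output paths under discussion (API names per ~2020 TVM; `get_lib()` and the `.o` suffix handling are my assumptions about the version in use, so details may differ):
```
import tvm
from tvm import relay

# Toy module so the sketch runs on its own.
x = relay.var("x", shape=(4,), dtype="float32")
func = relay.Function([x], x + relay.const(1.0, "float32"))
factory = relay.build(tvm.IRModule.from_expr(func), target="llvm")

factory.export_library("mod.so")  # current tvmc output: a dynamic library
factory.get_lib().save("mod.o")   # a relocatable object file instead;
                                  # LLVM-backed modules pick the format
                                  # from the file suffix
```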

[Apache TVM Discuss] [Development/RFC] [RFC] Optionally include object file generation in tvmc

2020-10-09 Thread Giuseppe Rossini via Apache TVM Discuss
Hi @tqchen, `tvmc` saves the `.so`, `.params` and `.json` files directly into the `.tar` file it generates. This happens in `tvmc/compiler.py`. I might be wrong, but probably this is because it doesn't want to store the `.c` files in the final artifact (@leandron, can you confirm this?).

[Apache TVM Discuss] [Development/RFC] [RFC] Optionally include object file generation in tvmc

2020-10-08 Thread Giuseppe Rossini via Apache TVM Discuss
Hi @aca88, The object file produced by `tvmc` does not necessarily include the C runtime. Using a `--bare-metal` flag just refers to the fact that it is mostly useful on a bare-metal target. Anyway, to avoid confusion, I think `--object-file` might be a better choice :slight_smile:

[Apache TVM Discuss] [Development/RFC] [RFC] Optionally include object file generation in tvmc

2020-10-08 Thread Giuseppe Rossini via Apache TVM Discuss
cc: @leandron, @ramana-arm

[Apache TVM Discuss] [Development/RFC] [RFC] Optionally include object file generation in tvmc

2020-10-08 Thread Giuseppe Rossini via Apache TVM Discuss
## Motivation
Currently `tvmc` only produces a dynamic library version of the network, i.e., an `.so` file stored alongside the other artifacts. This library is usually dynamically linked into other applications. With this change we want to add a flag to `tvmc` to get an object file (i.e.,

[Apache TVM Discuss] [Development/RFC] [RFC] Accelerate quantized convolution through dot-product

2020-09-10 Thread Giuseppe Rossini via Apache TVM Discuss
cc @anijain2305, @FrozenGene, @ramana-arm

[Apache TVM Discuss] [Development/RFC] [RFC] Accelerate quantized convolution through dot-product

2020-09-10 Thread Giuseppe Rossini via Apache TVM Discuss
## Motivation
In recent RFCs we successfully boosted convolution performance on native Armv8-A architectures. When using Armv8.2-A and above ISAs, developers are provided with a richer set of instructions, among which the dot-product instruction `udot` (or `sdot`) can be particularly useful
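
For readers unfamiliar with the instruction, below is a scalar model of the `udot` semantics (my illustration, not text from the RFC): per 128-bit vector it computes four independent 4-way dot products of uint8 values and accumulates them into four uint32 lanes.
```
# Scalar model (assumed, for illustration) of Armv8.2-A udot:
# acc has 4 uint32 lanes; a and b each hold 16 uint8 values.
def udot(acc, a, b):
    for i in range(4):
        for j in range(4):
            acc[i] += a[4 * i + j] * b[4 * i + j]
    return acc
```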

[TVM Discuss] [Development] Loop partitioning, padding and tensorization

2020-08-28 Thread Giuseppe Rossini via TVM Discuss
Hi Animesh, The problem is that I need padding added in the middle of TIR on my (transformed) data tensor. I.e., something like
```
A1 = im2col(A)
A2 = pad(A1)
C_padded = te.compute([M, N], lambda i, j: sum(A2[i, k] * B[k, j], k))
C = unpad(C_padded) + requantization
```
Then I tile on `C` and tensorize o

[TVM Discuss] [Development] Loop partitioning, padding and tensorization

2020-08-28 Thread Giuseppe Rossini via TVM Discuss
Hi all, In my effort to accelerate AArch64 through tensorization, I ran into an issue. Basically, I am padding my input tensor to let `tensorize` work (I need rows to be a multiple of 4 and cols to be a multiple of 16). However, bound inference removes the padding (since it is not used) and

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
Hi @anijain2305, Yes, they are fused together, but at the end. `nn.conv2d` is usually implemented as three compute nodes: `pack + core + unpack`. The requantization operator is fused after the `unpack`, while ideally it would be fused after `core` (unpack can be hard to vectorize). However, thi
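
To make the pack/core/unpack structure concrete, here is a rough, runnable TE sketch of that three-node shape (shapes, dtypes, and the "packing" itself are simplified assumptions of mine, not the actual TOPI code):
```
import tvm
from tvm import te

M, K, N = 4, 16, 8
A = te.placeholder((M, K), dtype="int8", name="A")
W = te.placeholder((K, N), dtype="int8", name="W")
# "pack": prepare/widen the input for the core gemm.
packed = te.compute((M, K), lambda i, j: A[i, j].astype("int16"), name="pack")
k = te.reduce_axis((0, K), name="k")
# "core": the gemm itself, the easily vectorized stage.
core = te.compute(
    (M, N),
    lambda i, j: te.sum(packed[i, k].astype("int32") * W[k, j].astype("int32"), axis=k),
    name="core",
)
# "unpack": restore the output layout; requantization currently
# fuses after this stage rather than after `core`.
unpack = te.compute((M, N), lambda i, j: core[i, j], name="unpack")
```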

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
Hi @kparzysz, Yes, pattern matching seems hard; we should mark the given set of operations from Relay (and use the group later). That is why a middle-layer solution, i.e., implementing the FPM in topi rather than tir, might be the right approach

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
Hi @anijain2305, All correct, except that the problem with fusion is more related to the fact that `qnn.conv2d` is lowered as an `nn.conv2d` followed by a `requantize`. The best would be to fuse the requantization before the unpacking of the output tensor (i.e., after the main compute node

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
Hi @tqchen, Thanks a lot for your comments. Actually, I understand the first part of your comment, but I am afraid I don't follow the rest :slight_smile: Just to fully understand:
- About adding 0.5 (factor) to the bias, what do you mean? The bias is added before the requantization (as an

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
Hi @anijain2305, Both Arm and non-Arm machines will use the same `fixed_point_multiply` Relay operator, which will have an injective schedule associated with it, calling into `tvm.tir.fixed_point_multiply()`. The only difference is how `tvm.tir.fixed_point_multiply()` is implemented. O

[TVM Discuss] [Development/RFC] [RFC] Using arm intrinsics to implement fixed point multiplication in TVM

2020-07-01 Thread Giuseppe Rossini via TVM Discuss
# Introduction and motivation
Mathematically, fixed point multiplication (FPM) can be described as:
`fpm(x, m, s) = round(x * m * 2^(s - 31))`
In this expression:
* `x` is the quantized value to multiply, and `m` and `s` [are an integer multiplier and a shift](https://arxiv.org/pdf/1712.05877.pd
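
To make the formula concrete, here is a minimal integer-only sketch of the computation (my illustration, assuming `s <= 31` and plain round-half-up; real implementations use saturating instructions such as `sqrdmulh` with subtler rounding, which is what this RFC is about):
```
def fpm(x: int, m: int, s: int) -> int:
    # round(x * m * 2^(s - 31)) using a 64-bit intermediate product.
    # Assumes s <= 31, i.e. the scaling is a right shift.
    prod = x * m                 # int32 * int32 fits in 64 bits
    shift = 31 - s
    if shift == 0:
        return prod
    return (prod + (1 << (shift - 1))) >> shift  # round half up
```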

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-22 Thread Giuseppe Rossini
Hi @FrozenGene, @anijain2305, Any update on this review? Also, is there a way to retrigger the tests, or should I contact someone in particular? Thanks

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-19 Thread Giuseppe Rossini
It actually seems related to: https://github.com/apache/incubator-tvm/issues/5827

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-19 Thread Giuseppe Rossini
Hi @FrozenGene, Thanks for the review! I applied your changes, but I get a (seemingly) unrelated test failure. Could you double-check please, and let me know if this has anything to do with my changes? Thanks

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-15 Thread Giuseppe Rossini
@anijain2305, thanks for the review! About getting rid of the legalization, I would not do that for now. It is in my backlog to go back to this issue and try to retrieve the strategy from the legalization pass. This should give us more optimization options. If that turns out not to be possible,

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-11 Thread Giuseppe Rossini
Hi @FrozenGene, I gave it another go, but switching legalization on the strategy seems very hard (since we would need the auto-tuner to pick the best data type for us). So for now we have to make do with the `_alter_conv2d_layout` workaround and try to think a bit more about how we can infer th

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-11 Thread Giuseppe Rossini
Hi @FrozenGene, I agree that different strategies should be available to the auto-tuner. See if the proposed solution is good enough for you (at least as a temporary workaround). For Armv7-A or NCHW, nothing changes; we follow exactly the previous path. For Armv8-A and NHWC we don't convert

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-11 Thread Giuseppe Rossini
So I mean to add a `convert_data_type` pass that is similar to `alter_op_layout` but converts datatypes (and we can do something like `if topi_impl == 'spatial_nhwc': convert to int16`). This doesn't seem possible directly in `alter_op_layout`, because only the shapes are passed to that funct

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-11 Thread Giuseppe Rossini
Hi @FrozenGene, The idea of adding the algorithm name to the attributes would work if the legalization step were run after we pick the strategy. It is instead run before, so it is unaware of the strategy picked. Maybe we could add a new pass that runs based on the strategy? Or we can hack in `

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-11 Thread Giuseppe Rossini
Hi @FrozenGene, Just to clarify: I am enjoying the discussion, and since the optimization space is wild, I agree that it is worth evaluating different approaches.
* About the Raspberry Pi + mobilenet v2: good to know you are working on Armv8-A (sorry to have assumed otherwise). However, there is still th

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-11 Thread Giuseppe Rossini
Hi @FrozenGene, About the code changes.
1) It will be hard to do this. The point is that the legalization is done in Relay before picking the strategy (thus, it is unaware of the strategy picked). To keep both legalizations I need to somehow pass information from the strategy (e.g., the name o

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-11 Thread Giuseppe Rossini
Hi @FrozenGene, Thanks a lot for your comments. I will address general replies here, and code comments in a separate reply.
* I did read your Discuss [post](https://discuss.tvm.ai/t/tflite-and-tvm-comparison-for-quantized-models/6577/4), but I thought the work was orthogonal to this one. M

Re: [apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-09 Thread Giuseppe Rossini
CC: @u99127 @anijain2305

[apache/incubator-tvm] [RFC] Improve quantized convolution performance for armv8 architectures (#5754)

2020-06-09 Thread Giuseppe Rossini
### RFC
This PR is based on the following RFC: https://discuss.tvm.ai/t/rfc-improve-quantized-convolution-performance-for-armv8-architectures/6920
### High level description of the submission
The main algorithm lives in:
* topi/python/topi/arm_cpu/conv2d_gemm.py (schedule)
* topi/python/topi/arm_

[TVM Discuss] [Development/RFC] [RFC] Improve quantized convolution performance for armv8 architectures

2020-06-09 Thread Giuseppe Rossini via TVM Discuss
# Motivation
In the current state, TVM float32 performance for armv8 architectures is comparable to that of frameworks like TFLite (which we will use as a reference throughout this RFC). However, our analysis shows that pre-quantized networks (i.e., when data and/or weights are transformed from float32