Hi @Mousius,

Thanks for your proposal and your detailed evaluation of its impact on runtime 
resources! I agree these improvements will simplify the generated code and 
improve performance. Some thoughts below:

> When we packed and unpack values, we make use of the `data` portion of a 
> DLTensor but nothing else, this leads to a lot of the structure being unused. 
> In embedded systems space can become an absolute premium and this unused 
> structure consumes precious bytes.

One consideration is that removing metadata such as `shape` will constrain our 
ability to implement models with more complex runtime requirements, such as 
dynamic shapes. Dynamic shapes need significant consideration when 
implementing in constrained spaces such as µTVM, but it would be great to 
ensure these optimizations don't add a barrier to a proof of concept. By that I 
just mean I'd like to ensure we retain a path to keep DLTensor support 
independent of the other parts of this proposal (e.g. the API changes), if 
possible. This may mean we need to analyze which fields of the DLTensor are 
actually used and pass each of those fields as an argument, roughly as sketched 
below.
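
To make that concrete, here is a rough sketch of the difference between handing 
an operator a full DLTensor versus only the fields it actually reads. The 
function names, shapes, and signatures are made up for illustration and are not 
the actual AOT codegen output:

```c
/* Illustrative only: names and shapes are hypothetical, not real TVM output. */
#include <stdint.h>
#include <dlpack/dlpack.h>  /* assumes the DLPack header is on the include path */

/* Today's style: the operator receives whole DLTensors but only reads ->data. */
int32_t conv2d_dltensor_style(DLTensor* input, DLTensor* output) {
  const float* in = (const float*)input->data;
  float* out = (float*)output->data;
  /* ... kernel body uses only the data pointers ... */
  (void)in; (void)out;
  return 0;
}

/* Field-level style: pass only what the kernel needs, e.g. the data pointers
 * plus an individual shape entry if one dimension happens to be dynamic. */
int32_t conv2d_unpacked_style(const float* in_data, float* out_data,
                              int64_t in_height /* dynamic dim, if any */) {
  /* ... same kernel body, no DLTensor structure required ... */
  (void)in_data; (void)out_data; (void)in_height;
  return 0;
}
```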

> These represent not just code sizes but also cycle times and power usage, 
> further to this the stack savings would allow such a model to run under 
> Zephyr on an M0 which by default is allocated only small stack sizes 

There are a lot of impacts on various performance numbers here, but in my 
opinion the stack impact is outsized compared with the others. I agree we 
should investigate this on that basis alone.

> One issue is that device_type and device_id are checked later and must be 
> bound to pass the invariant checks.

It seems like we should hoist those checks up into the AOT function, in 
whichever pass modifies the function signatures (a rough sketch of what I mean 
is below). I don't think we should move away from `device_type`/`device_id` at 
this point, but I would like to explore a way to more accurately represent 
`device_type` to the compiler, such that multiple different devices (e.g. BYOC 
or CPU) can be described in terms familiar to the user.
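
For illustration, a minimal sketch of what hoisting might look like; the 
function name, error code, and constants here are placeholders rather than the 
real TVM C runtime API:

```c
#include <stdint.h>

#define EXPECTED_DEVICE_TYPE 1   /* placeholder for e.g. kDLCPU */
#define ERR_WRONG_DEVICE     -1  /* placeholder error code */

/* Hypothetical AOT entry point: the device invariants are checked once here,
 * so the generated operators it calls no longer need to repeat the check. */
int32_t run_model(void* input, void* output,
                  int32_t device_type, int32_t device_id) {
  if (device_type != EXPECTED_DEVICE_TYPE || device_id != 0) {
    return ERR_WRONG_DEVICE;
  }
  /* ... calls into the generated operators ... */
  (void)input;
  (void)output;
  return 0;
}
```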

> * What should the flag be called? `--unpack-functions`, `--tiny`, 
> `--no-runtime`, `--micro`, etc?

This is really only an option with the AOT executor. Maybe 
`--aot-use-typed-signatures` or something? I'd also like to address the 
comments from the AOT PR about pushing executor and runtime options out of 
Target. If we do this, it may be possible to group this under e.g. 
`aot_executor_opts`, and then perhaps another name could make more sense.

@tqchen:
> As we can see that even with the same PackedFunc API, as long as we can do 
> proper inlining, allocating DLTensor and other items on stack, the resulting 
> function call can be reduced to the same function as the minimum non-packed 
> version.

A common problem in embedded systems is that enabling the optimizer can change 
things such as function timing, which can result in drastically different 
behavior of the SoC. In some cases, while a SoC is under development, it may 
not function with optimization on (or vice versa), so developers may not be 
free to simply choose whichever optimization level suits them.

Another common case is debugging TVM-generated code, which is often done with 
optimizations off. While you're certainly right that modern compilers may be 
able to remove unused portions of DLTensor at `-O2`, relying on this behavior 
means that a developer's first task when debugging generated code at `-O0` is 
to work out why they're seeing a stack protection fault and then increase the 
stack size. This is not a desirable property for embedded deployment, and it's 
a big ask, particularly when the code they are trying to debug is likely in the 
same file (see the sketch below for where the `-O0` stack cost comes from).
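
To show roughly where that stack cost comes from, here is an illustrative 
caller-side sketch. The struct is a trimmed stand-in for DLTensor rather than 
the real dlpack definition, and the byte counts are back-of-the-envelope 
figures for a 32-bit MCU:

```c
#include <stddef.h>
#include <stdint.h>

/* Trimmed stand-in for DLTensor (illustrative, not the real dlpack struct). */
typedef struct {
  void* data;
  int32_t device_type;
  int32_t device_id;
  int32_t ndim;
  uint8_t dtype_code;
  uint8_t dtype_bits;
  uint16_t dtype_lanes;
  int64_t* shape;
  int64_t* strides;
  uint64_t byte_offset;
} FakeDLTensor;

/* Packed-style call site: two tensor structs plus their shape arrays live on
 * the stack. At -O2 a compiler can often elide most of this; at -O0 it cannot,
 * so every operator call adds well over a hundred bytes of stack even though
 * the callee only ever reads `data`. */
int32_t call_op_packed_style(int32_t (*op)(FakeDLTensor*, FakeDLTensor*),
                             float* in, float* out) {
  int64_t in_shape[4] = {1, 3, 224, 224};
  int64_t out_shape[4] = {1, 8, 222, 222};
  FakeDLTensor a = {in, 1, 0, 4, 2, 32, 1, in_shape, NULL, 0};
  FakeDLTensor b = {out, 1, 0, 4, 2, 32, 1, out_shape, NULL, 0};
  return op(&a, &b);
}

/* Unpacked-style call site: only the two data pointers are passed. */
int32_t call_op_unpacked_style(int32_t (*op)(float*, float*),
                               float* in, float* out) {
  return op(in, out);
}
```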

The other thing to point out here: this proposal affects the top-level 
interwork code that links operators together. It is in our interest to make 
this code as simple and straightforward to understand as possible. This point 
is so important that we should probably also support a mode where the AOT 
top-level function is exported in C and the rest in LLVM.

For this reason, I am not in favor of relying on downstream compilers to clean 
up the stack for us. We should emit code which is straightforward and 
uncluttered, so that we don't introduce hidden dependencies on particular 
compilers which may come back to bite us later as we attempt to broaden µTVM 
support to other devices.

### Other discussion topics

> This means that by reducing the overall dependencies, such as removing the 
> need for DLPack, we can reduce the amount of foreign code required.

I'm not sure I agree that DLPack is a huge overhead in terms of build 
dependencies here, since it's already included in the `standalone_crt` bundle. 
Could you clarify?
