hi @Mousius,
thanks for your proposal and your detailed evaluation of its impact on runtime resources! I agree these improvements will simplify the generated code and improve performance. Some thoughts below:

> When we pack and unpack values, we make use of the `data` portion of a
> DLTensor but nothing else; this leads to a lot of the structure being unused.
> In embedded systems space can become an absolute premium and this unused
> structure consumes precious bytes.

One consideration is that removing metadata such as `shape` will constrain our ability to implement models with more complex runtime requirements, such as dynamic shapes. I think dynamic shapes need significant consideration before they can be implemented in constrained spaces such as µTVM, but it would be great to ensure we don't add a barrier to a PoC with these optimizations. By that, I just mean I'd like to ensure we retain a path to keep DLTensor support independent of the other parts of this proposal (e.g. the API changes), if possible. This may mean we need to e.g. analyze which fields of the DLTensor are actually used and pass each such field as an argument.

> These represent not just code sizes but also cycle times and power usage;
> further to this, the stack savings would allow such a model to run under
> Zephyr on an M0 which by default is allocated only small stack sizes

There are a lot of impacts on various performance numbers here, but in my opinion the stack impact is outsized compared with the others. I agree we should investigate this on that basis alone.

> One issue is that device_type and device_id are checked later and must be
> bound to pass the invariant checks.

It seems like we should hoist those checks up into the AOT function in whichever pass modifies the function signatures. I don't think we should move away from `device_type`/`device_id` at this point--but I would like to explore a way to more accurately represent `device_type` to the compiler, such that multiple different devices (e.g. BYOC or CPU devices) can be described in terms familiar to the user.

> * What should the flag be called? `--unpack-functions`, `--tiny`,
>   `--no-runtime`, `--micro`, etc.?

This is really only an option with the AOT executor. Maybe `--aot-use-typed-signatures` or something? I'd also like to address the comments from the AOT PR about pushing executor and runtime options out of Target. If we do this, it may be possible to group this under e.g. `aot_executor_opts`, and then perhaps another name would make more sense.

@tqchen:

> As we can see that even with the same PackedFunc API, as long as we can do
> proper inlining, allocating DLTensor and other items on stack, the resulting
> function call can be reduced to the same function as the minimum non-packed
> version.

It is often a problem in embedded systems that enabling the optimizer changes things like the timing of functions, which can result in drastically different behavior of the SoC. In some cases, while under development, an SoC may not function with optimization on (or vice versa), so developers may not simply be able to choose whichever optimization level suits them. Another common case is debugging TVM-generated code, which is typically done with optimizations off. While you're certainly right that modern compilers may be able to remove unused portions of DLTensor at `-O2`, relying on this behavior means that a developer's first task in debugging generated code is to understand why they're seeing a stack protection fault, and then increase the stack size, when compiling with `-O0`. A rough sketch of the per-call stack a packed-style call carries at `-O0` is shown below.
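To make that concrete, here is a minimal hand-written sketch--not TVM's actual generated code--contrasting a packed-style call with a direct call. `run_conv_packed`, `run_conv_direct` and the `Value` union are hypothetical stand-ins for the real calling convention; the point is only that the DLTensor structs and the value/type-code arrays live on the caller's stack unless the optimizer removes them:

```c
#include <stdint.h>
#include <dlpack/dlpack.h>

/* Hypothetical stand-in for TVMValue, large enough to hold any argument. */
typedef union { int64_t v_int64; double v_float64; void* v_handle; } Value;

/* Packed style: every argument is wrapped before the call. At -O0 both
 * DLTensors (with their unused shape/strides/dtype/device metadata), the
 * argument array and the type-code array all occupy the caller's stack. */
static int run_conv_packed(int (*op)(Value*, int*, int), void* in, void* out) {
  DLTensor in_t = { .data = in };   /* remaining fields zero-initialised, unused */
  DLTensor out_t = { .data = out };
  Value args[2] = { { .v_handle = &in_t }, { .v_handle = &out_t } };
  int type_codes[2] = { 0, 0 };     /* placeholder type codes for illustration */
  return op(args, type_codes, 2);
}

/* Unpacked style: only the data pointers cross the call boundary. */
static int run_conv_direct(int (*op)(void*, void*), void* in, void* out) {
  return op(in, out);
}
```

At `-O2` a compiler can usually collapse the first form into the second, but at `-O0` the extra structures stay live for the duration of the call.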
Relying on the optimizer to clean this up is not a desirable property for embedded deployment, and it's a big ask of developers, particularly when they are trying to debug generated code at `-O0`, which is likely in the same file.

The other thing to point out here: this proposal affects the top-level interworking code that links operators together. It is in our interest to make this code as simple and straightforward to understand as possible. This point is so important that we should probably also support a mode where the AOT top-level function is exported in C and the rest in LLVM.

For this reason, I am not in favor of relying on downstream compilers to clean up the stack for us. We should emit code which is straightforward and uncluttered, so that we don't introduce hidden dependencies on particular compilers which may come back to bite us later as we attempt to broaden support for µTVM on other devices.

### Other discussion topics

> This means that by reducing the overall dependencies, such as removing the
> need for DLPack, we can reduce the amount of foreign code required.

I'm not sure I agree DLPack is a huge overhead in terms of build dependencies here--it's included in the `standalone_crt` bundle. Could you clarify?
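To illustrate why I read it that way, here is a minimal sketch (the `tensor_payload` helper is hypothetical) of what consuming DLTensor actually requires: the header-only `dlpack.h` that ships with `standalone_crt`, and nothing extra to link against:

```c
#include <stddef.h>
#include <dlpack/dlpack.h>   /* header-only: type definitions, no object code */

/* Hypothetical helper: the interface code only needs the payload pointer;
 * everything else in the struct is metadata. */
static void* tensor_payload(const DLTensor* tensor) {
  return tensor ? tensor->data : NULL;
}
```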