@Mousius @tqchen @stoa
Great discussions here, and apologies for my delay in replying! I'm adding some thoughts here as context for reviewing PR 8023, and because I left this thread hanging.

> Thinking about how we expose this to the user, here is a proposed behaviour:
>
> * Default behaviour of AOT should be A0+C0 to be compatible with the rest of the ecosystem
> * With an optional parameter `--no-typed-operators` TVM would instead produce A1+C1 internal operators but leave the A0+C0 entrypoint

I agree with the A1 + C1 idea for internal functions, plus a way to provide a C0 interface when an internal function is called externally. It might be nice to (eventually) have a way to generate this interface for arbitrary functions. That way, AOT (or operator functions) could be called from the TVM RPC server.

Adding `--no-typed-operators` makes sense to me, but I would propose changing the name. `--no-typed-operators` reads pretty generically to me and could imply something like "operator + is aware of the types of its arguments," but that's always true. I'd suggest two modifications:

1. Given we are abusing Target to hold runtime-specific information, let's choose a name for which the default 0 value preserves the existing behavior. `--typed-operators` defaults to `true` here, but we can't document that properly (in `target_kind.cc`) since we are abusing Target.
2. Let's choose a more specific name, such as `--dltensor-only-function-signatures`, or alternatively something that references "call_unpacked," maybe `--unpacked-api`? I don't know that `--typed-signatures` fully encapsulates the effect of the flag, since it not only removes `type_code` but also assumes there is a `DLTensor` for each argument.

> * With an optional parameter `--micro-entrypoint` TVM (in AOT mode) would switch to producing a A1+C2 entrypoint at the top level. This has no effect on Graph execution as it doesn’t have such an entrypoint.

I think it'd be interesting to generalize `--micro-entrypoint` to generating C0 wrappers for functions. Then you could use AOT with host-driven execution, which could be handy for prototyping.

To add some future-looking thoughts, which don't concern the immediate implementation here: I am curious whether we may eventually be able to implement both the embedded deployment API and a version of the [Module-based Model Runtime Interface](https://discuss.tvm.apache.org/t/discuss-module-based-model-runtime-interface/5025) in TIR, as was proposed in the original [Relay AOT](https://discuss.tvm.apache.org/t/guideline-relay-aot/5977) guideline post. Doing this would remove wrapper code and consolidate the implementation into generated code. Then we could just generate the correct (e.g. C0 or C1/C1-typed-wrapper) API depending on the runtime scenario (e.g. RPC-driven or standalone deployment).

Otherwise, I would think that for prototyping it would be interesting to explore a generic PackedCFunc wrapper generator to enable use of the AOT and operator functions from the µTVM RPC server.

If we go this route, my proposal would be to design the embedded-facing deployment API to enable application-defined memory management (e.g. following @stoa's proposal), and then build a wrapper around this API to the PackedFunc-based Model Runtime Interface (potentially tweaking as necessary). In practice, I suspect this wouldn't be a significant departure from the embedded deployment API. The main changes we would need to make are around initialization, where application-specified memory needs to be used rather than dynamically-allocated memory. However, this may mean that there are two "consumers" of the AOT TIR output.

I'll raise a follow-on RFC around these points as we get closer to landing the deployment API.
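
To make the wrapper idea above a bit more concrete, here is a minimal C sketch of a packed (C0-style) wrapper around an unpacked (A1+C1-style) operator. Everything in it is an assumption for illustration: `SketchTensor`, `SketchValue`, `kSketchTensorHandle`, and both function names are hypothetical stand-ins rather than TVM's real types or generated symbols, and the wrapper signature is only loosely modeled on the packed C function convention (argument array, type codes, argument count, return slot, resource handle).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a DLTensor-like struct (not the real DLPack type). */
typedef struct {
  void* data;      /* pointer to the element buffer */
  int32_t ndim;    /* number of dimensions */
  int64_t* shape;  /* extent of each dimension */
} SketchTensor;

/* Hypothetical stand-in for a packed argument value. */
typedef union {
  int64_t v_int64;
  double v_float64;
  void* v_handle;
} SketchValue;

/* Hypothetical type code meaning "this argument is a tensor handle". */
enum { kSketchTensorHandle = 7 };

/* Unpacked operator: typed tensor pointers, no type codes.
 * It trusts its caller, which is where the size savings come from. */
int add_one_unpacked(SketchTensor* in, SketchTensor* out) {
  assert(in != NULL && out != NULL);
  int64_t n = 1;
  for (int32_t i = 0; i < in->ndim; ++i) {
    n *= in->shape[i];
  }
  const float* src = (const float*)in->data;
  float* dst = (float*)out->data;
  for (int64_t i = 0; i < n; ++i) {
    dst[i] = src[i] + 1.0f;
  }
  return 0;
}

/* Packed wrapper: validates the argument count and type codes once,
 * then forwards to the unpacked operator. Returns -1 on a bad call. */
int add_one_packed(SketchValue* args, int* type_codes, int num_args,
                   SketchValue* ret_value, int* ret_type_code,
                   void* resource_handle) {
  (void)ret_value;
  (void)ret_type_code;
  (void)resource_handle;
  if (num_args != 2) {
    return -1;
  }
  for (int i = 0; i < num_args; ++i) {
    if (type_codes[i] != kSketchTensorHandle || args[i].v_handle == NULL) {
      return -1;
    }
  }
  return add_one_unpacked((SketchTensor*)args[0].v_handle,
                          (SketchTensor*)args[1].v_handle);
}
```

In this sketch only the wrapper pays for argument checking, so internal calls between operators can use the unpacked form directly while an RPC-driven caller goes through the packed one.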
Here, I mainly wanted to provide some thoughts on future directions that could affect `--micro-entrypoint`, in response to @Mousius's comment:

> It’s worth noting I think it’s worth getting this right at the user level so we don’t need to change it much when arguments get refactored.

We could consider broadening `--micro-entrypoint` to `--micro-api` or some equivalent should we go this route.

### Dynamic shapes

> That in mind, should there be sufficient space for them to run in this way, they can always turn the option off and incur the penalty for the dynamic behaviour?

It seems like there are cases where there is some constrained variability in model input and output shapes. I agree we should prioritize fixed model sizes, but it would be great to ensure that the API could handle some limited dynamism if needed. We don't need to address this in the initial PR, since we're wrapping everything now, but it would be great to consider in the Runtime Interface RFC.

### Header Dependencies

> I’m considering it more in terms of dependencies you have to move across to include in your project, particularly in terms of raw files; `standalone_crt` contains a lot of files that I’m unlikely to need in a deployment, I wouldn’t expect to integrate all of them into an embedded system, particularly if I have strict requirements for code being included.

My opinion is that we should just be explicit about what you need to include and make it easy to do that (e.g. place only those include files in some dir with an appropriate prefix). I think it's always possible to strip code down, but we should make it simple to get started, particularly as users may revise models frequently. The `standalone_crt` bundle is a start in that direction--we may need to split it further as we introduce the embedded C runtime interface.

I don't think it's a big deal to define extra unused structs, types, or static functions, and I would not consider that a sufficient condition to create two sets of header files. I think the added complexity for the end user outweighs the benefit of simpler code.

### Embedded C runtime interface

I'll follow up on that RFC with discussion about the API itself.

### Moving forward

Just to state my idea of where I think things are headed here, and to remove any confusion caused by my late reply:

1. Let's work to merge PR 8023 (I may propose some name changes to those Target flags).
2. Let's continue the discussion on the [Embedded C Runtime Interface](https://discuss.tvm.apache.org/t/rfc-utvm-embedded-c-runtime-interface/9951/5) and arrive at an acceptable deployment interface.
3. When these two building blocks have landed, we can consider future directions around host-driven AOT, dynamic shapes, and whether we should implement the user-facing AOT interfaces in TIR.