hey Arthur @stoa,
Great, here are some follow-ups:

### Shared activation pools

>> Could you say more about "it may be necessary that two models share their
>> 'activation' pools?" Are these separate instances of the same model or two
>> different models?
>
> Two different models may be deployed simultaneously in a target but do not
> necessarily run in parallel. In this case, one 'activation' pool can be
> allocated instead of two (of course big enough to accommodate the larger of
> the two models).
>
> On the other hand, two separate instances of the same model can share a
> single 'activation' pool (built-in, for example), or the application can
> allocate two different 'activation' pools, one per instance, if the two
> instances need to be run in parallel.

Cool, this makes sense to me. So the memory-pinning implementation will
perhaps need to export custom data types, or at least memory-pool sizing
information, to make this feasible.

### Firmware-facing API

> The main discussion point here is the application interface for deploying and
> using the packaged model. The packaging itself is well addressed by the Model
> Library Format RFC (see
> [below](https://discuss.tvm.apache.org/t/rfc-standalone-code-generation-and-c-runtime-for-stm32-bare-metal-devices/9562/3#code-emitter-vs-tir-based-approach)).
> The factory pattern aims at minimizing the API divergence for different
> deployment scenarios. The arguments for enforcing the generic factory pattern
> seem to be these:
>
> * To have the same mechanism for packaging and loading.
> * To let the users learn as little as possible.
>
> Of the two alternatives, we would prefer the API specialization for
> micro TVM. In the case of embedded ML there already exists an established
> API, such as
> [X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html) or
> [TensorFlow Lite for
> Microcontrollers](https://www.tensorflow.org/lite/microcontrollers), and the
> NXP tools expose a similar API as well; therefore, aligning the micro TVM
> API to the GraphRuntime is less relevant since users are already familiar
> with these embedded APIs. Specializing the micro TVM API also works well
> with the [Project API](https://discuss.tvm.apache.org/t/rfc-tvm-project-api/9449)
> concept.
>
> That said, our C runtime can also go with the factory pattern. In
> particular, we have the '*model descriptors*' that can be "loaded" at
> runtime, and they carry all necessary "meta"-information for each model.
> Based on this, the factory pattern could be implemented. However, given that
> we are in C, not C++, this would be special in terms of API and syntax, and
> therefore does not seem to make sense.

In my mind, some setup function is needed to accomplish:

1. initializing memory set aside for tensors and parameters
2. configuring accelerators, including starting (possibly) backgrounded
   transfers of any programming/parameters.

I think that the TVM function for this is the factory function (right now,
typically `mod["default"]()`), and the X-Cube equivalent is
`ai_[<model_name>_]create`. Does that match your understanding? (I've
included a rough sketch below of how I picture these lining up.)

Apologies, I think I was a bit confused before. IIUC, this port aims to
implement an API aligned with the X-Cube API, at least for now only aiming to
enable deployments to STM32--does that also seem right to you? I'm curious
whether this API aims to replace the C runtime and Model-based Module Runtime
Interface for all targets, or if this would just be confined to STM32 for now.
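To make sure we're talking about the same thing, here's a minimal sketch of
that setup step, loosely in the X-Cube style. Everything here is a
hypothetical illustration--the `ai_handle` layout, `ai_network_create`, and
the commented-out `network_tvm_runtime_init`--not an actual API from either
project:

```c
#include <stddef.h>

/* Hypothetical handle; the fields are illustrative only. */
typedef struct {
  void* activations;     /* application-owned 'activation' pool */
  void* runtime_handle;  /* opaque handle to the underlying runtime */
} ai_handle;

/* Hypothetical create call playing the role of both ai_<model_name>_create
 * and the TVM factory function: one-time setup before any inference runs. */
int ai_network_create(ai_handle* handle, void* activation_pool,
                      size_t pool_bytes) {
  /* 1. Pin tensor and parameter memory into the application-owned pool. */
  handle->activations = activation_pool;
  handle->runtime_handle = NULL;
  (void)pool_bytes;
  /* 2. Configure accelerators and start any backgrounded transfers of
   *    programming/parameters. A future TVM C AOT runtime init (name
   *    hypothetical) could slot in here:
   *    handle->runtime_handle = network_tvm_runtime_init(activation_pool,
   *                                                      pool_bytes);
   */
  return 0;
}
```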
Then the next questions I have would be around how you'd like to proceed with
this going forward. At present, the STM32 generator PR you've proposed has
several features that are missing from the microTVM compiler (e.g. memory
pinning, AOT, etc.). As we implement these features, will it be possible to
incorporate them into this generator as well (i.e. to take advantage of
compiler-level improvements we might be able to make, such as graph-level
optimization)? If so, it would be great to keep the STM32 API semantically
similar to the TVM C runtime API, so that we can later invoke TVM C runtime
APIs from the STM32 functions. I suspect these are pretty similar, but I just
want to understand the goals for code-reviewing your PR. One possible
scenario is: when we have a TVM AOT runtime and memory pinning available, we
could rework `ai_create` to instantiate the TVM C AOT runtime. It would also
be great to use the STM32 API as inspiration to expand the TVM APIs to
provide equivalent functionality. Please let me know your thoughts here!

> > 1. PackedFunc are looked-up by string name. This is inefficient in terms of
> > both memory and runtime. I think we still need to maintain that string
> > lookup to keep compatibility with the RPC server implementation which
> > drives autotuning. However, I wonder if we might consider making it a
> > convention to implement PackedFunc with particular symbol names so that
> > they could be called directly in production without string lookup.
>
> If I understand right, the main application must be able to look up operator
> functions via their string names. This can be implemented by providing an
> additional API method with the C runtime. Since it will be used with
> autotuning, we probably do not care as much about the performance of the
> string lookup and can allow the string compare, for example. Perhaps I did
> not get the point?

I think you mostly got it. Another clarification: while we haven't seen much
of this _yet_ in microTVM, when multiple TVM runtime Modules are present
(e.g. BYOC is such a case in C++), the calling convention between the modules
is PackedFunc. You see this today in that all generated operators have the
`TVMBackendPackedCFunc` signature. Technically, in the C++ runtime, when a
generated operator impl wants to call a PackedFunc from the same runtime
Module, it's supposed to invoke `TVMBackendGetFuncFromEnv` to do a string
lookup of the function. This allows, in the C++ runtime, accelerator control
to be done from Python, C++, etc.

In the C runtime, I think this is overkill and we should just ensure there is
a standard global symbol registered and call it--however, we would need to
qualify such a symbol with the module name (e.g. the model name, the same
thing passed to the runtime factory). Such a change would need an RFC, so we
haven't 100% gone down this path. In practice today, an easy workaround is to
use tensorization or to have BYOC emit `tir.call_extern` nodes, which bypass
the string lookup and directly call a function. But then those BYOC compilers
are responsible for the calling convention, a minor nuisance.
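For concreteness, here's a sketch of the two calling paths. The lookup path
uses the existing C backend/runtime APIs; the direct path assumes the (so far
only proposed) module-qualified symbol convention--`my_model_fused_nn_conv2d`
and `fused_nn_conv2d` are made-up names:

```c
#include <tvm/runtime/c_backend_api.h>
#include <tvm/runtime/c_runtime_api.h>

/* Path 1: string lookup, as the C++ runtime convention prescribes. */
static int call_via_lookup(void* mod, TVMValue* args, int* tcodes,
                           int num_args, TVMValue* ret, int* ret_tcode) {
  TVMFunctionHandle f;
  if (TVMBackendGetFuncFromEnv(mod, "fused_nn_conv2d", &f) != 0) {
    return -1;  /* lookup failed */
  }
  return TVMFuncCall(f, args, tcodes, num_args, ret, ret_tcode);
}

/* Path 2 (proposed): a module-qualified global symbol with the
 * TVMBackendPackedCFunc signature, resolved at link time -- no string
 * compare, no lookup table. */
extern int my_model_fused_nn_conv2d(TVMValue* args, int* tcodes, int num_args,
                                    TVMValue* ret, int* ret_tcode,
                                    void* resource_handle);
```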
> > 2. Arguments and return values need to be wrapped in TVMValue. I don't
> > think we can get around this one, but we could implement wrappers to the
> > firmware-facing executor functions to simplify this.
>
> I am not sure I understand the issue. Can we elaborate?

I think this will likely not apply here anymore, but just to explain: to call
any of our PackedFunc from C, the pattern right now is to use the
infrastructure in
[packed_func.h](https://github.com/apache/tvm/blob/main/include/tvm/runtime/crt/packed_func.h)
(e.g. instantiate a `TVMArgs` and encode the arguments in that structure).
This is really burdensome compared with a normal C function call. Should we
consider an improved, standardized C-facing TVM API, I would propose we wrap
this detail to hide it from the user.

> From our perspective, the TIR-based implementation is preferable, and when
> it is possible, we would like to move our code emitter there.

Great!

### Code Emitter vs TIR-based approach

> > 2. When available--use the TIR-based comprehensive memory planner (it seems
> > nearly identical to the one you've implemented, and would generate JSON
> > describing the memory pools).
>
> We thought that the 'storage_id' carried the results of the memory planner.
> Is there another mechanism?

Agree on this point as well. When we do tensor pinning, we'll implement a new
memory planner similar to the one you have, I think. We'll probably keep the
`storage_id` concept, but export additional information (e.g. `pool_id` to
identify which memory pool and `offset` to identify the start of the tensor
`data` field in that pool). `storage_id` would continue to identify the
shared memory space occupied by 1 or more tensors with disjoint lifecycles.

So my question here is: in the future, would you be open to using a TVM-side
implementation of a memory-pool, statically-allocated memory planner? I think
it sounds like that'd be okay, but just confirming.

>> 3. Ensure at least the TVMBackend* functions are used from the C runtime,
>> which provides a pathway to migrate to the TIR-based memory planner and
>> avoids diverging too far in terms of generated code.
>
> Tell me if this is what you meant?
>
> One important point from our implementation is that the memory is managed by
> the application via whatever method the application may choose. The C
> runtime does not perform any memory allocations (no `TVMBackendAlloc` or
> `TVMBackendFree`). As it is, our runtime does not provide memory allocation
> methods, but if there is a reason to do that (some sort of TVM storage), it
> can be hooked to the `TVMBackend*` functions. The C runtime does use
> `TVMBackendLastError`.

Yeah, roughly, that seems to match what I was implying. When we do tensor
pinning, I think it's likely I'll propose to add some `tensor_id` (note:
different from `storage_id`, as a `storage_id` could contain multiple
`tensor_id`) to `TVMBackendAllocWorkspace`, and a lookup table could just
return a pointer into the pre-allocated memory pool.
`TVMBackendFreeWorkspace` would become a no-op. Will that work for you guys?

>> Finally, I'd also propose we consider simplifying the C runtime API as
>> discussed in the Firmware-facing API section.
>
> Are there particular simplification points that you have in mind?

Mainly:

- consider removing the need to use PackedFunc looked-up by string name, and
  instead provide more natural C wrappers around those functions
- consider creating a mapping from PackedFunc string name to a global symbol
  name to shortcut this lookup, as they won't likely be dynamically
  overridden in embedded applications.
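As a sketch of what the first bullet could look like in practice (all names
here are hypothetical--`my_model_run` wrapping a generated packed function
called `my_model_run_packed`):

```c
#include <tvm/runtime/c_runtime_api.h>

/* Generated entry function with the packed signature (name hypothetical). */
extern int my_model_run_packed(TVMValue* args, int* tcodes, int num_args,
                               TVMValue* ret, int* ret_tcode,
                               void* resource_handle);

/* Natural C wrapper: the user passes DLTensors; the TVMValue packing
 * stays an internal detail. */
int my_model_run(DLTensor* input, DLTensor* output) {
  TVMValue args[2];
  int tcodes[2] = {kTVMDLTensorHandle, kTVMDLTensorHandle};
  TVMValue ret;
  int ret_tcode;
  args[0].v_handle = input;
  args[1].v_handle = output;
  return my_model_run_packed(args, tcodes, 2, &ret, &ret_tcode, NULL);
}
```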
### Testing and Code Location

>> Could you speak a bit more to how this code could be tested in the TVM CI?
>
> Good question. The demo application cannot be tested in hardware without an
> available board. However, we can provide a sanity check for the generated C
> code and the runtime layer that can be built on the host (x86). This way,
> the code emitter and runtime will be tested, but not the on-the-board
> application.
>
> As for the code location, the demo application is intended for the STM32
> users to start on TVM (as a company we distribute the CubeMX solution, with
> eventually the TVM integrated inside). A separate CubeMX project will most
> probably also exist, but I think it is important to have a clear demo
> project in the spirit of TVM (not hidden inside the CubeMX tool). We would
> go with `apps/microtvm/no-ci` or `apps/microtvm` with an x86 sanity check
> CI.
>
> We need to settle on this. What is your preference?

This would be fantastic. Would it be possible to check in a Docker container,
e.g. `tlcpack/ci-stm32`, which could run this in our CI? Then we can just
make it a first-class example and place it in `apps/microtvm/stm32` or a
similar sub-directory of `microtvm` of your choosing.

### Questions

> From what I understand, the [Model Library
> Format](https://discuss.tvm.apache.org/t/rfc-tvm-model-library-format/9121)
> is intended as a *deployment* format in TVM.

It's more intended to be a sort of API between TVM and project generators
such as this one. You can think of it as the data structure used in the
Project API to transmit the generated model (and potentially runtimes, in the
case of AOT) to the code emitter, which would be implemented by the Project
API. It's not so much intended that projects would deploy it as-is, but
rather that they use the on-disk locations as standard places from which to
consume the various artifacts from TVM. The reason it's on disk rather than
in-memory is so it can be easily exported to a user for debugging (e.g. what
did TVM do with my model?) and also for unit/integration testing of TVM and
BYOC.
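To give a feel for those on-disk locations, the layout sketched in the Model
Library Format RFC looks roughly like the tree below. This is reconstructed
from the proposal, not normative--and, as I note next, the as-built format
differs at least in that the C runtime directory was pulled out:

```
model_library.tar
├── metadata.json        # top-level metadata: model name, target, format version
├── codegen/
│   └── host/
│       └── src/         # generated C sources for the `host` target-key
├── parameters/          # serialized model parameters
├── runtime-config/
│   └── graph/
│       └── graph.json   # graph executor configuration
└── src/
    └── relay.txt        # Relay source of the compiled model
```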
> 1. The [Model Library
> Format](https://discuss.tvm.apache.org/t/rfc-tvm-model-library-format/9121)
> looks like a draft proposal (correct me if I am wrong here). Do we have a
> more formal document describing the format? For example, what are the
> contents of `runtime-config/aot`?

In TVM we tend to come to lazy consensus. The implemented format is about the
same as that proposal, with the exception that I pulled out the C runtime
(it's supplied separately in the Project API). You're right, though, that
it's very new and we need more documentation. There are a couple of ways we
will address this:

1. We have a [new RFC process](https://github.com/apache/tvm-rfcs/pull/2)
   we're adopting, which will place adopted RFCs in that repo (and I would
   then update the RFC to match as-built).
2. We'll create documentation on https://docs.tvm.ai when the Project API +
   Model Library Format both land in `main`.

> 2. The `host` vs `target-key`: I imagine that in the STM32 case, the
> generated sources, the `network.c` and the `operators.c`, go to the
> `host/src` directory, right? We also generate the `network_data.c` with
> params. I'd propose to place this with the `host/src` sources as well.

For now, `host` is the only `target-key`. I have to make another RFC about
this, but here is a sketch of my thinking--sorry, this part is still a bit
rough. The idea is that an accelerator (or a coprocessor with a different
architecture) could have a different `target-key`. Programmable accelerators
would get 1 `target-key` per program (but a program may live on many
instances--maybe you have 3 instances with a conv2d program and 2 more with a
maxpool program; in this case, 2 target-keys would exist, e.g. `accel-conv2d`
and `accel-maxpool`). These directories could also contain host-executed
control source code (e.g. the PackedFunc or extern func to launch compute),
but that code would be dedicated to operating those accelerators.

> 3. The generated C code targets a standalone runtime API, which is different
> from the TVM-built GraphRuntime API from the `crt`. Should we populate the
> `crt` with the standalone C runtime code instead? Minor point: the Makefile
> is not generated by the standalone code emitter since it is included from an
> external project.

Actually, I deleted that `crt` directory in the final impl--sorry, this was
not at all clear from that RFC. I'll update the thread to clarify. I think,
given my comment above about consuming the Model Library Format, you won't
need to worry about populating it.

> As I explained earlier, we will put in place minimal sanity testing of the
> generated C model and its runtime on the CI host.
>
> In addition, we work with the Linaro foundation, and they have a farm of HW
> boards they use for their CI. Linaro are also looking into micro TVM, and it
> seems reasonable to try to find common ground where TVM could use the Linaro
> infrastructure for micro TVM development. I am adding @vinceab, our Linaro
> person, to this thread.

Great, I think that sounds quite reasonable. We aren't likely to be able to
put non-cloud hardware in the loop for the TVM CI, but having a nightly
should hopefully suffice.

I think this should answer most of your questions--let me know if I've missed
any! This seems like a great direction for microTVM.

Andrew