[Apache TVM Discuss] [Development] [RFC] Standalone Code Generation and C Runtime for STM32 bare-metal devices

Andrew Reusch via Apache TVM Discuss Mon, 29 Mar 2021 20:57:18 -0700


hi @stoa,

Thanks for the elaborate RFC here! You bring up a bunch of great points.

This is a really strong proposal and I think overall fairly well aligned with
the direction I want to take microTVM. Particularly since similar code has been
posted to the forum before, it would be great to have a discussion around the
implementation details here.

For the purposes of discussion, let's break this proposal apart into pieces:

P1. Code Emitter (e.g. Executor implementation or GraphRuntime replacement)

P2. Tensor memory allocation

P3. The firmware-facing API

Finally, I'd like to discuss ways to reduce code duplication and avoid
splintering the overall design of µTVM. In particular, it seems like this could
become a Project API implementation. I'll leave some thoughts below on each
piece.

### Code Emitter

This approach is similar to some others posted to the forum before:
- [µTVM Static Code
Generator](https://discuss.tvm.apache.org/t/tvm-static-runtime-code-generator/8986)
by @r.stahl
- [my hack to do
this](https://github.com/areusch/incubator-tvm/tree/aot-experiment)

In general, I think the direct-to-C++ route (as compared with the TIR route) is
simple and easy to hack on, but the TIR route lends us more avenues for
graph-level optimization. However, I don't think that the accessibility should
be understated--TVM has a pretty steep learning curve. I think the challenge
with checking this code into the TVM repository is testing and maintenance, as
I'll discuss later.

### Tensor Memory Allocation

This looks very similar to what I'd propose we implement in the TIR-based
GraphPlanMemory pass. A couple of thoughts:

- Does your approach handle workspace memory, allocated inside kernels (e.g.
TVMBackendAllocWorkspace)?
- Could you say more about "it may be necessary that two models share their
'activation' pools"? Are these separate instances of the same model or two
different models?
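
To make the workspace question concrete, here is a minimal sketch of a statically-sized workspace pool with stack (LIFO) discipline. All names (`demo_alloc_workspace`, `demo_free_workspace`, `demo_workspace_in_use`, the pool size) are hypothetical; the parameter lists are only modeled on `TVMBackendAllocWorkspace` / `TVMBackendFreeWorkspace`, and this is not the C runtime's actual allocator nor the one in the RFC.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and sizes throughout; not TVM's implementation. */
#define DEMO_WORKSPACE_BYTES 1024u
#define DEMO_ALIGN 8u

/* The union forces 8-byte alignment of the static pool. */
static union {
  uint8_t bytes[DEMO_WORKSPACE_BYTES];
  uint64_t force_align;
} demo_pool;

static size_t demo_top = 0; /* offset of the first free byte */

/* Parameter list modeled on TVMBackendAllocWorkspace; hints are ignored. */
void* demo_alloc_workspace(int device_type, int device_id, uint64_t nbytes,
                           int dtype_code_hint, int dtype_bits_hint) {
  (void)device_type;
  (void)device_id;
  (void)dtype_code_hint;
  (void)dtype_bits_hint;
  if (nbytes > DEMO_WORKSPACE_BYTES) return NULL; /* reject before rounding */
  /* Round the request up to a multiple of DEMO_ALIGN. */
  size_t rounded =
      ((size_t)nbytes + (DEMO_ALIGN - 1u)) & ~(size_t)(DEMO_ALIGN - 1u);
  if (rounded > DEMO_WORKSPACE_BYTES - demo_top) return NULL; /* pool full */
  void* ptr = &demo_pool.bytes[demo_top];
  demo_top += rounded;
  return ptr;
}

/* Frees must arrive in reverse allocation order: freeing a block also
 * releases everything allocated after it. */
int demo_free_workspace(int device_type, int device_id, void* ptr) {
  (void)device_type;
  (void)device_id;
  uintptr_t base = (uintptr_t)demo_pool.bytes;
  uintptr_t p = (uintptr_t)ptr;
  if (p < base || p >= base + demo_top) return -1; /* not a live block */
  demo_top = (size_t)(p - base);
  return 0;
}

size_t demo_workspace_in_use(void) { return demo_top; }
```

A pool like this has a fixed worst-case size, so in principle it could be accounted for alongside the activation pools, provided kernels free their workspaces in reverse order of allocation.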

### Firmware-facing API

TVM does have a standard object-oriented [Module-based Model Runtime
Interface](https://discuss.tvm.apache.org/t/discuss-module-based-model-runtime-interface/5025)
RFC. This one is based around our PackedFunc concept, heavily used in the C++
runtime as a language-agnostic abstraction. In firmware we certainly don't need
such an abstraction. Somewhat related, [issue
7596](https://github.com/apache/tvm/issues/7596) is considering how to
implement PackedFunc calls in the C backend.

Next, I agree that the C runtime API isn't very friendly for firmware
developers. There are a couple pieces here:

1. PackedFuncs are looked up by string name. This is inefficient in terms of
both memory and runtime. I think we still need to maintain that string lookup
to keep compatibility with the RPC server implementation which drives
autotuning. However, I wonder if we might consider making it a convention to
implement PackedFunc with particular symbol names so that they could be called
directly in production without string lookup.
2. Arguments and return values need to be wrapped in TVMValue. I don't think we
can get around this one, but we could implement wrappers to the firmware-facing
executor functions to simplify this.
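
As a rough sketch of both points, the snippet below shows one function reachable two ways: through a string-keyed table (the path an RPC server would use) and through a plain typed wrapper that calls a conventionally named symbol directly and hides the value boxing. Everything here is hypothetical: `DemoValue` is a simplified stand-in for `TVMValue`, and `DemoPackedFunc`, `demo_packed_add`, `demo_lookup` and `demo_add` are illustrative names, not the C runtime's API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for TVMValue. */
typedef union {
  int64_t v_int64;
  double v_float64;
  void* v_handle;
} DemoValue;

enum { DEMO_TCODE_INT = 0 };

/* Hypothetical PackedFunc-style calling convention. */
typedef int (*DemoPackedFunc)(DemoValue* args, int* type_codes, int num_args,
                              DemoValue* ret_value, int* ret_tcode);

/* Exported under a conventional symbol name so firmware can link to it
 * directly instead of going through a lookup. */
int demo_packed_add(DemoValue* args, int* type_codes, int num_args,
                    DemoValue* ret_value, int* ret_tcode) {
  if (num_args != 2 || type_codes[0] != DEMO_TCODE_INT ||
      type_codes[1] != DEMO_TCODE_INT) {
    return -1;
  }
  ret_value->v_int64 = args[0].v_int64 + args[1].v_int64;
  *ret_tcode = DEMO_TCODE_INT;
  return 0;
}

/* String-keyed table, kept only for callers that need lookup by name. */
typedef struct {
  const char* name;
  DemoPackedFunc fn;
} DemoRegistryEntry;

static const DemoRegistryEntry demo_registry[] = {
    {"demo.add", demo_packed_add},
};

DemoPackedFunc demo_lookup(const char* name) {
  for (size_t i = 0; i < sizeof(demo_registry) / sizeof(demo_registry[0]);
       ++i) {
    if (strcmp(demo_registry[i].name, name) == 0) return demo_registry[i].fn;
  }
  return NULL;
}

/* Firmware-facing wrapper: plain C types in, plain C types out.
 * The boxing into DemoValue happens here, and no string lookup occurs. */
int demo_add(int64_t a, int64_t b, int64_t* out) {
  DemoValue args[2];
  int type_codes[2] = {DEMO_TCODE_INT, DEMO_TCODE_INT};
  DemoValue ret;
  int ret_tcode = -1;
  args[0].v_int64 = a;
  args[1].v_int64 = b;
  if (demo_packed_add(args, type_codes, 2, &ret, &ret_tcode) != 0 ||
      ret_tcode != DEMO_TCODE_INT) {
    return -1;
  }
  *out = ret.v_int64;
  return 0;
}
```

With this split, a production build could drop the table and the name strings entirely, while a tuning build keeps them for name-based dispatch.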

I wonder if there are other differences or critiques you could find of the C
runtime that would improve it? It would be great to at least standardize the
runtime between these two implementations. This would be in a follow-on RFC,
though.

### Code Emitter vs TIR-based approach

Given that a number of features implemented in this RFC are on the µTVM roadmap
(but intended to be implemented at the TIR level), I think the main difference
in the long run here is that this RFC directly generates C++ code rather than
passing TIR to the `c` backend. I think there are merits to both this approach
and the TIR-based AOT being implemented by @giuseros.

As discussed in the Code Emitter section, I do think that the TIR-based approach
gives us more future avenues to develop µTVM. However, I don't want to ignore
how accessible approaches like these are.

Relative to `main` right now, this RFC has a bunch of things that we don't
have: AOT, memory pinning, API changes. It seems like we could allow an
implementation like this to coexist as a Project API with roughly these steps:
1. Rework the PoC to consume Model Library Format and implement the Project
API. Regarding the question of whether this should be applicable to autotuning
or also to deployment: my thought was that this would be decided by the project
API implementation (either create an option or a separate implementation for
each scenario).
2. When available, use the TIR-based comprehensive memory planner (it seems
nearly identical to the one you've implemented, and would generate JSON
describing the memory pools).
3. Ensure at least the TVMBackend* functions are used from the C runtime, which
provides a pathway to migrate to the TIR-based memory planner and avoids
diverging too far in terms of generated code.

Finally, I'd also propose we consider simplifying the C runtime API as
discussed in the Firmware-facing API section.

### Testing and Code Location

Could you speak a bit more to how this code could be tested in the TVM CI?
That's my chief concern with checking it in as a Project API implementation. I
posted some
[thoughts](https://discuss.tvm.apache.org/t/rfc-tvm-project-api/9449/4) about
the bar to checking in Project API implementations to the tvm repo.

Some discussion points:

D1. Between this approach and a TIR-based AOT, which would you guys prefer to
work with, assuming both were implemented?

D2. While the Python APIs are perfectly fine, one goal of Model Library Format
is to enable downstream tools such as this to work with TVM with less API
drift. Do you guys prefer the Python API, or would this also be an interface
you'd be open to consuming?

D3. In general, the challenge with checking code such as this into the TVM repo
is testing. Particularly with bare-metal code, it's hard to test without
hardware in the loop, and the TVM CI doesn't really have a provision for that
now. Do you guys have a proposal for how we might test this code?

