# Standalone code generator and C runtime for STM32 bare-metal devices

## Background

This RFC aims to collect TVM community feedback on the following subjects:
- Standalone compilation targeting embedded bare-metal platforms
- ML user API for embedded applications
- Integration of TVM with standard embedded development tools and
  projects

The RFC falls into the microTVM line of development and complements projects
outlined in
the [µTVM M2 
Roadmap](https://discuss.tvm.apache.org/t/tvm-microtvm-m2-roadmap/8821), in 
particular these two:
- [AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206),
  which proposes a standalone code generator for embedded targets and has
  been under discussion in the TVM community for a while now.
- [Project API](https://discuss.tvm.apache.org/t/rfc-tvm-project-api/9449), a
  recent RFC proposing a standard "interface layer" between the TVM and the 
  generated embedded firmware code.

This RFC has an associated [PR](https://github.com/apache/tvm/pull/7742)
implementation, including a demo application that has been tested on a number
of ML models with an STM32 Discovery ARM-based development board.
The [PR](https://github.com/apache/tvm/pull/7742) also serves as a
proof of concept for the ideas outlined in the above
[AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206) RFC.

## Objectives

The first objective of this proposal is to move forward in implementing the
standalone compilation flow from TVM targeting embedded and bare-metal
devices.
As stated in the
[AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206) RFC,
having to interpret a JSON graph at runtime is a problem in
embedded and bare-metal environments:

 - The workflow is very hard to implement on a microcontroller, since memory
   is usually a costly resource in embedded environments, and the JSON file is
   usually quite large.
 - The memory allocation in the current TVM stack is split, with inter-operator
   memory managed at the JSON/Relay level while the intra-operator memory is
   managed at the TIR level.

Additionally,

 - JSON handling incurs extra processing overhead
 - Dynamic library handling incurs extra processing and memory overhead
 - Data placement in memory, given a very diversified and specialized set of
   memory hierarchies, is difficult to handle.

Indeed, the embedded application deployment flow is different from TVM's
module deployment via a JSON graph and a dynamically loaded operators library.
A typical application deployment in resource-constrained embedded
environments is done by downloading a standalone binary executable image onto
the target device.
From the user perspective, the ML model is embedded inside a larger main
application. In such an environment, resource management (memory, etc.) is
handled by this main application.

This issue was first addressed
in the [AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206)
RFC, which proposes generating a standalone C implementation of ML models
and defining an associated C runtime API.
Our RFC proposal differs from the
[AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206) RFC in two
ways:
- Our approach is more lightweight in terms of engineering and development
  effort: our ***code emitter*** takes the TVM-generated
  JSON graph as input and sits on top of the TVM module, while the
  [AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206)
  implements a full-blown code generator integrated with the TVM TIR
  representation. The two approaches may be complementary, as
  the lightweight code emitter allows quickly and unintrusively putting
  a code generator for a new target in place.
- We propose a richer embedded ML API drawn from two well-established and
  robust development frameworks,
  [X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html) and
  [TensorFlow Lite for
  Microcontrollers](https://www.tensorflow.org/lite/microcontrollers). This API
  closely follows current industry trends and should foster wider TVM
  adoption.

The [AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206) is
currently work in
progress. In the meantime, we have developed a working implementation of the
standalone embedded development flow for STM32 microcontrollers.
We propose to
integrate this development into the TVM framework, at least as an intermediate
step until the fully functional
[AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206) is
implemented and we can put an STM32-specific
AoT code generator in place. This will enable:

 - Quick access to STM32 development for the TVM community, boosting
   TVM integration with the STM32 development tools.
 - Quick and early implementation of new code generators for different
   target platforms. We will probably need to develop not one but a number
   of standalone code generators: for example, a sequential executor such
   as the one generated with this RFC will likely not fit a multi-core
   target platform, where operators may need to be wrapped in some sort of
   threading code, or an accelerator-enabled platform, where it may be
   necessary to generate communication and synchronization code. The
   lightweight approach makes such code generators quick and easy to put
   in place.

The memory management issue is not yet fully addressed
within the TVM framework.
Typically, in an embedded environment, the main application requires full
and fine-grained control of memory management.
With the [AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206)
approach, the main application would have limited data placement
possibilities, constrained by the implementation of the runtime memory
manager.
We propose to leave full freedom of memory management
to the main application (no TVM-integrated memory manager). This enables
standard and familiar memory management
techniques, such as linker scripts.
Another existing effort in this direction is the project to estimate the
memory footprint
of the graph from TVMC, outlined in the [µTVM M2
Roadmap](https://discuss.tvm.apache.org/t/tvm-microtvm-m2-roadmap/8821).

Finally, in an embedded application development environment, TVM needs to be
integrated with standard embedded development flows, such as
[STM32CubeMX](https://www.st.com/en/embedded-software/x-cube-ai.html), for
example. Such frameworks typically include a large set of tools
that are outside the scope of TVM (target board HW configuration, etc.).
This issue is considered in the
[Project API](https://discuss.tvm.apache.org/t/rfc-tvm-project-api/9449) RFC,
which proposes to introduce a new ***Project API*** whose main goal is to allow
TVM to drive builds on firmware platforms for the purpose of
AutoTVM. Our proposed
[PR](https://github.com/apache/tvm/pull/7742) implements a number of
building blocks that fit well into the [Project
API](https://discuss.tvm.apache.org/t/rfc-tvm-project-api/9449) framework.

Below, we explain our proposed approach in detail and highlight some
differences from the earlier RFC proposals.

## Standalone Code Generation

The TVM compiler generates three objects:
- The JSON graph of the ML model
- The C library of the kernels (targeted at Arm devices for the STM32
  platforms)
- The params dictionary

In order to enable standalone code generation that better fits current
embedded development practice, we propose the following
approach:

 - Perform the JSON file processing at compile time, instead of at runtime.
   This is achieved by implementing a ***code emitter*** that, given
   a TVM Module, generates a standalone C implementation of the graph 
   processing for a given target platform.
 - Define a ***runtime C API*** that exposes graph processing functions to the
   main application.
 - Leave the ***memory management*** and data placement entirely to the main
   application.

### Code Emitter

We propose to build a ***standalone C implementation*** of ML models from the
TVM Module, instead of processing the JSON
graph at runtime. This implementation is generated by the ***code emitter***
that sits on top of the TVM Module and is implemented in Python.
The ***code emitter*** currently targets STM32 microcontrollers.

The C implementation is exposed to the application via the `ai_model_info`
descriptor of the compiled model:
```
typedef struct {
  const char          * name;
  const char          * datetime;
  const char          * revision;
  const char          * tool_version;
  const char          * api_version;
  uint16_t              n_nodes;
  uint8_t               n_inputs;
  uint8_t               n_outputs;
  uint32_t              activations_size;
  uint32_t              params_size;
  ai_ptr                activations;
  ai_tensor          ** inputs;
  ai_tensor          ** outputs;
  const ai_ptr (*ai_get_params)(void);
  ai_status (*ai_create)(const ai_ptr weights, const ai_ptr activations);
  ai_status (*ai_destroy)();
  ai_status (*ai_run)(ai_tensor *input[], ai_tensor *output[]);
} ai_model_info;
```

The ***code emitter*** generates C code including:
 - Instantiation of all tensors (activations and weights). The tensors'
   `data` fields (the data buffer addresses) remain unassigned until runtime.
 - A small number of interface functions for model deployment and execution

The ***code emitter*** optionally instantiates the built-in '*activations*'
memory pool (see [Memory Management](#memory-management) below).
In this case, `ai_model_info.activations` contains the address of the
built-in pool; otherwise it is NULL.
Model input/output data can also optionally be allocated in this memory
pool, sharing memory with the model activation buffers.

The emitter generates the following interface functions:
```
ai_get_params : returns the runtime memory address of the params
ai_create     : instantiates a model in device memory
ai_destroy    : removes a model instance from the device memory
ai_run        : executes the model graph, calling operators from the kernels lib
```

Our implementation is fairly similar to the one proposed in the
[AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206) with
the following differences:
- Our `ai_model_info` model descriptor contains more information
  compared to the `tvm_model_t` descriptor from the
  [AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206).
  The additional information is drawn from our experience with the
  [X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html) and
  [TensorFlow Lite for
  Microcontrollers](https://www.tensorflow.org/lite/microcontrollers) tools.
- In addition to `operators.c` (the model kernels implementation) and
  `network.c` (the model graph implementation), we also generate
  `network_data.c` containing a table with the model parameters (weights).
  This table is assigned to the '*params*' memory pool (see [Memory
  Management](#memory-management) below) and, at link time, is allocated to an
  application-specified memory region via the linker script, as illustrated
  in the sketch below.
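
As an illustration, the generated `network_data.c` might look roughly like the
minimal sketch below; the symbol and section names here are hypothetical, the
actual generated file is defined by the
[PR](https://github.com/apache/tvm/pull/7742).

```
/* Hypothetical sketch of a generated network_data.c (symbol and
 * section names are illustrative, not the actual emitter output). */
#include <stdint.h>

/* Params (weights) table, assigned to the 'params' memory pool.
 * The application's linker script decides where the ".nn_params"
 * section is placed (e.g. internal FLASH or external memory). */
const uint8_t __attribute__((section(".nn_params"), aligned(8)))
network_params_data[] = {
  0x0a, 0x1b, 0x2c, 0x3d,  /* ... serialized weights ... */
};

const uint32_t network_params_size = sizeof(network_params_data);
```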

A ***code emitter*** for STM32 MCU-based boards has been implemented
and can be seen in the [PR](https://github.com/apache/tvm/pull/7742).
Similar emitters can be quickly created for other platforms, for
example a multi-core parallel platform.

### Memory management

The ML model memory is managed via memory pools. Model activations
are placed into the '*activations*' pool; model params are placed into
the '*params*' pool.
The '*activations*' memory pool can be set up by the main application or
built in with the model at model generation time.
The '*params*' memory pool is set up at model generation time.
Statically set up pools are allocated memory at link time via the
application-specified linker script. The '*activations*' memory pool can
also be dynamically allocated on the heap at runtime by the main
application.

The application manages its memory allocation via several mechanisms:
- The TVM compiler communicates the number of activations and params tensors
  and their buffer assignment via the `storage_id` JSON graph attribute.
- The ***code emitter*** assigns the application data, '*activations*', and
  '*params*' pools to dedicated ELF sections (except for dynamically
  allocated data).
- The linker places the ELF sections based on the application-specified
  linker script (see the sketch after this list).
  An arbitrary target platform memory hierarchy is thus supported: FLASH,
  RAM, external, internal, etc., without TVM having explicit knowledge of it.
- The main application manages any static or dynamic runtime memory
  allocation that may be required.
  For example, it may be necessary for two models to share their
  '*activations*' pools, or for two instances of the same model to have
  separate input and output buffers, etc.
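
To illustrate the linker-script-driven placement mentioned above, a main
application could, for example, provide its '*activations*' pool as a
statically allocated buffer in a dedicated section (a minimal sketch; the
section name and the `MY_ACTIVATIONS_SIZE` constant are assumptions):

```
#include <stdint.h>

/* Application-owned 'activations' pool placed in a dedicated ELF
 * section. The application's linker script maps ".nn_activations"
 * to a chosen RAM region (on-chip SRAM, external RAM, ...) without
 * TVM having any knowledge of the memory map. */
#define MY_ACTIVATIONS_SIZE (64 * 1024)  /* must cover activations_size */

static uint8_t __attribute__((section(".nn_activations"), aligned(8)))
activations_pool[MY_ACTIVATIONS_SIZE];

/* The pool address is later handed to the model instance, e.g.:
 *   ai_create(nn, (ai_ptr)activations_pool, &handle);
 */
```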

### The Runtime C API

In a typical embedded application use case, an ML model is managed under the
control of the main application, more precisely:
 - the model is placed in memory (activations, weights, heap) 
 - the model is given inputs
 - the model is run
 - the outputs are recovered by the main application for further processing

We propose a slim ***runtime API*** for developing the embedded standalone ML
applications drawn from our experience with the 
[X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html) and the 
[TensorFlow Lite for 
Microcontrollers](https://www.tensorflow.org/lite/microcontrollers)
tools.
The objectives are:
- Efficient implementation in terms of performance and minimal memory
  footprint.
- Memory management under the control of the main application.
  For example, any runtime memory allocation can be avoided by statically
  placing data in appropriate memory regions at link time. This enables easy
  experimentation with data placement, and flexibility.
- The possibility to build multi-model applications combining separately
  compiled models. These
  models can optionally share their activation and/or inputs/outputs memory.
- The possibility to include multiple instantiations of the same model in
  a single application.
- The possibility to write a generic main application, with all
  model-specific information available from the model implementation.

Our slim ***runtime API*** provides access to the TVM-generated model
implementation via a small model interface.

First, the `ai_model_info` descriptor is directly visible from the main
application. It holds all information about the model, for example the
number of model inputs and outputs, the associated tensors, their types and
shapes, etc.
Details are available in the [PR](https://github.com/apache/tvm/pull/7742).
Several models can be linked together into a single application, each with
its own model descriptor.

A model descriptor is instantiated into a deployed model instance by calling 
the function:
```
ai_status ai_create (ai_model_info * nn, ai_ptr activations, ai_handle *handle);
```
The function returns a particular instance of the model as an opaque
`handle` that hides the implementation details. During the `ai_create` call,
the `data` fields of the activations and params tensors (their buffer
addresses) are set up.
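
Since several models can be linked into a single application (as noted
above), `ai_create` is simply called once per model descriptor. A minimal
sketch, with hypothetical descriptor names and using the
`AI_MODEL_activations` helper shown in the example further below:

```
/* Two separately compiled models linked into one application
 * (descriptor names are hypothetical; error handling elided). */
extern ai_model_info network_keyword_spotting;
extern ai_model_info network_anomaly_detector;

static ai_handle kws_handle;
static ai_handle ad_handle;

void models_init(void)
{
  ai_create(&network_keyword_spotting,
            AI_MODEL_activations(&network_keyword_spotting), &kws_handle);
  ai_create(&network_anomaly_detector,
            AI_MODEL_activations(&network_anomaly_detector), &ad_handle);
}
```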

The size and memory address of the '*activations*' and '*params*' pools can
be retrieved at runtime with:
```
uint32_t ai_get_activations_size (ai_handle handle);
ai_ptr ai_get_activations (ai_handle handle);
uint32_t ai_get_params_size (ai_handle handle);
const ai_ptr ai_get_params (ai_handle handle);
```
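
For example, the main application might use these getters for a simple sanity
check of its statically placed pools (a hedged sketch; `MY_ACTIVATIONS_SIZE`
is the hypothetical constant from the memory management sketch above):

```
/* Check a created model instance against the application's statically
 * sized activations pool (sketch). */
static int check_pools(ai_handle handle)
{
  if (ai_get_activations_size(handle) > MY_ACTIVATIONS_SIZE)
    return -1;  /* activations pool too small for this model */
  if (ai_get_params(handle) == NULL)
    return -1;  /* params pool not resolved */
  return 0;
}
```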

We propose to extend the `DLTensor` with additional *quantization* information:
```
typedef struct {
  /*!
   * \brief The TVM tensor.
   */
  DLTensor dltensor;
  /*!
   * \brief The quantization info, if quantized
   */
  ai_quantization_info * quant;
} ai_tensor;
```

The *quantization* information is needed by the main
application for processing model inputs and outputs. There may be one
additional use - debugging/monitoring the intermediate activations, but
it is still unclear how useful this can be.
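
To make this concrete, pre-processing an input for a quantized model
typically needs a scale and a zero-point. The sketch below assumes that
`ai_quantization_info` exposes per-tensor `scale` and `zero_point` fields
(the field names are an assumption; the actual layout is defined in the
[PR](https://github.com/apache/tvm/pull/7742)):

```
#include <math.h>
#include <stdint.h>

/* Quantize one float value into an int8 input element (sketch;
 * assumes per-tensor quantization with 'scale' and 'zero_point'). */
static int8_t quantize_input(float value, const ai_quantization_info *quant)
{
  int32_t q = (int32_t)lroundf(value / quant->scale) + quant->zero_point;
  if (q < -128) q = -128;
  if (q > 127)  q = 127;
  return (int8_t)q;
}
```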

The main application can query a model instance for various information,
such as:
```
int32_t ai_get_input_size (ai_handle handle);
int32_t ai_get_output_size (ai_handle handle);
ai_tensor * ai_get_input (ai_handle handle, int32_t index);
ai_tensor * ai_get_output (ai_handle handle, int32_t index);
```
etc.
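
As an example, a generic main application can enumerate the inputs and
inspect their shapes through the underlying `DLTensor` (a minimal sketch;
`get_dltensor` is the accessor used in the example below):

```
#include <stdio.h>

/* Print the shape of every input tensor of a model instance (sketch). */
static void dump_inputs(ai_handle handle)
{
  int32_t n_inputs = ai_get_input_size(handle);
  for (int32_t i = 0; i < n_inputs; i++) {
    ai_tensor *input = ai_get_input(handle, i);
    DLTensor *dl = get_dltensor(input);
    printf("input %d: ndim=%d [", (int)i, dl->ndim);
    for (int k = 0; k < dl->ndim; k++)
      printf(" %lld", (long long)dl->shape[k]);
    printf(" ]\n");
  }
}
```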

The `ai_run` function executes the TVM model graph, mimicking the
GraphRuntime execution:
```
ai_status ai_run (ai_handle handle);
```
For the current STM32 target, this is a simple sequential,
single-processor execution that calls each model kernel one at a time.


All API functions return an `ai_status` value and set the `TVMLastError` in
case of a problem. This can be retrieved by the main application via:
```
const char * ai_get_error (ai_handle handle);
```
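
For instance, a failed inference could be reported as follows (a small
sketch):

```
/* Run one inference and report the error message on failure (sketch). */
ai_status status = ai_run(handle);
if (status != AI_STATUS_OK) {
  printf("inference error: %s\n", ai_get_error(handle));
}
```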

The above ***runtime API*** is more explicit than the one proposed in the
[AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206) RFC,
which is a minimalist runtime C API consisting mainly of two functions:
```
// Helper function to initialize a DLTensor
DLTensor TVMInitializeDLTensor(void *data, DLDataType* dtype, DLContext* ctx, 
int64_t* shape, int64_t num_dim);
 
// Helper function to run the `run_func` within the generated library network.o.
tvm_crt_error_t TVMRuntime_Run(tvm_model_t *model, DLTensor *inputs, int 
num_inputs, DLTensor *outputs, int num_outputs);
```

We make several observations:

1. Full information about the compiled model is not available.

2. Some useful functionalities are missing, for example, the input/output
   quantization information. 

3. The memory allocator manager is not under the main application control.
   In the embedded development flow this is a critical point - the memory 
   management is typically handled by the main application. 

4. The inputs/outputs buffers cannot be shared with the activations
   storage, which can be important for memory footprint reduction for
   small models.

In both RFCs, the model implementation is exposed to the main application
via a slim API layer. However, this RFC's API is richer, giving more
flexibility, in particular for memory management.
Another minor difference is that we do not create or manage model tensors;
they are built in with the model implementation. However, the API provides
the main application with functions for accessing these tensors and managing
their storage.


## Example

```
ai_handle handle;  /* instance of the model */
ai_ptr data_in;    /* reference for the input buffer */
ai_ptr data_out;   /* reference for the output buffer */

int ai_init(void)
{
  ai_status err;

  /* AI associated Configuration */
  ...
  /* discover an AI model from current application */
  ai_model_info *nn = ...
  /* 
   * ai_create calls model-specific create function.
   */
  err = ai_create(nn, AI_MODEL_activations(nn), &handle);
  if (err != AI_STATUS_OK) {
    ...
  }
  /* handle is globally set, if no error */

  /* 
   * Allocate input/output tensors
   */

  /* sanity IO number check */
  if (ai_get_input_size(handle) != 1 ||
      ai_get_output_size(handle) != 1)
    return -1;

  DLTensor *dl_tensor;
  
  ai_tensor *input_tensor = ai_get_input(handle, 0);
  dl_tensor = get_dltensor(input_tensor);
  /* built-in allocated tensor? */
  if (dl_tensor->data == NULL) {
    uint32_t bytes = get_tensor_size (input_tensor);
    dl_tensor->data = (ai_ptr)malloc(bytes);
  }
  data_in = dl_tensor->data;

  ai_tensor *output_tensor = ai_get_output(handle, 0);
  dl_tensor = get_dltensor(output_tensor);
  if (dl_tensor->data == NULL) {
    uint32_t bytes = get_tensor_size (output_tensor);
    dl_tensor->data = (ai_ptr)malloc(bytes);
  }
  data_out = dl_tensor->data;

  return 0;
}

void ai_deinit() {
  ai_status err;

  /* release the allocated resources (if necessary) */
  ...
  /* deallocate the model instance */
  err = ai_destroy(handle);
  if (err != AI_STATUS_OK) {
    ...
  }
}

int main(void)
{
  ai_status err;

  /* MCU Configuration */
  ...
  /* Model Init */
  ai_init();

  /* Main process loop */
  while (cond) {
    /* 1 - Acquire, pre-process and fill the input buffers */
    acquire_and_pre_process_data(data_in);

    /* 2 - Call inference engine */
    err = ai_run(handle);
    if (err != AI_STATUS_OK) {
      ...
    }
    /* 3 - Post-process the predictions */
    post_process(data_out);
  }

  ai_deinit();

  return 0;
}
```

## Relation to the Project API RFC

This RFC has two components:
- The STM32 code emitter and its associated runtime support described above
- The STM32 demo application

The first component, the STM32 code emitter and its runtime, belongs to the
compiler system (TVM) rather than to a separate standalone project.
The code emitter takes a TVM Module and
generates a C implementation of the model graph. It is tightly coupled to
the TVM code base.
The code emitter also depends on particular runtime support, similarly to
the way a C compiler such as GCC relies on its runtime libraries.
Ideally, the objective here would be a generic runtime API
that fits different target platforms and deployment scenarios, while the
implementation would be target-specific (similar to the GraphRuntime).
However, 
we can imagine a variety of deployment scenarios and
execution models, which may require different runtime APIs.
This point is still to be clarified.

The second component, the STM32 demo application, fits well with the
[Project API](https://discuss.tvm.apache.org/t/rfc-tvm-project-api/9449)
proposal, roughly following the 'Standalone Demo Project Generator' flow.
It may be considered as implementing two of the
[Project API](https://discuss.tvm.apache.org/t/rfc-tvm-project-api/9449)
building blocks:
- A project template
- A transport layer

The demo application can be eventually integrated with the
[Project API](https://discuss.tvm.apache.org/t/rfc-tvm-project-api/9449),
as well as within the upcoming AutoTuning infrastructure.

## Conclusion

In this RFC we outlined a proposal for the standalone code generation for
ML models in embedded and bare-metal development environments. A
[PR](https://github.com/apache/tvm/pull/7742) targeting
the STM32 microcontrollers is also available.
The proposal falls in the line of developments already underway in the TVM
community:

- AoT code generation: We propose a complementary, more lightweight
  approach. C code for the model is generated, enabling a standard
  embedded development flow. We expose more model information to the
  main application compared to the
  [AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206).
  Our lightweight approach can be used to quickly develop standalone
  code generators for new targets.

- Embedded Runtime C API: We propose a richer application API 
  compared to the 
[AoT](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206), based on 
our experience with an industrial embedded development 
  environment.

- Project Integration: We propose an STM32 demo application that has been
  tested on a number of ML models with the STM32 Discovery ARM-based
  development board.
  We propose to contribute several building blocks that can be integrated
  with the [Project
  API](https://discuss.tvm.apache.org/t/rfc-tvm-project-api/9449) framework.

Please share your thoughts/feedback!




