[Apache TVM Discuss] [Development/RFC] [RFC] [uTVM] AOT optimisations for Embedded Targets

tqchen via Apache TVM Discuss Thu, 29 Apr 2021 14:12:10 -0700


[quote="Mousius, post:1, topic:9849"]
* this all as the default AOT behaviour for now rather than providing a 
compiler flag
* Maintain current packed function signature and instead just change the 
unwrapping from DLTensor to pointers - this is problematic as to which level 
the user is informed of an error, with a changed signature you’d get a link 
error rather than a segfault if you tried to use this for dynamic linking
[/quote]


Thanks @Mousius for bringing this up. I think it would still worth to think a 
bit more to formalize the wants. In particular, there are a few parts of items 
that are worth considering:

## Formalize the Transformation of Buffer to the Pointer(Change of Fn Signature)

The particular transformation we are looking for is actually to transform a 
function with buffer map by directly passing in its data pointer. Note that 
most of TVM's lowering transformations preserves

Consider the constructing code 
```python
import tvm                                                                      
                                                                                
                                            
import tvm.script                                                               
                                                                                
                                            
from tvm import te                                                              
                                                                                
                                            
                                                                                
                                                                                
                                            
def unpacked_example():                                                         
                                                                                
                                            
    A = te.placeholder((4 , 5), name="A")                                       
                                                                                
                                            
    B = te.compute(A.shape, lambda *i: A(*i) + 1, name="B")                     
                                                                                
                                            
    s = te.create_schedule(B.op)                                                
                                                                                
                                                                                
                                                                                
                                                                  
    mod = tvm.lower(s, [A, B])                                                  
                                                                                
                                            
    print(tvm.script.asscript(mod))                                             
                                                                                
                                            
                                                                                
                                                                                
                 
unpacked_example()                                                              
                                                                                
                                            
```

This will give us the following script
```python
@tvm.script.tir
class Module:
    def main(A: ty.handle, B: ty.handle) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        A_1 = tir.match_buffer(A, [4, 5], elem_offset=0, align=128, 
offset_factor=1)
        B_1 = tir.match_buffer(B, [4, 5], elem_offset=0, align=128, 
offset_factor=1)
        # body
        for i0, i1 in tir.grid(4, 5):
            B_1.data[((i0*5) + i1)] = (tir.load("float32", A_1.data, ((i0*5) + 
i1)) + tir.float32(1))
```
>From the data structure's PoV, the above code only refers to the pointer 
>`B_1.data` and `A_1.data`. We can create a pass (say replace signature) that 
>tries to changes the signature of the function from list
of buffers(requires DLTensor*) to just the data pointers themselves. This 
transformation
can hold as long as all the referenced variables are covered, and the desired 
code is lile

```python
@tvm.script.tir
class Module:
    def main(Adata: ty.handle, Bdata: ty.handle) -> None:
        # body
        for i0, i1 in tir.grid(4, 5):
            tir.store(Bdata, tir.load("float32", Adata, ((i0*5) + i1)) + 
tir.float32(1)), ((i0*5) + i1))
```
Note then the function can be directly passed to the code generator, which 
generates the
function with signature
```c
int main_func(void* Adata, void* Bdata);
```

The main point is that we do not have to try to twist the MakePackedAPI to 
generate another kind of type erased API here. If what we want is the normal C 
calling convention that passes in the field separately, we should add this pass 
after lowering to change the expected function signature, then the followup 
calls would follow naturally (via the normal C function calling convention).

The MakePackedAPI as it is is supposed to preserve the function signature(of 
taking buffers) regardless of unpacked choices. So transforming the signature 
should go to another pass.

### The Choice of Type-Erased API
My main concern about the current proposal is the introduction of another type 
erased interface, namely
 
```
typedef int32_t(tvm_function_t)(void** inputs, void** outputs, void* 
resource_handle);`
```

Given most of the internals can already be readibly handled  by the raw C typed 
version. The only usage of the type-erased function is when it comes to 
interfaces. In that case, I still beleive that PackedC function is the right 
choice, as it brings the benefit of standarization and consistency with the 
overall TVM ecosystem.

### Impact of Compiler Optimizations

Finally, it is still useful to think about compiler optimizations and how can 
they impact the choices in the table. Modern compilers can do a lot of things, 
making it possibly to get as optimized code as long as we can inline the 
function correctly.

Let us consider an example code below
```
#include <cstdio>
#include <tvm/runtime/c_runtime_api.h>

inline int PackedCFunc(void* args, int* type_codes, int num_args,
                       void* out_ret_value, int* out_ret_tcode,
                       void* resource_handle) {
  DLTensor* dlx = (DLTensor*)(((TVMValue*)args)[0].v_handle);
  DLTensor* dly = (DLTensor*)(((TVMValue*)args)[1].v_handle);
  // error check that can be dead-code eliminated
  if (type_codes[0] != kTVMDLTensorHandle) {
    return -1;
  }
  if (type_codes[1] != kTVMDLTensorHandle) {
    return -1;
  }
  if (dlx->shape[0] != 3) {
    return -1;
  }
  if (dlx->shape[1] != 3) {
    return -1;
  }

  if (dly->shape[0] != 3) {
    return -1;
  }
  if (dly->shape[1] != 3) {
    return -1;
  }
  if (dly->dtype.code != kDLFloat) {
    return -1;
  }

  ((float*)dlx->data)[0] =   ((float*)dly->data)[0] + 1;
  return 0;
}

// return y[i] = x[i]  +1
extern "C" int AddViaPackedCFunc(float *x, float* y) {
  TVMValue args[2];
  int type_codes[2];
  TVMValue out_ret_value;
  int out_ret_tcode;
  int64_t shape[2] = {3, 3};

  DLTensor dlx, dly;
  dlx.data = x;
  dlx.ndim = 2;
  dlx.shape = shape;
  dlx.dtype.code = kDLFloat;
  dlx.dtype.bits = 32;
  dlx.dtype.lanes = 1;
  dlx.device.device_type = kDLCPU;
  dlx.device.device_id = 0;
  dlx.strides = nullptr;
  dlx.byte_offset = 0;
  dly = dlx;
  dly.data = y;

  args[0].v_handle = &dlx;
  args[1].v_handle = &dly;

  type_codes[0] = kTVMDLTensorHandle;
  type_codes[1] = kTVMDLTensorHandle;

  // note: check can be dead-code eliminated
  if (PackedCFunc(args, type_codes, 2, &out_ret_value, &out_ret_tcode, nullptr) 
!= 0) {
    printf("error\n");
  }     
  return 0;
}
```
Run clang
```bash
clang-10 -O2 -emit-llvm -S  -I ../../tvm/3rdparty/dlpack/include -I 
../../tvm/include -o test.ll test.cc
```
The result is
```ll
; Function Attrs: nounwind uwtable
define dso_local i32 @AddViaPackedCFunc(float* %0, float* %1) 
local_unnamed_addr #0 {
  %3 = load float, float* %1, align 4, !tbaa !2
  %4 = fadd float %3, 1.000000e+00
  store float %4, float* %0, align 4, !tbaa !2
  ret i32 0
}
```
Run gcc
```
gcc -O2 -S  -I ../../tvm/3rdparty/dlpack/include -I ../../tvm/include -o test.s 
test.cc
```
Gives the following asm code
```
        .file   "test.cc"
        .text
        .p2align 4,,15
        .globl  AddViaPackedCFunc
        .type   AddViaPackedCFunc, @function
AddViaPackedCFunc:
.LFB31:
        .cfi_startproc
        movss   .LC0(%rip), %xmm0
        xorl    %eax, %eax
        addss   (%rsi), %xmm0
        movss   %xmm0, (%rdi)
        ret
        .cfi_endproc
.LFE31:
        .size   AddViaPackedCFunc, .-AddViaPackedCFunc
        .section        .rodata.cst4,"aM",@progbits,4
        .align 4
.LC0:
        .long   1065353216
        .ident  "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
        .section        .note.GNU-stack,"",@progbits
```

As we can see that even with the same PackedFunc API, as long as we can do 
proper inlining, allocating DLTensor and other items on stack, the resulting 
function call can be reduced to the same function as the minimum non-packed 
version. 

### Discussions

Considering the importance of a minimum internal, I agree that we could explore 
an un-packed interface(essentially generating something that is related to C). 
We should do that in a proper way, by introducing a function signature 
transformation utility that transforms the function signature from the original 
DLTensor* to the destructed fields.

However, we should also note that generating the DLTensor on stack and setting 
up constant correctly might also bring similar effect in a modern compiler.

When it comes to type-erased interface, assuming we only need them at the 
interface level(not the internals). I think it is useful to keep the 
CPackedFunc convention, so that we still retain the benefit of additional 
wraping to expose to the externals and standardization. Again in this case 
carefully allocating the DLTensor on stack then pass it in plus strong 
inlining/constant folding could remove the overhead of DLTensor even at the 
interface level.





---
[Visit 
Topic](https://discuss.tvm.apache.org/t/rfc-utvm-aot-optimisations-for-embedded-targets/9849/2)
 to respond.

You are receiving this because you enabled mailing list mode.

To unsubscribe from these emails, [click 
here](https://discuss.tvm.apache.org/email/unsubscribe/86c400f3873babb8f3c4c42b921d491a97ce8acfe63247967f110a28f7d71f82).

[Apache TVM Discuss] [Development/RFC] [RFC] [uTVM] AOT optimisations for Embedded Targets

Reply via email to