[quote="Mousius, post:1, topic:9849"] * this all as the default AOT behaviour for now rather than providing a compiler flag * Maintain current packed function signature and instead just change the unwrapping from DLTensor to pointers - this is problematic as to which level the user is informed of an error, with a changed signature you’d get a link error rather than a segfault if you tried to use this for dynamic linking [/quote]
Thanks @Mousius for bringing this up. I think it would still worth to think a bit more to formalize the wants. In particular, there are a few parts of items that are worth considering: ## Formalize the Transformation of Buffer to the Pointer(Change of Fn Signature) The particular transformation we are looking for is actually to transform a function with buffer map by directly passing in its data pointer. Note that most of TVM's lowering transformations preserves Consider the constructing code ```python import tvm import tvm.script from tvm import te def unpacked_example(): A = te.placeholder((4 , 5), name="A") B = te.compute(A.shape, lambda *i: A(*i) + 1, name="B") s = te.create_schedule(B.op) mod = tvm.lower(s, [A, B]) print(tvm.script.asscript(mod)) unpacked_example() ``` This will give us the following script ```python @tvm.script.tir class Module: def main(A: ty.handle, B: ty.handle) -> None: # function attr dict tir.func_attr({"global_symbol": "main", "tir.noalias": True}) A_1 = tir.match_buffer(A, [4, 5], elem_offset=0, align=128, offset_factor=1) B_1 = tir.match_buffer(B, [4, 5], elem_offset=0, align=128, offset_factor=1) # body for i0, i1 in tir.grid(4, 5): B_1.data[((i0*5) + i1)] = (tir.load("float32", A_1.data, ((i0*5) + i1)) + tir.float32(1)) ``` >From the data structure's PoV, the above code only refers to the pointer >`B_1.data` and `A_1.data`. We can create a pass (say replace signature) that >tries to changes the signature of the function from list of buffers(requires DLTensor*) to just the data pointers themselves. This transformation can hold as long as all the referenced variables are covered, and the desired code is lile ```python @tvm.script.tir class Module: def main(Adata: ty.handle, Bdata: ty.handle) -> None: # body for i0, i1 in tir.grid(4, 5): tir.store(Bdata, tir.load("float32", Adata, ((i0*5) + i1)) + tir.float32(1)), ((i0*5) + i1)) ``` Note then the function can be directly passed to the code generator, which generates the function with signature ```c int main_func(void* Adata, void* Bdata); ``` The main point is that we do not have to try to twist the MakePackedAPI to generate another kind of type erased API here. If what we want is the normal C calling convention that passes in the field separately, we should add this pass after lowering to change the expected function signature, then the followup calls would follow naturally (via the normal C function calling convention). The MakePackedAPI as it is is supposed to preserve the function signature(of taking buffers) regardless of unpacked choices. So transforming the signature should go to another pass. ### The Choice of Type-Erased API My main concern about the current proposal is the introduction of another type erased interface, namely ``` typedef int32_t(tvm_function_t)(void** inputs, void** outputs, void* resource_handle);` ``` Given most of the internals can already be readibly handled by the raw C typed version. The only usage of the type-erased function is when it comes to interfaces. In that case, I still beleive that PackedC function is the right choice, as it brings the benefit of standarization and consistency with the overall TVM ecosystem. ### Impact of Compiler Optimizations Finally, it is still useful to think about compiler optimizations and how can they impact the choices in the table. Modern compilers can do a lot of things, making it possibly to get as optimized code as long as we can inline the function correctly. Let us consider an example code below ``` #include <cstdio> #include <tvm/runtime/c_runtime_api.h> inline int PackedCFunc(void* args, int* type_codes, int num_args, void* out_ret_value, int* out_ret_tcode, void* resource_handle) { DLTensor* dlx = (DLTensor*)(((TVMValue*)args)[0].v_handle); DLTensor* dly = (DLTensor*)(((TVMValue*)args)[1].v_handle); // error check that can be dead-code eliminated if (type_codes[0] != kTVMDLTensorHandle) { return -1; } if (type_codes[1] != kTVMDLTensorHandle) { return -1; } if (dlx->shape[0] != 3) { return -1; } if (dlx->shape[1] != 3) { return -1; } if (dly->shape[0] != 3) { return -1; } if (dly->shape[1] != 3) { return -1; } if (dly->dtype.code != kDLFloat) { return -1; } ((float*)dlx->data)[0] = ((float*)dly->data)[0] + 1; return 0; } // return y[i] = x[i] +1 extern "C" int AddViaPackedCFunc(float *x, float* y) { TVMValue args[2]; int type_codes[2]; TVMValue out_ret_value; int out_ret_tcode; int64_t shape[2] = {3, 3}; DLTensor dlx, dly; dlx.data = x; dlx.ndim = 2; dlx.shape = shape; dlx.dtype.code = kDLFloat; dlx.dtype.bits = 32; dlx.dtype.lanes = 1; dlx.device.device_type = kDLCPU; dlx.device.device_id = 0; dlx.strides = nullptr; dlx.byte_offset = 0; dly = dlx; dly.data = y; args[0].v_handle = &dlx; args[1].v_handle = &dly; type_codes[0] = kTVMDLTensorHandle; type_codes[1] = kTVMDLTensorHandle; // note: check can be dead-code eliminated if (PackedCFunc(args, type_codes, 2, &out_ret_value, &out_ret_tcode, nullptr) != 0) { printf("error\n"); } return 0; } ``` Run clang ```bash clang-10 -O2 -emit-llvm -S -I ../../tvm/3rdparty/dlpack/include -I ../../tvm/include -o test.ll test.cc ``` The result is ```ll ; Function Attrs: nounwind uwtable define dso_local i32 @AddViaPackedCFunc(float* %0, float* %1) local_unnamed_addr #0 { %3 = load float, float* %1, align 4, !tbaa !2 %4 = fadd float %3, 1.000000e+00 store float %4, float* %0, align 4, !tbaa !2 ret i32 0 } ``` Run gcc ``` gcc -O2 -S -I ../../tvm/3rdparty/dlpack/include -I ../../tvm/include -o test.s test.cc ``` Gives the following asm code ``` .file "test.cc" .text .p2align 4,,15 .globl AddViaPackedCFunc .type AddViaPackedCFunc, @function AddViaPackedCFunc: .LFB31: .cfi_startproc movss .LC0(%rip), %xmm0 xorl %eax, %eax addss (%rsi), %xmm0 movss %xmm0, (%rdi) ret .cfi_endproc .LFE31: .size AddViaPackedCFunc, .-AddViaPackedCFunc .section .rodata.cst4,"aM",@progbits,4 .align 4 .LC0: .long 1065353216 .ident "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0" .section .note.GNU-stack,"",@progbits ``` As we can see that even with the same PackedFunc API, as long as we can do proper inlining, allocating DLTensor and other items on stack, the resulting function call can be reduced to the same function as the minimum non-packed version. ### Discussions Considering the importance of a minimum internal, I agree that we could explore an un-packed interface(essentially generating something that is related to C). We should do that in a proper way, by introducing a function signature transformation utility that transforms the function signature from the original DLTensor* to the destructed fields. However, we should also note that generating the DLTensor on stack and setting up constant correctly might also bring similar effect in a modern compiler. When it comes to type-erased interface, assuming we only need them at the interface level(not the internals). I think it is useful to keep the CPackedFunc convention, so that we still retain the benefit of additional wraping to expose to the externals and standardization. Again in this case carefully allocating the DLTensor on stack then pass it in plus strong inlining/constant folding could remove the overhead of DLTensor even at the interface level. --- [Visit Topic](https://discuss.tvm.apache.org/t/rfc-utvm-aot-optimisations-for-embedded-targets/9849/2) to respond. You are receiving this because you enabled mailing list mode. To unsubscribe from these emails, [click here](https://discuss.tvm.apache.org/email/unsubscribe/86c400f3873babb8f3c4c42b921d491a97ce8acfe63247967f110a28f7d71f82).