Thanks @giuseros. Let me discuss the difference between the following two 
type-erased interfaces.


## X0: Function with typeid 
```c
typedef int (*TVMBackendPackedCFunc)(TVMValue* args, int* type_codes, int num_args,
                                     TVMValue* out_ret_value, int* out_ret_tcode,
                                     void* resource_handle);
``` 

## X1: Function without typeid

```c
typedef int (*TVMBackendCFunc)(void** inputs, void** outputs, void* resource_handle);
```

## Discussions

The main reason we chose X0 over X1 is that X0 gives a safe interface for both 
static and dynamic languages. Imagine a case where the caller passes in an 
integer but the callee expects a float. X1 provides no mechanism to detect such 
a mismatch at runtime, while X0 lets us type check the arguments (for example, 
when debug is enabled).
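To make the mismatch case concrete, here is a minimal sketch of the two callee styles. `DemoValue` and the `DEMO_TCODE_*` constants are simplified stand-ins invented for this example, not the real TVM definitions, and `scale_x0` / `scale_x1` are hypothetical operators.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for illustration only; not the real TVM types. */
typedef union {
  int64_t v_int64;
  double v_float64;
  void* v_handle;
} DemoValue;

enum { DEMO_TCODE_INT = 0, DEMO_TCODE_FLOAT = 2 };

/* X0-style callee: expects exactly one float argument and can verify it. */
int scale_x0(DemoValue* args, int* type_codes, int num_args,
             DemoValue* out_ret_value, int* out_ret_tcode,
             void* resource_handle) {
  (void)resource_handle;
  if (num_args != 1 || type_codes[0] != DEMO_TCODE_FLOAT) {
    return -1; /* mismatch detected at runtime, report an error */
  }
  out_ret_value->v_float64 = args[0].v_float64 * 2.0;
  *out_ret_tcode = DEMO_TCODE_FLOAT;
  return 0;
}

/* X1-style callee: nothing tells it what inputs[0] actually points to,
 * so it has to trust the caller. */
int scale_x1(void** inputs, void** outputs, void* resource_handle) {
  (void)resource_handle;
  *(double*)outputs[0] = *(const double*)inputs[0] * 2.0;
  return 0;
}
```

If a caller hands `scale_x0` an argument tagged `DEMO_TCODE_INT`, the call returns `-1` instead of silently reinterpreting the bits. With `scale_x1` the same mistake would go unnoticed.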

Making a function call in the X1 convention also requires stack allocations 
(for the arrays of inputs and outputs). Without considering any compiler 
optimization, let us count the bytes needed on a 32-bit system for a function 
call with `n` arguments and one output. A call in the form of X0 costs 
`8 * n + 4 * n + 4 + 8 + 4 + 4 = 12 * n + 20` bytes of space, while a call in 
the form of X1 costs `4 * n + 4 * n + 4 = 8 * n + 4` bytes of space. Say n=3 
(a typical number): X0 then costs `56 bytes`, while X1 costs `28 bytes`.
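The term-by-term accounting can be written as a small sketch. The sizes are hard-coded for a 32-bit target so the result does not depend on the host. The labels attached to each term are my reading of the expressions above, and the function names are made up for this example.

```c
#include <stddef.h>

/* Sizes assumed for a 32-bit target (hard-coded on purpose). */
enum {
  VALUE_BYTES = 8, /* one TVMValue slot */
  INT_BYTES = 4,   /* one int, e.g. a type code or num_args */
  PTR_BYTES = 4    /* one pointer */
};

/* X0: n values + n type codes + num_args + return value
 * + return type code + resource handle. */
size_t x0_call_bytes(size_t n) {
  return VALUE_BYTES * n + INT_BYTES * n + INT_BYTES + VALUE_BYTES +
         INT_BYTES + PTR_BYTES;
}

/* X1, following the terms as written in the post: two pointer arrays
 * of n entries each, plus the resource handle. */
size_t x1_call_bytes(size_t n) {
  return PTR_BYTES * n + PTR_BYTES * n + PTR_BYTES;
}
```

Both functions grow linearly in `n`, so the gap between the two conventions stays within a small constant factor for any realistic argument count.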

The memory overhead of the function call is negligible compared with the 
follow-up memory operations on NDArrays (which normally contain KBs or more of 
memory).

Additionally, this assumes no compiler optimization. Let us think about what 
happens when the compiler inlines the call. In that case the function call 
becomes loads and stores to the stack-allocated argument arrays.

With a typical `mem2reg` pass, those stack slots can be promoted to registers. 
If the callee code (operator) is compiled to not read the typeid in release 
mode, then the assignment to the typeid becomes dead code and is eliminated by 
the compiler. Similarly, the argument passing can become direct argument 
passing. With these compiler optimizations, both X0 and X1 lead to code that 
performs similarly to the final direct call form.
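As a rough illustration of the inlining argument, the sketch below packs two integers in the X0 style and calls a callee that never reads the type codes. `SketchValue`, `SKETCH_TCODE_INT`, and the function names are hypothetical, and whether a given compiler actually reduces `add_via_packed` to the same code as `add_direct` depends on the compiler and optimization level.

```c
#include <stdint.h>

/* Simplified stand-in for illustration only; not the real TVM type. */
typedef union {
  int64_t v_int64;
  double v_float64;
  void* v_handle;
} SketchValue;

enum { SKETCH_TCODE_INT = 0 };

/* Callee built in "release mode": it never reads type_codes. */
static inline int add_packed(SketchValue* args, int* type_codes, int num_args,
                             SketchValue* out_ret_value, int* out_ret_tcode,
                             void* resource_handle) {
  (void)type_codes;
  (void)num_args;
  (void)resource_handle;
  out_ret_value->v_int64 = args[0].v_int64 + args[1].v_int64;
  *out_ret_tcode = SKETCH_TCODE_INT;
  return 0;
}

/* Caller: packs the arguments into stack arrays before the call. Once
 * add_packed is inlined, the stores to type_codes are never read, so
 * they are candidates for dead-store elimination. */
int64_t add_via_packed(int64_t a, int64_t b) {
  SketchValue args[2];
  int type_codes[2] = {SKETCH_TCODE_INT, SKETCH_TCODE_INT};
  SketchValue ret;
  int ret_tcode;
  args[0].v_int64 = a;
  args[1].v_int64 = b;
  add_packed(args, type_codes, 2, &ret, &ret_tcode, 0);
  return ret.v_int64;
}

/* The direct call form that the packed version is expected to approach. */
int64_t add_direct(int64_t a, int64_t b) { return a + b; }
```

Comparing the generated assembly of `add_via_packed` and `add_direct` at a high optimization level is one way to check how much of the packing survives on a particular toolchain.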

Back to the topic of int64: note that most of our operator calls only use 
`void*` arguments and not int64. The cost of `int64` is mainly the memory 
overhead of passing the argument rather than an ALU concern. Both the caller 
and the callee are free to assign int32 fields when passing and to convert to 
int32 after the passing. Again, given the compiler optimizations above, this 
could turn out to be a no-op.
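A small sketch of that widen-then-narrow pattern, assuming a 64-bit argument slot. `ArgSlot` and the helper names are invented for this example.

```c
#include <stdint.h>

/* Hypothetical 64-bit argument slot, for illustration only. */
typedef union {
  int64_t v_int64;
  void* v_handle;
} ArgSlot;

/* Caller side: widen an int32 into the 64-bit slot when passing. */
void pass_int32(ArgSlot* slot, int32_t x) { slot->v_int64 = (int64_t)x; }

/* Callee side: narrow back to int32 right after the passing. */
int32_t read_int32(const ArgSlot* slot) { return (int32_t)slot->v_int64; }

/* All arithmetic in the callee then happens on int32 values. */
int32_t times_three(const ArgSlot* slot) {
  int32_t x = read_int32(slot);
  return x * 3;
}
```

Only the slot itself is 64 bits wide. The computation in `times_three` stays in 32-bit arithmetic, which is why the cost is about argument storage rather than the ALU.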

Even in the absence of compiler optimizations, the general overhead incurred by 
X0 is not much larger than that of X1, and it would be great to run some 
benchmarks on workloads of interest to see the difference.





---
[Visit Topic](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206/30) 
to respond.
