Yeah, I can see the difficulty you mentioned; nvcc may well be unavailable at 
runtime if the model is deployed to an edge device.

A combined approach would leverage the third BYOC option: custom 
codegen/runtime. Specifically, we still generate the C/CUDA kernels and compile 
them with NVCC at compile time, but instead of using the C source module you're 
currently using, we treat the generated/compiled kernels as "graphs". 
Meanwhile, we serialize the constants to a JSON file, so our artifacts are the 
compiled kernels (in binary) and the constants (in JSON). This is similar to 
the Xilinx Vitis-AI and Arm Ethos-N backends, which generate a 
binary/bit-stream in the desired format and use their own runtime for 
execution.
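
To make the artifact format concrete, here is a minimal C++ sketch of the 
compile-time constant dump. It assumes fp32 constants for brevity; the 
`ConstantEntry` struct and `dump_constants_json` helper are made up for 
illustration, and a real implementation would likely use a JSON library and 
handle more dtypes:

```cpp
// Sketch only: dump named constant tensors to a JSON file at compile time.
// ConstantEntry and dump_constants_json are hypothetical names.
#include <cstdio>
#include <string>
#include <vector>

struct ConstantEntry {
  std::string name;        // name the runtime will look up, e.g. "conv0_weight"
  std::vector<int> shape;  // tensor shape
  std::vector<float> data; // flattened values (fp32 assumed for brevity)
};

void dump_constants_json(const std::vector<ConstantEntry>& consts,
                         const std::string& path) {
  FILE* fp = std::fopen(path.c_str(), "w");
  if (!fp) return;
  std::fprintf(fp, "{\n");
  for (size_t i = 0; i < consts.size(); ++i) {
    const ConstantEntry& c = consts[i];
    std::fprintf(fp, "  \"%s\": {\"shape\": [", c.name.c_str());
    for (size_t j = 0; j < c.shape.size(); ++j)
      std::fprintf(fp, "%s%d", j ? ", " : "", c.shape[j]);
    std::fprintf(fp, "], \"data\": [");
    for (size_t j = 0; j < c.data.size(); ++j)
      std::fprintf(fp, "%s%g", j ? ", " : "", c.data[j]);
    std::fprintf(fp, "]}%s\n", i + 1 < consts.size() ? "," : "");
  }
  std::fprintf(fp, "}\n");
  std::fclose(fp);
}
```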

In addition, we implement a runtime engine that loads the compiled kernels and 
deserializes the constants. This way the runtime can stay lightweight and 
should be easy to implement, because all it needs to do is invoke the 
corresponding kernel by its symbol and feed it the right data entries. Unlike 
TensorRT, we don't need a JSON interpreter that traverses a JSON subgraph and 
builds an engine.
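
For instance, the engine could be little more than `dlopen`/`dlsym` plus the 
constant loader. A minimal sketch, assuming the codegen emits one C symbol per 
subgraph with a flat pointer-based signature; the symbol name 
`byoc_subgraph_0`, the library name, and the signature are all hypothetical:

```cpp
// Sketch only: resolve a compiled kernel by symbol and call it with the
// deserialized constant plus user input. Build with: g++ engine.cc -ldl
#include <dlfcn.h>
#include <cstdio>

// Assumed calling convention emitted by our codegen: every subgraph kernel
// takes (input, constant, output) raw pointers.
using KernelFn = void (*)(const float* input, const float* weight, float* output);

int main() {
  void* lib = dlopen("./compiled_kernels.so", RTLD_NOW);
  if (!lib) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

  // Resolve the kernel by the symbol recorded at compile time.
  auto kernel = reinterpret_cast<KernelFn>(dlsym(lib, "byoc_subgraph_0"));
  if (!kernel) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

  // In a real engine, `weight` comes from deserializing the constants JSON
  // and `input` from the caller; fixed-size buffers keep the sketch short.
  float input[16] = {0}, weight[16] = {0}, output[16];
  kernel(input, weight, output);

  dlclose(lib);
  return 0;
}
```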

btw, I'm also curious how @Laurawly deals with the specialized weight layout 
with the C codegen.
