Yeah, I can see the difficulty you mentioned, and nvcc may well be unavailable at runtime if the model is deployed to an edge device.
A combined approach would leverage the third BYOC option: custom codegen/runtime. Specifically, we still generate the C/CUDA kernels and compile them with NVCC at compile time, but instead of using the C source module you're currently using, we treat the generated/compiled kernels as "graphs". Meanwhile, we serialize the constants to a JSON file, so our artifacts are the compiled kernels (in binary) plus the constants (in JSON). This is similar to the Xilinx Vitis-AI and Arm Ethos-N backends, which generate a binary/bitstream in their desired format and use their own runtime for execution.

In addition, we build a runtime engine that loads the compiled kernels and deserializes the constants. The runtime can stay lightweight and should be easy to implement, because all it needs to do is invoke the corresponding kernel by its symbol and feed in the right data entries. We don't need a JSON interpreter that traverses a JSON subgraph and builds the engine the way the TensorRT backend does.

btw, I'm also curious how @Laurawly deals with the specialized weight layout with the C codegen.
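To make the runtime-engine idea concrete, here's a rough sketch of what loading the kernels and constants could look like. Everything here is assumed for illustration, not TVM's actual BYOC API: the class/symbol names, the flat `float**` kernel signature, and the use of nlohmann/json for deserialization.

```cpp
// Hypothetical lightweight runtime engine for nvcc-precompiled kernels.
// Assumes the kernels were compiled into a shared library exposing C
// symbols like `extern "C" void subgraph_0(float** args)`, and that
// constants were serialized as {"const_name": [1.0, 2.0, ...], ...}.
#include <dlfcn.h>

#include <fstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

#include <nlohmann/json.hpp>  // assumption: nlohmann/json for the constants file

using KernelFn = void (*)(float** args);  // assumed kernel signature

class SubgraphEngine {
 public:
  SubgraphEngine(const std::string& lib_path, const std::string& const_path) {
    // Load the nvcc-compiled kernel binary once at init time.
    lib_ = dlopen(lib_path.c_str(), RTLD_NOW | RTLD_LOCAL);
    if (!lib_) throw std::runtime_error(dlerror());
    // Deserialize the constants from JSON into host buffers.
    std::ifstream f(const_path);
    nlohmann::json j = nlohmann::json::parse(f);
    for (auto& [name, values] : j.items()) {
      constants_[name] = values.get<std::vector<float>>();
    }
  }

  ~SubgraphEngine() {
    if (lib_) dlclose(lib_);
  }

  // Invoke a compiled kernel by its symbol, feeding the caller's
  // input/output buffers plus the deserialized constants it expects.
  void Run(const std::string& symbol, std::vector<float*> args,
           const std::vector<std::string>& const_names) {
    auto fn = reinterpret_cast<KernelFn>(dlsym(lib_, symbol.c_str()));
    if (!fn) throw std::runtime_error(dlerror());
    for (const auto& name : const_names) {
      args.push_back(constants_.at(name).data());
    }
    fn(args.data());
  }

 private:
  void* lib_ = nullptr;
  std::unordered_map<std::string, std::vector<float>> constants_;
};
```

The point is that dispatch is just a symbol lookup plus pointer plumbing, so the engine never needs to interpret a graph structure at runtime.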