I would still recommend using the PackedFunc interface. Using the raw kernel directly requires quite a few extra steps: for example, the launch parameter calculation is part of the host code, as is the data unpacking.
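To make those two host-side steps concrete, here is a rough sketch in plain Python. It is not TVM's generated host code: `Tensor`, `launch_config_1d`, `unpack_args`, and `host_wrapper` are hypothetical names made up for illustration, and the 1-D launch with 256 threads per block is an assumption. In a real build, the thread extents and the raw argument order both depend on the schedule.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Sequence, Tuple


@dataclass
class Tensor:
    """Hypothetical stand-in for a packed tensor argument (not TVM's DLTensor)."""
    data_ptr: int            # device address of the buffer
    shape: Tuple[int, ...]   # logical shape of the buffer


@dataclass
class LaunchConfig:
    grid: Tuple[int, int, int]   # number of blocks per dimension
    block: Tuple[int, int, int]  # number of threads per block per dimension


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def launch_config_1d(num_elements: int, threads_per_block: int = 256) -> LaunchConfig:
    """Compute a 1-D launch configuration that covers every element.

    Assumption: one thread per element; a real schedule may bind threads
    very differently.
    """
    blocks = ceil_div(num_elements, threads_per_block)
    return LaunchConfig(grid=(blocks, 1, 1), block=(threads_per_block, 1, 1))


def unpack_args(tensors: Sequence[Tensor]) -> List[int]:
    """Turn packed tensor arguments into the flat pointer list a raw kernel takes."""
    return [t.data_ptr for t in tensors]


def host_wrapper(tensors: Sequence[Tensor]) -> Tuple[LaunchConfig, List[int]]:
    """Do what the host side has to do before a raw kernel launch."""
    num_elements = prod(tensors[0].shape)
    return launch_config_1d(num_elements), unpack_args(tensors)


if __name__ == "__main__":
    a = Tensor(data_ptr=0x1000, shape=(1000,))
    b = Tensor(data_ptr=0x2000, shape=(1000,))
    config, raw_args = host_wrapper([a, b])
    print(config, [hex(p) for p in raw_args])
```

If you launch the raw kernel yourself, you have to reproduce both steps by hand and keep them in sync with the schedule, which is what the PackedFunc wrapper already does for you.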
Depending on the schedule, we could also generate a function that contains multiple CUDA kernel launches. If the primary concern is the C++ ABI, the TVM runtime provides a C interface with a stable ABI; see https://github.com/apache/incubator-tvm/blob/main/include/tvm/runtime/c_runtime_api.h