junrushao commented on issue #285: URL: https://github.com/apache/tvm-ffi/issues/285#issuecomment-3576828253
Thanks for asking @dasikuzi2! First of all, there is a printing issue in your script - it runs 1e8 times instead of 1e6. With that corrected, the overhead is 0.3-0.4 us in both TVM-FFI and pybind, which is actually really good. To give some context, we usually expect 3-4 us for ML kernel calls in any meaningful real-world application (e.g. elementwise add, softmax). Without TVM-FFI, it would be >10 us (TileLang / pybind) or >100 us (Triton). See TileLang's numbers for TVM-FFI:

<img width="1103" height="343" alt="Image" src="https://github.com/user-attachments/assets/43e65081-6661-4748-a22c-e21850311d53" />

Another issue is the workload itself. The `add_one_cpu` method you are referring to has this source:

```C++
long long cal(long long a) {
  long long sum = 0;
  for (int i = 0; i < a; ++i) {
    sum += i;
  }
  return sum;
}

long long new_cal(long long a = 0) {  // add a default argument
  return cal(a);                      // call the original cal function with the argument
}
```

which is a pure scalar, non-tensor method - not the tensor-related computation we primarily target. On this workload there is indeed roughly 0.1 us of headroom we could squeeze out, e.g. via direct C DLL calls, but for now I really don't think it's meaningful to chase 0.1 us on non-tensor workloads :)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
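Since the per-call numbers above hinge on dividing total time by the right iteration count (the 1e8-vs-1e6 mixup), here is a minimal, self-contained sketch of that measurement methodology. The `cal` here is a pure-Python stand-in for the exported C++ function (an assumption for illustration, not the actual TVM-FFI or pybind binding); only the total-time / N arithmetic is the point:

```python
import timeit

def cal(a: int) -> int:
    # Pure-Python stand-in for the C++ `cal` above; a real benchmark
    # would call the TVM-FFI or pybind11 binding here instead.
    s = 0
    for i in range(a):
        s += i
    return s

N = 1_000_000  # 1e6 iterations; the script in question accidentally ran 1e8
total_s = timeit.timeit(lambda: cal(0), number=N)
print(f"{total_s / N * 1e6:.3f} us per call")
```

With the stand-in, the printed figure is dominated by Python function-call cost; swapping in the FFI binding gives the 0.3-0.4 us overhead numbers discussed above.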
