junrushao commented on issue #285:
URL: https://github.com/apache/tvm-ffi/issues/285#issuecomment-3576828253

   Thanks for asking @dasikuzi2!
   
   First of all, there are some printing issues in your script - it runs 1e8 
times instead of 1e6. Accounting for that, the per-call overhead is 0.3-0.4 us in 
both TVM-FFI and pybind, which is actually really good.
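   To make the measurement concrete, here is a minimal sketch of how per-call overhead is usually estimated: time a large batch of calls and divide by the count. The helper name `call_overhead_us` and the `noop` stand-in are hypothetical - substitute your actual TVM-FFI or pybind-bound function.

   ```python
   import time

   def call_overhead_us(fn, n=1_000_000):
       """Estimate average per-call overhead in microseconds over n calls."""
       start = time.perf_counter()
       for _ in range(n):
           fn(0)
       elapsed = time.perf_counter() - start
       return elapsed / n * 1e6  # seconds -> microseconds per call

   # Stand-in for a bound FFI function; replace with the real binding.
   def noop(x):
       return x

   print(f"{call_overhead_us(noop, 100_000):.3f} us/call")
   ```

   Note that the loop itself and the Python interpreter add some fixed cost, so this measures binding overhead plus interpreter dispatch, not the binding layer alone.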
   
   To share some context: we usually expect 3-4 us per ML kernel call in any 
meaningful real-world application (e.g. elementwise add, softmax). Without 
TVM-FFI, it would be >10 us (TileLang / pybind) or >100 us (Triton). See 
TileLang's numbers for TVM-FFI:
   
   <img width="1103" height="343" alt="Image" src="https://github.com/user-attachments/assets/43e65081-6661-4748-a22c-e21850311d53" />
   
   Another issue is the workload you are using - the `add_one_cpu` method you 
are referring to has the following source:
   
   ```C++
   long long cal(long long a) {
       long long sum = 0;
       for (int i = 0; i < a; ++i) {
           sum += i;
       }
       return sum;
   }
   
   long long new_cal(long long a = 0) {  // add a default argument
       return cal(a);  // call the original cal function, passing the argument through
   }
   ```
   
   which is a pure scalar, non-tensor method - not the tensor-related computation 
we primarily target. On this workload there is indeed a little room, roughly 
0.1 us, that we could squeeze out, e.g. via direct C DLL calls, but for now I 
really don't think it's meaningful to squeeze 0.1 us out of non-tensor 
workloads :)
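   For reference, a "direct C DLL call" here means invoking the exported C symbol without any binding framework in between. A minimal sketch of that path from Python uses `ctypes`; the example below calls libc's `labs` as a stand-in for an exported scalar function like `cal` (loading `None` resolves symbols from the running process on POSIX systems). This illustrates the mechanism only, not the 0.1 us figure - `ctypes` itself adds its own per-call cost.

   ```python
   import ctypes

   # POSIX: CDLL(None) exposes the symbols already loaded into the process,
   # including libc. For a real kernel you would load its .so by path instead.
   libc = ctypes.CDLL(None)

   labs = libc.labs                      # long labs(long) from <stdlib.h>
   labs.argtypes = [ctypes.c_long]       # declare the C signature explicitly
   labs.restype = ctypes.c_long

   print(labs(-5))  # -> 5
   ```

   Swapping in a hand-rolled entry point like this only ever saves the binding layer's fixed dispatch cost, which is why it matters for tight scalar loops but disappears into the noise for real tensor kernels.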
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to