I am not super familiar with the Unity direction, but keeping BYOC sounds like
a good idea. I don't know if this is how it's supposed to be used, but I am
using it as a "catch-all" way to extend TVM. I'm currently adding some custom
OpenCL kernels for depthwise conv2d; the way I am planning to hook them in is
the usual annotate/partition flow, sketched below.
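A minimal sketch of that flow, where `"my_opencl"` and the depthwise predicate are placeholder names of mine, and a real setup would also need a codegen backend registered under the same name:

```python
import tvm
from tvm import relay

# Mark which ops the hypothetical "my_opencl" codegen can take.
@tvm.ir.register_op_attr("nn.conv2d", "target.my_opencl")
def _depthwise_conv2d_supported(expr):
    # Offload only depthwise convolutions (groups > 1).
    return int(expr.attrs.groups) > 1

def partition_for_my_opencl(mod):
    # Standard BYOC pipeline: annotate supported ops, merge adjacent
    # regions, then split them out into external functions.
    seq = tvm.transform.Sequential([
        relay.transform.AnnotateTarget("my_opencl"),
        relay.transform.MergeCompilerRegions(),
        relay.transform.PartitionGraph(),
    ])
    return seq(mod)
```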
As long as LLM workloads are still composed of tensor programs, TVM just has
to position itself as a general tensor program compiler more than as an ML
compiler. The tensor expression and Ansor projects look perfectly suited for
this.
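That is, the same infrastructure already expresses arbitrary tensor programs with no ML framework involved; a minimal sketch using the classic TE API (Ansor can then search schedules for exactly this kind of expression):

```python
import tvm
from tvm import te

# A plain tensor program: C = A @ B, described as a tensor expression.
N, M, K = 1024, 1024, 1024
A = te.placeholder((N, K), name="A")
B = te.placeholder((K, M), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

# Default schedule just to show it builds end to end.
s = te.create_schedule(C.op)
mod = tvm.build(s, [A, B, C], target="llvm")
```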
---
Thanks!
I'm not familiar with the BitBLAS project. Please correct me if I am wrong: in
the code you showed, the IRModule pass that retrieves the threadblock
dimensions is
[get_annotated_device_mod](https://github.com/microsoft/BitBLAS/blob/2f6d316be9f9d70f2845c2f319ac2f348d0cd6a6/bitblas/uti)
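I can't verify what BitBLAS does internally, but the same information is recoverable directly from a lowered PrimFunc by walking its `thread_extent` annotations; a sketch, assuming the extents are constant:

```python
import tvm
from tvm import tir

def thread_extents(prim_func):
    """Collect blockIdx.*/threadIdx.* extents from a lowered TIR PrimFunc."""
    extents = {}

    def visit(stmt):
        # Thread bindings show up as AttrStmt nodes with key "thread_extent".
        if isinstance(stmt, tir.AttrStmt) and stmt.attr_key == "thread_extent":
            extents[stmt.node.var.name] = int(stmt.value)

    tir.stmt_functor.post_order_visit(prim_func.body, visit)
    return extents
```

Running this over each PrimFunc in the module produced by `tvm.lower` gives the block/thread dimensions per kernel.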
One suggestion that I have for TVM is to add a cleaner exit from the stack.
For example, for OpenCL/CUDA targets, what do I do if I just want the
generated kernels?
Note: there is a way to print the source for CL, but unfortunately I have not
found a way to get the work group / threadblock sizes.
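For reference, printing the CL source works roughly like this (a minimal sketch using the classic TE API; the schedule is just an illustration):

```python
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)

# Bind the loop to GPU axes so codegen emits a kernel.
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))

lib = tvm.build(s, [A, B], target="opencl")
# Device code lives in the imported modules, not the host module.
print(lib.imported_modules[0].get_source())
```

That dumps the kernel bodies, but as far as I can tell the printed source says nothing about the launch dimensions the host side would use.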
@echuraev @elvin-n
How did you get the work group sizes from TVM for the OpenCL target on Adreno
GPUs?
I saw your samples here: [qualcomm/apps/OpenCLCppRunner at master · Deelvin/qualcomm · GitHub](https://github.com/Deelvin/qualcomm/tree/master/apps/OpenCLCppRunner)
I see that you obtain the generated kernels and run them standalone, and I am
curious how you recover the work group sizes for that.
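From what I can tell so far (and I may be wrong), the launch information can also be recovered offline by saving the device module; a sketch, noting that the side-file name and the JSON key vary across TVM versions:

```python
# Reusing `lib` from the OpenCL build snippet earlier in the thread.
dev_mod = lib.imported_modules[0]
dev_mod.save("kernels.cl")

# If I read the runtime code correctly, this writes kernels.cl plus a
# kernels.tvm_meta.json side file. Each function entry there lists its
# launch parameter tags ("launch_param_tags" in recent TVM,
# "thread_axis_tags" in older releases), i.e. which blockIdx.*/threadIdx.*
# axes the kernel expects. Combined with the thread_extent values from the
# lowered TIR, that gives the global/local work sizes to pass to
# clEnqueueNDRangeKernel.
```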