tra added a comment.

How much does this inlining buy you in practice? I.e. what's a typical launch
latency before/after the patch? For CUDA, config push/pop is negligible
compared to the cost of actually launching the kernel on the GPU. It is
measurable if the launch is asynchronous, but queueing kernels fast does not
help all that much in the long run -- you eventually have to run those kernels
on the GPU, so in most cases you just spend a bit more time idling while
waiting for the queued kernels to finish. For this to be beneficial, you'd need
a finely balanced CPU/GPU workload, and that's rather hard to achieve,
certainly not to the point where the minor savings here would be meaningful. I
would assume the situation on AMD GPUs is not that different.
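
For a ballpark number, a trivial micro-benchmark along these lines would do
(just a sketch, not tied to this patch; `empty_kernel` and the iteration count
are placeholders, and the HIP variant would use the corresponding hip* calls):

  #include <chrono>
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void empty_kernel() {}

  int main() {
    constexpr int kIters = 100000;
    // Warm up so context creation/module loading is not measured.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i)
      empty_kernel<<<1, 1>>>();   // asynchronous: measures queueing cost only
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();      // wait for the GPU to drain the queue
    auto t2 = std::chrono::steady_clock::now();

    auto us = [](auto d) {
      return std::chrono::duration<double, std::micro>(d).count();
    };
    printf("queueing: %.3f us/launch, end-to-end: %.3f us/launch\n",
           us(t1 - t0) / kIters, us(t2 - t0) / kIters);
  }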

One side effect of this patch is that there will be no convenient way to set a
host-side breakpoint on a kernel launch.
Another is that examining the call stack will become somewhat confusing, as the
kernel arguments written in the source code will not match those observed in
the stack trace. I guess preserving the appearance of normal function calls was
the reason for the split config-setup/kernel-launch scheme in CUDA. I'd say
it's still useful to have, as a CUDA-specific debugger is not always available
and one must use regular gdb on CUDA apps now and then.
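
For reference, this is roughly what the split looks like on the host side today
(a simplified sketch of compiler-generated code, not something to write by
hand; the __cuda* entry points are internal, and the exact names differ between
CUDA and HIP and between runtime versions):

  // Source:
  //   kernel<<<grid, block, shmem, stream>>>(ptr, n);
  // is lowered to approximately:
  //   __cudaPushCallConfiguration(grid, block, shmem, stream);
  //   kernel(ptr, n);   // calls the host-side stub below, not device code
  //
  // The stub retrieves the pushed configuration and does the real launch, so
  // it is a convenient breakpoint target and shows up in backtraces with the
  // kernel's own argument list:
  void kernel(int *ptr, int n) {
    dim3 grid, block;
    size_t shmem;
    cudaStream_t stream;
    __cudaPopCallConfiguration(&grid, &block, &shmem, &stream);
    void *args[] = {&ptr, &n};
    cudaLaunchKernel((const void *)kernel, grid, block, args, shmem, stream);
  }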

If the patch does give a measurable performance improvement, can we implement
launch config push/pop in a way that the compiler can eliminate by itself when
possible, and keep the stub as the host-side kernel entry point? I would prefer
to avoid sacrificing debugging usability for performance optimizations that may
not matter.



https://reviews.llvm.org/D86376
