tra added a comment. In D86376#2236501 <https://reviews.llvm.org/D86376#2236501>, @yaxunl wrote:
> My previous measurements did not warming up, which caused some one time > overhead due to device initialization and loading of device binary. With warm > up, the call of `__hipPushCallConfigure/__hipPopCallConfigure` takes about 19 > us. Based on the trace from rocprofile, the time spent inside these functions > can be ignored. Most of the time is spent making the calls. These functions > stay in a shared library, which may be the reason why they take such long > time. Making them always_inline may get rid of the overhead, however, that > would require exposing internal data structures. It's still suspiciously high. AFAICT, config/push/pull is just an std::vector push/pop. It should not take *that* long. Few function calls should not lead to microseconds of overhead, once linker has resolved the symbol, if they come from a shared library. https://github.com/ROCm-Developer-Tools/HIP/blob/master/vdi/hip_platform.cpp#L590 I wonder if it's the logging facilities that add all this overhead. > The kernel launching latency are measured by a simple loop in which a simple > kernel is launched then hipStreamSynchronize is called. trace is collected by > rocprofiler and the latency is measured from the end of hipStreamSynchronize > to the real start of kernel execution. Without this patch, the latency is > about 77 us. With this patch, the latency is about 46 us. The improvement is > about 40%. The decrement of 31 us is more than 19 us since it also eliminates > the overhead of kernel stub. This is rather surprising. A function call by itself does *not* have such high overhead. There must be something else. I strongly suspect logging. If you remove logging statements from push/pop without changing anything else, how does that affect performance? >>> I would like to say the motivation of this change is two folds: 1. improve >>> latency 2. interoperability with C++ programs. >> >> Could you elaborate on the "interoperability with C++ programs"? I don't >> think I see how this patch helps with that. Or what exactly is the issue >> with C++ interoperability we have now? > > In HIP program, a global symbol is generated in host binary to identify each > kernel. This symbol is associated with the device kernel by a call of > hipRegisterFunction in init functions. Each time the kernel needs to be > called, the associated symbol is passed to hipLaunchKernel. In host code, > this symbol represents the kernel. Let's call it the kernel symbol. Currently > it is the kernel stub function, however, it could be any global symbol, as > long as it is registered with hipRegisterFunction, then hipLaunchKernel can > use it to find the right kernel and launch it. So far so good, it matches the way CUDA does that. > In a C/C++ program, a kernel is launched by call of hipLaunchKernel with the > kernel symbol. Do you mean the host-side symbol, registered with the runtime that you've described above? Or do you mean that the device-side symbol is somehow visible from the host side. I think that's where HIP is different from CUDA. > Since the kernel symbol is defined in object files generated from HIP. > For C/C++ program, as long as it declares the kernel symbol as an external > function or variable which matches the name of the original symbol, the > linker will resolve to the correct kernel symbol, then the correct kernel can > be launched. The first sentence looks incomplete. It seems to imply that hipLaunchKernel uses the device-side kernel symbol and it's the linker which ties host-side reference with device-side symbol. If that's the case, then I don't understand what purpose is served by hipRegisterFunction. AFAICT, it's not used in this scenario at all. My mental model of kernel launch mechanics looks like this: - For a kernel foo, there is a host-side symbol (it's the stub for CUDA) with the name 'foo' and device-side real kernel 'foo'. - host side linker has no access to device-side symbols, but we do need to associate host and device side 'foo' instances. - address of host-side foo is registered with runtime to map it to device symbol with the name 'foo' - when a kernel is launched, call site sets up launch config and calls the stub, passing it the kernel arguments. - the stub calls the kernel launch function, and passes host-side foo address to the kernel launch function - launch function finds device-side symbol name via the registration info and does device-side address lookup to obtain it's device address - run device-side function. In this scenario, the host-side stub for foo is a regular function, which gdb can stop on and examine kernel arguments. How is the process different for HIP? I know that we've changed the stub name to avoid debugger confusion about which if the entities corresponds to 'foo'. > Here comes the nuance with kernel stub function as the kernel symbol. If you > still remember, there was a previous patch for HIP to change the kernel stub > name. rocgdb requires the device stub to have a different name than the real > kernel, since otherwise it will not be able to break on the real kernel only. > As a result, the kernel stub now has a prefix `__device_stub_` before > mangling. > > For example, a kernel `foo` will have a kernel stub with name > `__device_stub_foo`. > > For a C/C++ program to call kernel `foo`, it needs to declare an external > symbol `__device_stub_foo` then launch it. Of course this is an annoyance for > C/C++ users, especially this involves mangled names. It's all done by compiler under the hood. I'm not sure how the stub name affects C/C++ users. > However, we cannot change the name of the kernel stub to be the same as the > kernel, since that will break rocgdb. > Now the solution is to get rid of the kernel stub function. Instead of use > kernel stub function as kernel symbol, we will emit a global variable as > kernel symbol. This global variable can have the same name as the kernel, > since rocgdb will not break on it. I do not follow your reasoning why the stub name is a problem. It's awkward, yes, but losing the stub as a specific kernel entry point seems to be a real loss in debugability, which is worse, IMO. Could you give me an example where the stub name causes problems? CHANGES SINCE LAST ACTION https://reviews.llvm.org/D86376/new/ https://reviews.llvm.org/D86376 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits