[PATCH] D86376: [HIP] Improve kernel launching latency

Artem Belevich via Phabricator via cfe-commits Tue, 25 Aug 2020 10:35:25 -0700

tra added a comment.

In D86376#2236501 <https://reviews.llvm.org/D86376#2236501>, @yaxunl wrote:


> My previous measurements did not warming up, which caused some one time 
> overhead due to device initialization and loading of device binary. With warm 
> up, the call of `__hipPushCallConfigure/__hipPopCallConfigure` takes about 19 
> us. Based on the trace from rocprofile, the time spent inside these functions 
> can be ignored. Most of the time is spent making the calls. These functions 
> stay in a shared library, which may be the reason why they take such long 
> time. Making them always_inline may get rid of the overhead, however, that 
> would require exposing internal data structures.

It's still suspiciously high. AFAICT, config/push/pull is just an std::vector 
push/pop. It should not take *that* long.  Few function calls should not lead 
to microseconds of overhead, once linker has resolved the symbol, if they come 
from a shared library.
https://github.com/ROCm-Developer-Tools/HIP/blob/master/vdi/hip_platform.cpp#L590

I wonder if it's the logging facilities that add all this overhead.

> The kernel launching latency are measured by a simple loop in which a simple 
> kernel is launched then hipStreamSynchronize is called. trace is collected by 
> rocprofiler and the latency is measured from the end of hipStreamSynchronize 
> to the real start of kernel execution. Without this patch, the latency is 
> about 77 us. With this patch, the latency is about 46 us. The improvement is 
> about 40%. The decrement of 31 us is more than 19 us since it also eliminates 
> the overhead of kernel stub.

This is rather surprising. A function call by itself does *not* have such high 
overhead. There must be something else. I strongly suspect logging. If you 
remove logging statements from push/pop without changing anything else, how 
does that affect performance?

>>> I would like to say the motivation of this change is two folds: 1. improve 
>>> latency 2. interoperability with C++ programs.
>>
>> Could you elaborate on the "interoperability with C++ programs"? I don't 
>> think I see how this patch helps with that. Or what exactly is the issue 
>> with C++ interoperability we have now?
>
> In HIP program, a global symbol is generated in host binary to identify each 
> kernel. This symbol is associated with the device kernel by a call of 
> hipRegisterFunction in init functions. Each time the kernel needs to be 
> called, the associated symbol is passed to hipLaunchKernel. In host code, 
> this symbol represents the kernel. Let's call it the kernel symbol. Currently 
> it is the kernel stub function, however, it could be any global symbol, as 
> long as it is registered with hipRegisterFunction, then hipLaunchKernel can 
> use it to find the right kernel and launch it.

So far so good, it matches the way CUDA does that.

> In a C/C++ program, a kernel is launched by call of hipLaunchKernel with the 
> kernel symbol.

Do you mean the host-side symbol, registered with the runtime that you've 
described above? Or do you mean that the device-side symbol is somehow visible 
from the host side. I think that's where HIP is different from CUDA.

> Since the kernel symbol is defined in object files generated from HIP. 
> For C/C++ program, as long as it declares the kernel symbol as an external 
> function or variable which matches the name of the original symbol, the 
> linker will resolve to the correct kernel symbol, then the correct kernel can 
> be launched.

The first sentence looks incomplete. It seems to imply that hipLaunchKernel 
uses the device-side kernel symbol and it's the linker which ties host-side 
reference with device-side symbol. If that's the case, then I don't understand 
what purpose is served by hipRegisterFunction. AFAICT, it's not used in this 
scenario at all.

My mental model of kernel launch mechanics looks like this:

- For a kernel foo, there is a host-side symbol (it's the stub for CUDA) with 
the name 'foo' and device-side real kernel 'foo'.
- host side linker has no access to device-side symbols, but we do need to 
associate host and device side 'foo' instances.
- address of host-side foo is registered with runtime to map it to device 
symbol with the name 'foo'
- when a kernel is launched, call site sets up launch config and calls the 
stub, passing it the kernel arguments.
- the stub calls the kernel launch function, and passes host-side foo address 
to the kernel launch function
- launch function finds device-side symbol name via the registration info and 
does device-side address lookup to obtain it's device address
- run device-side function.

In this scenario, the host-side stub for foo is a regular function, which gdb 
can stop on and examine kernel arguments.

How is the process different for HIP? I know that we've changed the stub name 
to avoid debugger confusion about which if the entities corresponds to 'foo'.

> Here comes the nuance with kernel stub function as the kernel symbol. If you 
> still remember, there was a previous patch for HIP to change the kernel stub 
> name. rocgdb requires the device stub to have a different name than the real 
> kernel, since otherwise it will not be able to break on the real kernel only. 
> As a result, the kernel stub now has a prefix `__device_stub_` before 
> mangling.
>
> For example, a kernel `foo` will have a kernel stub with name 
> `__device_stub_foo`.
>
> For a C/C++ program to call kernel `foo`, it needs to declare an external 
> symbol `__device_stub_foo` then launch it. Of course this is an annoyance for 
> C/C++ users, especially this involves mangled names.

It's all done by compiler under the hood. I'm not sure how the stub name 
affects C/C++ users.

> However, we cannot change the name of the kernel stub to be the same as the 
> kernel, since that will break rocgdb.
> Now the solution is to get rid of the kernel stub function. Instead of use 
> kernel stub function as kernel symbol, we will emit a global variable as 
> kernel symbol. This global variable can have the same name as the kernel, 
> since rocgdb will not break on it.

I do not follow your reasoning why the stub name is a problem. It's awkward, 
yes, but losing the stub as a specific kernel entry point seems to be a real 
loss in debugability, which is worse, IMO.
Could you give me an example where the stub name causes problems?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D86376: [HIP] Improve kernel launching latency

Reply via email to