[PATCH] D102107: [OpenMP] Codegen aggregate for outlined function captures

Alexey Bataev via Phabricator via cfe-commits Tue, 13 Jul 2021 14:59:22 -0700

ABataev added a comment.

In D102107#2874123 <https://reviews.llvm.org/D102107#2874123>, @jdoerfert wrote:


> In D102107#2867693 <https://reviews.llvm.org/D102107#2867693>, @ABataev wrote:
>
>> In D102107#2867670 <https://reviews.llvm.org/D102107#2867670>, 
>> @josemonsalve2 wrote:
>>
>>> In D102107#2867417 <https://reviews.llvm.org/D102107#2867417>, @ABataev 
>>> wrote:
>>>
>>>> In D102107#2867382 <https://reviews.llvm.org/D102107#2867382>, @jdoerfert 
>>>> wrote:
>>>>
>>>>> In D102107#2832740 <https://reviews.llvm.org/D102107#2832740>, @ABataev 
>>>>> wrote:
>>>>>
>>>>>> In D102107#2832286 <https://reviews.llvm.org/D102107#2832286>, 
>>>>>> @jdoerfert wrote:
>>>>>>
>>>>>>> In D102107#2824581 <https://reviews.llvm.org/D102107#2824581>, @ABataev 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> In D102107#2823706 <https://reviews.llvm.org/D102107#2823706>, 
>>>>>>>> @jdoerfert wrote:
>>>>>>>>
>>>>>>>>> In D102107#2821976 <https://reviews.llvm.org/D102107#2821976>, 
>>>>>>>>> @ABataev wrote:
>>>>>>>>>
>>>>>>>>>> We used this kind of codegen initially but later found out that it 
>>>>>>>>>> causes a large overhead when gathering pointers into a record. What 
>>>>>>>>>> about hybrid scheme where the first args are passed as arguments and 
>>>>>>>>>> others (if any) are gathered into a record?
>>>>>>>>>
>>>>>>>>> I'm confused, maybe I misunderstand the problem. The parallel 
>>>>>>>>> function arguments need to go from the main thread to the workers 
>>>>>>>>> somehow, I don't see how this is done w/o a record. This patch makes 
>>>>>>>>> it explicit though.
>>>>>>>>
>>>>>>>> Pass it in a record for workers only? And use a hybrid scheme for all 
>>>>>>>> other parallel regions.
>>>>>>>
>>>>>>> I still do not follow. What does it mean for workers only? What is a 
>>>>>>> hybrid scheme? And, probably most importantly, how would we not 
>>>>>>> eventually put everything into a record anyway?
>>>>>>
>>>>>> On the host you don’t need to put everything into a record, especially 
>>>>>> for small parallel regions. Pass some first args in registers and only 
>>>>>> the remaining args gather into the record. For workers just pass all 
>>>>>> args in the record.
>>>>>
>>>>> Could you please respond to my question so we make progress here. We 
>>>>> *always* have to pass things in a record, do you agree?
>>>>
>>>> On the GPU device, yes. And I'm absolutely fine with packing args for the 
>>>> GPU device. But the patch packs the args not only for the GPU devices but 
>>>> also for the host and other devices which may not require 
>>>> packing/unpacking. For such devices/host better to avoid packing/unpacking 
>>>> as it introduces overhead in many cases.
>>>
>>> Hi Alexey,
>>>
>>> Wouldn't you always need to pack to pass the arguments to the outlined 
>>> function? What is the benefit of avoiding packing the arguments in the 
>>> runtime call, if then you have to pack them for the outlined function?
>>>
>>> I would really appreciate an example, since I am just getting an 
>>> understanding of OpenMP in LLVM.
>>>
>>> Thanks!
>>
>> Hi, generally speaking, no, you don't need to pack them. Initially, we 
>> packed/unpacked args, but then decided not to do it.
>> Here is an example:
>>
>>   int a, b;
>>   #pragma omp parallel
>>    printf("%d %d\n", a, b);
>>
>> What we generate currently is something like this:
>>
>>   %a = alloca i32
>>   %b = alloca i32
>>   call __kmpc_fork_call(..., @outlined, %a, %b)
>>   ...
>>    internal @outlined(i32 *%a, i32 *%b) {
>>    printf(....);
>>    }
>>
>> `__kmpc_fork_call` inside calls `@outlined` function with the passed args.
>
> While on the user facing side this does not pack the arguments, we still do, 
> and that is the point. In the runtime `__kmp_fork_call` we do
>
>   for (i = argc - 1; i >= 0; --i)
>      *argv++ = va_arg(kmp_va_deref(ap), void *);  
>
> which packs the variadic arguments into a buffer. So eventually we have to 
> walk them and store them into a consecutive buffer. What
> happens before is that we pass some of them in registers but at the end of 
> the day we put them in memory. After all, that is what
> `extern int __kmp_invoke_microtask(microtask_t pkfn, int gtid, int npr, int 
> argc, void *argv[], ...`
> expects.

Ok, then it should be good for the host too. Would be good if you could check 
that it really does not affect the performance and then land it if so.

> If you believe we do not pack/unpack on the host, please walk me through 
> that. As far as I can tell, the "user facing side" might
> look like variadic calls all the way but that is not what is happening in the 
> runtime. Thus, there is no apparent reason to complicate
> the scheme. I'm also happy if you have timing results that indicate otherwise.




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D102107/new/

https://reviews.llvm.org/D102107

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D102107: [OpenMP] Codegen aggregate for outlined function captures

Reply via email to