jdoerfert added a comment. In D102107#2867693 <https://reviews.llvm.org/D102107#2867693>, @ABataev wrote:
> In D102107#2867670 <https://reviews.llvm.org/D102107#2867670>, @josemonsalve2 > wrote: > >> In D102107#2867417 <https://reviews.llvm.org/D102107#2867417>, @ABataev >> wrote: >> >>> In D102107#2867382 <https://reviews.llvm.org/D102107#2867382>, @jdoerfert >>> wrote: >>> >>>> In D102107#2832740 <https://reviews.llvm.org/D102107#2832740>, @ABataev >>>> wrote: >>>> >>>>> In D102107#2832286 <https://reviews.llvm.org/D102107#2832286>, @jdoerfert >>>>> wrote: >>>>> >>>>>> In D102107#2824581 <https://reviews.llvm.org/D102107#2824581>, @ABataev >>>>>> wrote: >>>>>> >>>>>>> In D102107#2823706 <https://reviews.llvm.org/D102107#2823706>, >>>>>>> @jdoerfert wrote: >>>>>>> >>>>>>>> In D102107#2821976 <https://reviews.llvm.org/D102107#2821976>, >>>>>>>> @ABataev wrote: >>>>>>>> >>>>>>>>> We used this kind of codegen initially but later found out that it >>>>>>>>> causes a large overhead when gathering pointers into a record. What >>>>>>>>> about hybrid scheme where the first args are passed as arguments and >>>>>>>>> others (if any) are gathered into a record? >>>>>>>> >>>>>>>> I'm confused, maybe I misunderstand the problem. The parallel function >>>>>>>> arguments need to go from the main thread to the workers somehow, I >>>>>>>> don't see how this is done w/o a record. This patch makes it explicit >>>>>>>> though. >>>>>>> >>>>>>> Pass it in a record for workers only? And use a hybrid scheme for all >>>>>>> other parallel regions. >>>>>> >>>>>> I still do not follow. What does it mean for workers only? What is a >>>>>> hybrid scheme? And, probably most importantly, how would we not >>>>>> eventually put everything into a record anyway? >>>>> >>>>> On the host you don’t need to put everything into a record, especially >>>>> for small parallel regions. Pass some first args in registers and only >>>>> the remaining args gather into the record. For workers just pass all args >>>>> in the record. >>>> >>>> Could you please respond to my question so we make progress here. We >>>> *always* have to pass things in a record, do you agree? >>> >>> On the GPU device, yes. And I'm absolutely fine with packing args for the >>> GPU device. But the patch packs the args not only for the GPU devices but >>> also for the host and other devices which may not require >>> packing/unpacking. For such devices/host better to avoid packing/unpacking >>> as it introduces overhead in many cases. >> >> Hi Alexey, >> >> Wouldn't you always need to pack to pass the arguments to the outlined >> function? What is the benefit of avoiding packing the arguments in the >> runtime call, if then you have to pack them for the outlined function? >> >> I would really appreciate an example, since I am just getting an >> understanding of OpenMP in LLVM. >> >> Thanks! > > Hi, generally speaking, no, you don't need to pack them. Initially, we > packed/unpacked args, but then decided not to do it. > Here is an example: > > int a, b; > #pragma omp parallel > printf("%d %d\n", a, b); > > What we generate currently is something like this: > > %a = alloca i32 > %b = alloca i32 > call __kmpc_fork_call(..., @outlined, %a, %b) > ... > internal @outlined(i32 *%a, i32 *%b) { > printf(....); > } > > `__kmpc_fork_call` inside calls `@outlined` function with the passed args. While on the user facing side this does not pack the arguments, we still do, and that is the point. In the runtime `__kmp_fork_call` we do for (i = argc - 1; i >= 0; --i) *argv++ = va_arg(kmp_va_deref(ap), void *); which packs the variadic arguments into a buffer. So eventually we have to walk them and store them into a consecutive buffer. What happens before is that we pass some of them in registers but at the end of the day we put them in memory. After all, that is what `extern int __kmp_invoke_microtask(microtask_t pkfn, int gtid, int npr, int argc, void *argv[], ...` expects. If you believe we do not pack/unpack on the host, please walk me through that. As far as I can tell, the "user facing side" might look like variadic calls all the way but that is not what is happening in the runtime. Thus, there is no apparent reason to complicate the scheme. I'm also happy if you have timing results that indicate otherwise. Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D102107/new/ https://reviews.llvm.org/D102107 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits