ABataev added a comment. In D102107#2874123 <https://reviews.llvm.org/D102107#2874123>, @jdoerfert wrote:
> In D102107#2867693 <https://reviews.llvm.org/D102107#2867693>, @ABataev wrote: > >> In D102107#2867670 <https://reviews.llvm.org/D102107#2867670>, >> @josemonsalve2 wrote: >> >>> In D102107#2867417 <https://reviews.llvm.org/D102107#2867417>, @ABataev >>> wrote: >>> >>>> In D102107#2867382 <https://reviews.llvm.org/D102107#2867382>, @jdoerfert >>>> wrote: >>>> >>>>> In D102107#2832740 <https://reviews.llvm.org/D102107#2832740>, @ABataev >>>>> wrote: >>>>> >>>>>> In D102107#2832286 <https://reviews.llvm.org/D102107#2832286>, >>>>>> @jdoerfert wrote: >>>>>> >>>>>>> In D102107#2824581 <https://reviews.llvm.org/D102107#2824581>, @ABataev >>>>>>> wrote: >>>>>>> >>>>>>>> In D102107#2823706 <https://reviews.llvm.org/D102107#2823706>, >>>>>>>> @jdoerfert wrote: >>>>>>>> >>>>>>>>> In D102107#2821976 <https://reviews.llvm.org/D102107#2821976>, >>>>>>>>> @ABataev wrote: >>>>>>>>> >>>>>>>>>> We used this kind of codegen initially but later found out that it >>>>>>>>>> causes a large overhead when gathering pointers into a record. What >>>>>>>>>> about hybrid scheme where the first args are passed as arguments and >>>>>>>>>> others (if any) are gathered into a record? >>>>>>>>> >>>>>>>>> I'm confused, maybe I misunderstand the problem. The parallel >>>>>>>>> function arguments need to go from the main thread to the workers >>>>>>>>> somehow, I don't see how this is done w/o a record. This patch makes >>>>>>>>> it explicit though. >>>>>>>> >>>>>>>> Pass it in a record for workers only? And use a hybrid scheme for all >>>>>>>> other parallel regions. >>>>>>> >>>>>>> I still do not follow. What does it mean for workers only? What is a >>>>>>> hybrid scheme? And, probably most importantly, how would we not >>>>>>> eventually put everything into a record anyway? >>>>>> >>>>>> On the host you don’t need to put everything into a record, especially >>>>>> for small parallel regions. Pass some first args in registers and only >>>>>> the remaining args gather into the record. For workers just pass all >>>>>> args in the record. >>>>> >>>>> Could you please respond to my question so we make progress here. We >>>>> *always* have to pass things in a record, do you agree? >>>> >>>> On the GPU device, yes. And I'm absolutely fine with packing args for the >>>> GPU device. But the patch packs the args not only for the GPU devices but >>>> also for the host and other devices which may not require >>>> packing/unpacking. For such devices/host better to avoid packing/unpacking >>>> as it introduces overhead in many cases. >>> >>> Hi Alexey, >>> >>> Wouldn't you always need to pack to pass the arguments to the outlined >>> function? What is the benefit of avoiding packing the arguments in the >>> runtime call, if then you have to pack them for the outlined function? >>> >>> I would really appreciate an example, since I am just getting an >>> understanding of OpenMP in LLVM. >>> >>> Thanks! >> >> Hi, generally speaking, no, you don't need to pack them. Initially, we >> packed/unpacked args, but then decided not to do it. >> Here is an example: >> >> int a, b; >> #pragma omp parallel >> printf("%d %d\n", a, b); >> >> What we generate currently is something like this: >> >> %a = alloca i32 >> %b = alloca i32 >> call __kmpc_fork_call(..., @outlined, %a, %b) >> ... >> internal @outlined(i32 *%a, i32 *%b) { >> printf(....); >> } >> >> `__kmpc_fork_call` inside calls `@outlined` function with the passed args. > > While on the user facing side this does not pack the arguments, we still do, > and that is the point. In the runtime `__kmp_fork_call` we do > > for (i = argc - 1; i >= 0; --i) > *argv++ = va_arg(kmp_va_deref(ap), void *); > > which packs the variadic arguments into a buffer. So eventually we have to > walk them and store them into a consecutive buffer. What > happens before is that we pass some of them in registers but at the end of > the day we put them in memory. After all, that is what > `extern int __kmp_invoke_microtask(microtask_t pkfn, int gtid, int npr, int > argc, void *argv[], ...` > expects. Ok, then it should be good for the host too. Would be good if you could check that it really does not affect the performance and then land it if so. > If you believe we do not pack/unpack on the host, please walk me through > that. As far as I can tell, the "user facing side" might > look like variadic calls all the way but that is not what is happening in the > runtime. Thus, there is no apparent reason to complicate > the scheme. I'm also happy if you have timing results that indicate otherwise. Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D102107/new/ https://reviews.llvm.org/D102107 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits