lildmh marked an inline comment as done. lildmh added inline comments.
================ Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:8739 +/// // For each component specified by this mapper: +/// if (currentComponent.hasMapper()) +/// (*currentComponent.Mapper())(rt_mapper_handle, arg_base, arg_begin, ---------------- ABataev wrote: > lildmh wrote: > > ABataev wrote: > > > lildmh wrote: > > > > ABataev wrote: > > > > > lildmh wrote: > > > > > > ABataev wrote: > > > > > > > lildmh wrote: > > > > > > > > ABataev wrote: > > > > > > > > > lildmh wrote: > > > > > > > > > > ABataev wrote: > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > Currently `currentComponent` is generated by the > > > > > > > > > > > > > > > compiler. But can we instead pass this data as an > > > > > > > > > > > > > > > extra parameter to this `omp_mapper` function. > > > > > > > > > > > > > > Emm, I think this scheme will be very difficult and > > > > > > > > > > > > > > inefficient. If we pass components as an argument > > > > > > > > > > > > > > of `omp_mapper` function, it means that the runtime > > > > > > > > > > > > > > needs to generate all components related to a map > > > > > > > > > > > > > > clause. I don't think the runtime is able to do > > > > > > > > > > > > > > that efficiently. On the other hand, in the current > > > > > > > > > > > > > > scheme, these components are naturally generated by > > > > > > > > > > > > > > the compiler, and the runtime only needs to know > > > > > > > > > > > > > > the base pointer, pointer, type, size. etc. > > > > > > > > > > > > > With the current scheme, we may end with the code > > > > > > > > > > > > > blowout. We need to generate very similar code for > > > > > > > > > > > > > different types and variables. The worst thing here > > > > > > > > > > > > > is that we will be unable to optimize this huge > > > > > > > > > > > > > amount of code because the codegen relies on the > > > > > > > > > > > > > runtime functions and the code cannot be inlined. > > > > > > > > > > > > > That's why I would like to move as much as possible > > > > > > > > > > > > > code to the runtime rather than to emit it in the > > > > > > > > > > > > > compiler. > > > > > > > > > > > > I understand your concerns. I think this is the best we > > > > > > > > > > > > can do right now. > > > > > > > > > > > > > > > > > > > > > > > > The most worrisome case will be when we have nested > > > > > > > > > > > > mappers within each other. In this case, a mapper > > > > > > > > > > > > function will call another mapper function. We can > > > > > > > > > > > > inline the inner mapper functions in this scenario, so > > > > > > > > > > > > that these mapper function can be properly optimized. > > > > > > > > > > > > As a result, I think the performance should be fine. > > > > > > > > > > > Instead, we can use indirect function calls passed in the > > > > > > > > > > > array to the runtime. Do you think it is going to be > > > > > > > > > > > slower? In your current scheme, we generate many runtime > > > > > > > > > > > calls instead. Could you try to estimate the number of > > > > > > > > > > > calls in cases if we'll call the mappers through the > > > > > > > > > > > indirect function calls and in your cuurent scheme, where > > > > > > > > > > > we need to call the runtime functions many times in each > > > > > > > > > > > particular mapper? > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > Sorry I don't understand your idea. What indirect function > > > > > > > > > > calls do you propose to be passed to the runtime? What are > > > > > > > > > > these functions supposed to do? > > > > > > > > > > > > > > > > > > > > The number of function calls will be exactly equal to the > > > > > > > > > > number of components mapped, no matter whether there are > > > > > > > > > > nested mappers or not. The number of components depend on > > > > > > > > > > the program. E.g., if we map a large array section, then > > > > > > > > > > there will be many more function calls. > > > > > > > > > I mean the pointers to the mapper function, generated by the > > > > > > > > > compiler. In your comment, it is `c.Mapper()` > > > > > > > > If we pass nested mapper functions to the runtime, I think it > > > > > > > > will slow down execution because of the extra level of indirect > > > > > > > > function calls. E.g., the runtime will call `omp_mapper1`, > > > > > > > > which calls the runtime back, which calls `omp_mapper2`, .... > > > > > > > > This can result in a deep call stack. > > > > > > > > > > > > > > > > I think the current implementation will be more efficient, > > > > > > > > which doesn't pass nested mappers to the runtime. One call to > > > > > > > > the outer most mapper function will have all data mapping done. > > > > > > > > The call stack will be 2 level deep (the first level is the > > > > > > > > mapper function, and the second level is > > > > > > > > `__tgt_push_mapper_component`) in this case from the runtime. > > > > > > > > There are also more compiler optimization space when we inline > > > > > > > > all nested mapper functions. > > > > > > > Yes, if we leave it as is. But if instead of the bunch unique > > > > > > > functions we'll have the common one, that accept list if indirect > > > > > > > pointers to functions additionally, and move it to the runtime > > > > > > > library, we won't need those 2 functions we have currently. We'll > > > > > > > have full access to the mapping data vector in the runtime > > > > > > > library and won't need to use those 2 accessors we have > > > > > > > currently. Instead, we'll need just one runtime functions, which > > > > > > > implements the whole mapping logic. We still need to call it > > > > > > > recursively, but I assume the number of calls will remain the > > > > > > > same as in the current scheme. Did you understand the idea? If > > > > > > > yes, it would good if you coild try to estimate the number of > > > > > > > function calls in current scheme and in this new scheme to > > > > > > > estimate possible pros and cons. > > > > > > Hi Alexey, > > > > > > > > > > > > Could you give an example for this scheme? 1) I don't understand > > > > > > how the mapper function can have full access to the mapping data > > > > > > vector without providing these 2 accessors. 2) I don't think it is > > > > > > possible to have a common function instead of bunch of unique > > > > > > functions for each mapper declared. > > > > > Hi Lingda, something like this. > > > > > ``` > > > > > void __tgt_mapper(void *base, void *begin, size_t size, int64_t type, > > > > > auto components[]) { > > > > > // Allocate space for an array section first. > > > > > if (size > 1 && !maptype.IsDelete) > > > > > <push>(base, begin, size*sizeof(Ty), clearToFrom(type)); > > > > > > > > > > // Map members. > > > > > for (unsigned i = 0; i < size; i++) { > > > > > // For each component specified by this mapper: > > > > > for (auto c : components) { > > > > > if (c.hasMapper()) > > > > > (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, > > > > > c.arg_type); > > > > > else > > > > > <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type); > > > > > } > > > > > } > > > > > // Delete the array section. > > > > > if (size > 1 && maptype.IsDelete) > > > > > <push>(base, begin, size*sizeof(Ty), clearToFrom(type)); > > > > > } > > > > > > > > > > void <type>.mapper(void *base, void *begin, size_t size, int64_t > > > > > type) { > > > > > auto sub_components[] = {...}; > > > > > __tgt_mapper(base, begin, size, type, sub_components); > > > > > } > > > > > ``` > > > > > > > > > Hi Alexey, > > > > > > > > I don't think this scheme is more efficient than the current scheme. My > > > > reasons are: > > > > > > > > 1) Most code here is essentially to generate `components`, i.e., we > > > > need to generate `c.arg_base, c.arg_begin, c.arg_size, c.arg_type` for > > > > each `c` in `components`, so there will still be a lot of code in > > > > `<type>.mapper`. It will not reduce the mapper function code, i.e., we > > > > will still have a bunch of unique mapper functions. > > > > > > > > 2) This scheme will prevent a lot of compiler optimization from > > > > happening. In reality, a lot of computation should be redundant. E.g., > > > > for two components `c1` and `c2`, `c1`'s base may be the same as `c2`'s > > > > begin, so the compiler will be able to eliminate these reduction > > > > computation, especially when we inline all nested mapper functions > > > > together. If we move these computation into the runtime, the compiler > > > > will not be able to do such optimization. > > > > > > > > 3) In terms of the number of `push` function calls, this scheme has the > > > > exact same number of calls as the current scheme, so I don't think this > > > > scheme can bring performance benefits. The scheme should perform worse > > > > than the current scheme, because it reduces the opportunities of > > > > compiler optimization as mentioned above. > > > Hi Lingda, I'm trying to simplify the code generated by clang and avoid > > > some unnecessary code duplications. If the complexity of this scheme is > > > the same as proposed by you, I would prefer to use this scheme unless > > > there are some other opinions. > > > 1. It is not a problem. This code is unique and is not duplicated in the > > > different mappers. > > > 2. Inlining is no solution here. We still generate to much code, which is > > > almost the same in many cases and it will lead to very ineffective > > > codegen because we still end up with a lot of almost the same code. This > > > also might lead to poor performance. > > > 3. Yes, the number of pushes is always the same, in all possible schemes. > > > It would be good to compare somehow the performance of both schemes, at > > > least preliminary. > > > > > > Also, this solution reduces the number of required runtime functions, > > > instead of 2 we need just 1 and, thus, we need to make fewer runtime > > > functions calls. > > > > > > I think it would better to propose this scheme as an alternate design and > > > discuss it in the OpenMP telecon. What do you think? Or we can try to > > > discuss it in the offline mode via the e-mail with other members. > > > I'm not trying to convince you to implement this scheme right now, but it > > > would be good to discuss it. Maybe it will lead to some better ideas from > > > others? > > Hi Alexey, > > > > I still prefer the current scheme, because: > > 1) I don't like recursive mapper calls, which goes back to my original > > scheme a little bit. I really think inlining can make a big difference when > > we have nested mappers. These compiler optimizations are the keys to have > > better performance for mappers. > > 2) I don't think the codegen here is inefficient. Yes there is duplicated > > code across different mapper functions, but why that will lead to poor > > performance? > > 3) Although we have 2 runtime functions now, the > > `__tgt_mapper_num_components` is called only once per mapper. It should > > have very negligible performance impact. > > > > But if you have a different option, we can discuss it next time in the > > meeting. I do have a time constraint to work on the mapper implementation. > > I'll no longer work in this project starting this September, and I have > > about 30% of my time working on it until then. > Lingda, > 1. We have recursive (actually, not recursive, because you cannot use types > recursively) mappers calls anyway, it is nature of struсtures/classes. > 2. We have a lot of similar code. And I'm not sure that it can be optimized > out. > 3. Yes, but it means that we have n extra runtime calls, where n is the > number of branches in the structure/class tree. > > I see :(. I understand your concern. In this case, we could try to discuss it > offline, in the mailing list, to make it a little bit faster. We just need to > hear other opinions on this matter, maybe there are some other pros and cons > for these schemes. Hi Alexey, Sure, let's discuss this in the mailing list. I'll summarize it and send it to the mailing list later. > We have recursive (actually, not recursive, because you cannot use types > recursively) mappers calls anyway, it is nature of struсtures/classes. We won't have recursive calls with inlining. > We have a lot of similar code. And I'm not sure that it can be optimized out. I think it's even harder to optimized these code out when we move them into the runtime. > Yes, but it means that we have n extra runtime calls, where n is the number > of branches in the structure/class tree. I don't quite understand. It's still equal to the number of mappers in any case. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D59474/new/ https://reviews.llvm.org/D59474 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits