[PATCH] D59474: [OpenMP 5.0] Codegen support for user-defined mappers

Alexey Bataev via Phabricator via cfe-commits Wed, 26 Jun 2019 13:20:47 -0700

ABataev added inline comments.


================
Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:8739
+///     // For each component specified by this mapper:
+///     if (currentComponent.hasMapper())
+///       (*currentComponent.Mapper())(rt_mapper_handle, arg_base, arg_begin,
----------------
lildmh wrote:
> ABataev wrote:
> > lildmh wrote:
> > > ABataev wrote:
> > > > lildmh wrote:
> > > > > ABataev wrote:
> > > > > > lildmh wrote:
> > > > > > > ABataev wrote:
> > > > > > > > lildmh wrote:
> > > > > > > > > ABataev wrote:
> > > > > > > > > > lildmh wrote:
> > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > Currently `currentComponent` is generated by 
> > > > > > > > > > > > > > > > the compiler. But can we instead pass this data 
> > > > > > > > > > > > > > > > as an extra parameter to this `omp_mapper` 
> > > > > > > > > > > > > > > > function.
> > > > > > > > > > > > > > > Emm, I think this scheme will be very difficult 
> > > > > > > > > > > > > > > and inefficient. If we pass components as an 
> > > > > > > > > > > > > > > argument of `omp_mapper` function, it means that 
> > > > > > > > > > > > > > > the runtime needs to generate all components 
> > > > > > > > > > > > > > > related to a map clause. I don't think the 
> > > > > > > > > > > > > > > runtime is able to do that efficiently. On the 
> > > > > > > > > > > > > > > other hand, in the current scheme, these 
> > > > > > > > > > > > > > > components are naturally generated by the 
> > > > > > > > > > > > > > > compiler, and the runtime only needs to know the 
> > > > > > > > > > > > > > > base pointer, pointer, type, size. etc.
> > > > > > > > > > > > > > With the current scheme, we may end with the code 
> > > > > > > > > > > > > > blowout. We need to generate very similar code for 
> > > > > > > > > > > > > > different types and variables. The worst thing here 
> > > > > > > > > > > > > > is that we will be unable to optimize this huge 
> > > > > > > > > > > > > > amount of code because the codegen relies on the 
> > > > > > > > > > > > > > runtime functions and the code cannot be inlined. 
> > > > > > > > > > > > > > That's why I would like to move as much as possible 
> > > > > > > > > > > > > > code to the runtime rather than to emit it in the 
> > > > > > > > > > > > > > compiler. 
> > > > > > > > > > > > > I understand your concerns. I think this is the best 
> > > > > > > > > > > > > we can do right now.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The most worrisome case will be when we have nested 
> > > > > > > > > > > > > mappers within each other. In this case, a mapper 
> > > > > > > > > > > > > function will call another mapper function. We can 
> > > > > > > > > > > > > inline the inner mapper functions in this scenario, 
> > > > > > > > > > > > > so that these mapper function can be properly 
> > > > > > > > > > > > > optimized. As a result, I think the performance 
> > > > > > > > > > > > > should be fine.
> > > > > > > > > > > > Instead, we can use indirect function calls passed in 
> > > > > > > > > > > > the array to the runtime. Do you think it is going to 
> > > > > > > > > > > > be slower? In your current scheme, we generate many 
> > > > > > > > > > > > runtime calls instead. Could you try to estimate the 
> > > > > > > > > > > > number of calls in cases if we'll call the mappers 
> > > > > > > > > > > > through the indirect function calls and in your cuurent 
> > > > > > > > > > > > scheme, where we need to call the runtime functions 
> > > > > > > > > > > > many times in each particular mapper?
> > > > > > > > > > > Hi Alexey,
> > > > > > > > > > > 
> > > > > > > > > > > Sorry I don't understand your idea. What indirect 
> > > > > > > > > > > function calls do you propose to be passed to the 
> > > > > > > > > > > runtime? What are these functions supposed to do?
> > > > > > > > > > > 
> > > > > > > > > > > The number of function calls will be exactly equal to the 
> > > > > > > > > > > number of components mapped, no matter whether there are 
> > > > > > > > > > > nested mappers or not. The number of components depend on 
> > > > > > > > > > > the program. E.g., if we map a large array section, then 
> > > > > > > > > > > there will be many more function calls.
> > > > > > > > > > I mean the pointers to the mapper function, generated by 
> > > > > > > > > > the compiler. In your comment, it is `c.Mapper()`
> > > > > > > > > If we pass nested mapper functions to the runtime, I think it 
> > > > > > > > > will slow down execution because of the extra level of 
> > > > > > > > > indirect function calls. E.g., the runtime will call 
> > > > > > > > > `omp_mapper1`, which calls the runtime back, which calls 
> > > > > > > > > `omp_mapper2`, .... This can result in a deep call stack.
> > > > > > > > > 
> > > > > > > > > I think the current implementation will be more efficient, 
> > > > > > > > > which doesn't pass nested mappers to the runtime. One call to 
> > > > > > > > > the outer most mapper function will have all data mapping 
> > > > > > > > > done. The call stack will be 2 level deep (the first level is 
> > > > > > > > > the mapper function, and the second level is 
> > > > > > > > > `__tgt_push_mapper_component`) in this case from the runtime. 
> > > > > > > > > There are also more compiler optimization space when we 
> > > > > > > > > inline all nested mapper functions.
> > > > > > > > Yes, if we leave it as is. But if instead of the bunch unique 
> > > > > > > > functions we'll have the common one, that accept list if 
> > > > > > > > indirect pointers to functions additionally, and move it to the 
> > > > > > > > runtime library, we won't need those 2 functions we have 
> > > > > > > > currently. We'll have full access to the mapping data vector in 
> > > > > > > > the runtime library and won't need to use those 2 accessors we 
> > > > > > > > have currently. Instead, we'll need just one runtime functions, 
> > > > > > > > which implements the whole mapping logic. We still need to call 
> > > > > > > > it recursively, but I assume the number of calls will remain 
> > > > > > > > the same as in the current scheme. Did you understand the idea? 
> > > > > > > > If yes, it would good if you coild try to estimate the number 
> > > > > > > > of function calls in current scheme and in this new scheme to 
> > > > > > > > estimate possible pros and cons.
> > > > > > > Hi Alexey,
> > > > > > > 
> > > > > > > Could you give an example for this scheme? 1) I don't understand 
> > > > > > > how the mapper function can have full access to the mapping data 
> > > > > > > vector without providing these 2 accessors. 2) I don't think it 
> > > > > > > is possible to have a common function instead of bunch of unique 
> > > > > > > functions for each mapper declared.
> > > > > > Hi Lingda, something like this.
> > > > > > ```
> > > > > > void __tgt_mapper(void *base, void *begin, size_t size, int64_t 
> > > > > > type, auto components[]) {
> > > > > >   // Allocate space for an array section first.
> > > > > >   if (size > 1 && !maptype.IsDelete)
> > > > > >      <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
> > > > > >    
> > > > > >   // Map members.
> > > > > >   for (unsigned i = 0; i < size; i++) {
> > > > > >      // For each component specified by this mapper:
> > > > > >      for (auto c : components) {
> > > > > >        if (c.hasMapper())
> > > > > >          (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, 
> > > > > > c.arg_type);
> > > > > >        else
> > > > > >          <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
> > > > > >      }
> > > > > >   }
> > > > > >   // Delete the array section.
> > > > > >   if (size > 1 && maptype.IsDelete)
> > > > > >     <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
> > > > > > }
> > > > > > 
> > > > > > void <type>.mapper(void *base, void *begin, size_t size, int64_t 
> > > > > > type) {
> > > > > >  auto sub_components[] = {...};
> > > > > >  __tgt_mapper(base, begin, size, type, sub_components);
> > > > > > }
> > > > > > ```
> > > > > > 
> > > > > Hi Alexey,
> > > > > 
> > > > > I don't think this scheme is more efficient than the current scheme. 
> > > > > My reasons are:
> > > > > 
> > > > > 1) Most code here is essentially to generate `components`, i.e., we 
> > > > > need to generate `c.arg_base, c.arg_begin, c.arg_size, c.arg_type` 
> > > > > for each `c` in `components`, so there will still be a lot of code in 
> > > > > `<type>.mapper`. It will not reduce the mapper function code, i.e., 
> > > > > we will still have a bunch of unique mapper functions.
> > > > > 
> > > > > 2) This scheme will prevent a lot of compiler optimization from 
> > > > > happening. In reality, a lot of computation should be redundant. 
> > > > > E.g., for two components `c1` and `c2`, `c1`'s base may be the same 
> > > > > as `c2`'s begin, so the compiler will be able to eliminate these 
> > > > > reduction computation, especially when we inline all nested mapper 
> > > > > functions together. If we move these computation into the runtime, 
> > > > > the compiler will not be able to do such optimization.
> > > > > 
> > > > > 3) In terms of the number of `push` function calls, this scheme has 
> > > > > the exact same number of calls as the current scheme, so I don't 
> > > > > think this scheme can bring performance benefits. The scheme should 
> > > > > perform worse than the current scheme, because it reduces the 
> > > > > opportunities of compiler optimization as mentioned above.
> > > > Hi Lingda, I'm trying to simplify the code generated by clang and avoid 
> > > > some unnecessary code duplications. If the complexity of this scheme is 
> > > > the same as proposed by you, I would prefer to use this scheme unless 
> > > > there are some other opinions.
> > > > 1. It is not a problem. This code is unique and is not duplicated in 
> > > > the different mappers.
> > > > 2. Inlining is no solution here. We still generate to much code, which 
> > > > is almost the same in many cases and it will lead to very ineffective 
> > > > codegen because we still end up with a lot of almost the same code. 
> > > > This also might lead to poor performance.
> > > > 3. Yes, the number of pushes is always the same, in all possible 
> > > > schemes. It would be good to compare somehow the performance of both 
> > > > schemes, at least preliminary.
> > > > 
> > > > Also, this solution reduces the number of required runtime functions, 
> > > > instead of 2 we need just 1 and, thus, we need to make fewer runtime 
> > > > functions calls.
> > > > 
> > > > I think it would better to propose this scheme as an alternate design 
> > > > and discuss it in the OpenMP telecon. What do you think? Or we can try 
> > > > to discuss it in the offline mode via the e-mail with other members.
> > > > I'm not trying to convince you to implement this scheme right now, but 
> > > > it would be good to discuss it. Maybe it will lead to some better ideas 
> > > > from others?
> > > Hi Alexey,
> > > 
> > > I still prefer the current scheme, because:
> > > 1) I don't like recursive mapper calls, which goes back to my original 
> > > scheme a little bit. I really think inlining can make a big difference 
> > > when we have nested mappers. These compiler optimizations are the keys to 
> > > have better performance for mappers.
> > > 2) I don't think the codegen here is inefficient. Yes there is duplicated 
> > > code across different mapper functions, but why that will lead to poor 
> > > performance?
> > > 3) Although we have 2 runtime functions now, the 
> > > `__tgt_mapper_num_components` is called only once per mapper. It should 
> > > have very negligible performance impact.
> > > 
> > > But if you have a different option, we can discuss it next time in the 
> > > meeting. I do have a time constraint to work on the mapper 
> > > implementation. I'll no longer work in this project starting this 
> > > September, and I have about 30% of my time working on it until then.
> > Lingda, 
> > 1. We have recursive (actually, not recursive, because you cannot use types 
> > recursively) mappers calls anyway, it is nature of struсtures/classes.
> > 2. We have a lot of similar code. And I'm not sure that it can be optimized 
> > out.
> > 3. Yes, but it means that we have n extra runtime calls, where n is the 
> > number of branches in the structure/class tree.
> > 
> > I see :(. I understand your concern. In this case, we could try to discuss 
> > it offline, in the mailing list, to make it a little bit faster. We just 
> > need to hear other opinions on this matter, maybe there are some other pros 
> > and cons for these schemes.
> Hi Alexey,
> 
> Sure, let's discuss this in the mailing list. I'll summarize it and send it 
> to the mailing list later.
> 
> > We have recursive (actually, not recursive, because you cannot use types 
> > recursively) mappers calls anyway, it is nature of struсtures/classes.
> We won't have recursive calls with inlining.
> 
> > We have a lot of similar code. And I'm not sure that it can be optimized 
> > out.
> I think it's even harder to optimized these code out when we move them into 
> the runtime.
> 
> > Yes, but it means that we have n extra runtime calls, where n is the number 
> > of branches in the structure/class tree.
> I don't quite understand. It's still equal to the number of mappers in any 
> case.
> Sure, let's discuss this in the mailing list. I'll summarize it and send it 
> to the mailing list later.

Good, thanks!

> We won't have recursive calls with inlining.

We won't have recursive calls anyway (recursive types are not allowed). Plus, 
I'm not sure that inlining is the best option here. We have a lot of code for 
each mapper and I'm not sure that the optimizer will be able to squash it 
effectively.

> I think it's even harder to optimized these code out when we move them into 
> the runtime.

Definitely not, unless we use LTO or inlined runtime.



CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59474/new/

https://reviews.llvm.org/D59474



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D59474: [OpenMP 5.0] Codegen support for user-defined mappers

Reply via email to