Hi!

As I'm touching areas of GCC here that I have no noteworthy experience
with (IPA optimizations, cgraph), I'm asking for your help.  Thanks!

This is primarily to implement a better "avoid offloading" policy for
un-parallelized OpenACC kernels constructs with nvptx offloading,
<http://news.gmane.org/find-root.php?message_id=%3C87twlf7ego.fsf%40hertz.schwinge.homeip.net%3E>,
but I imagine that potentially also any non-nvptx offloading can benefit
from this, by better parameterization of GCC's optimization passes.

For this consideration, simplified, "offloading" means that:

    int main()
    {
      [block 1]
      #pragma omp target
        {
          [block 2]
        }
      [block 3]
    }

... is re-written into something like:

    void main_offloaded()
    {
      [block 2]
    }

    int main()
    {
      [block 1]
      GOMP_target(&main_offloaded);
      [block 3]
    }

..., and the code of main_offloaded is then not run on the CPU; instead,
GOMP_target launches it on an offloading device (which is not actually
important for this consideration).

GOMP_target, described by gcc/omp-builtins.def:BUILT_IN_GOMP_TARGET, is
implemented externally to the compiler (in libgomp), but we "control" its
implementation, and so we're free to have the compiler make certain
assumptions about its behavior.

(I guess) due to it being passed to the intermediary GOMP_target call,
the address of main_offloaded "escapes".  But, as we know how and for
what purpose GOMP_target uses it, we should be able to (teach the
compiler to) handle it like a "static" function -- I hope.
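
To make the "escaping" concrete, here is a small host-only sketch in
plain C.  It is only an illustration: fake_GOMP_target and host_code are
hypothetical names, and the two-argument signature is made up (the real
GOMP_target in libgomp takes more arguments and launches on a device).
The point is just that main_offloaded has its address taken and handed to
another function, while we know that function does nothing with it but
call it.

```c
#include <assert.h>
#include <stddef.h>

/* Data shared between the host code and the outlined region.  */
struct region_data
{
  int x;
};

/* Outlined body of the target region ("[block 2]").  It is static, but
   its address is taken below, which is what makes it "escape".  */
static void
main_offloaded (void *data)
{
  struct region_data *d = data;
  d->x += 1;
}

/* Hypothetical stand-in for GOMP_target.  It calls FN once with DATA and
   does not store FN anywhere.  For a function defined outside of the
   translation unit, the compiler cannot know this unless told so.  */
static void
fake_GOMP_target (void (*fn) (void *), void *data)
{
  assert (fn != NULL);
  fn (data);
}

/* Plays the role of main in the example above.  */
int
host_code (void)
{
  struct region_data d = { 10 };          /* [block 1] */
  fake_GOMP_target (main_offloaded, &d);  /* the offloaded region */
  return d.x;                             /* [block 3] */
}
```

Calling host_code() runs the "region" on the host through the
intermediary, the same shape as the fun_s example but with one level of
indirection in between.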

For reference, I'm assuming the example above should look/work a bit like
the following example:

    static __attribute__((noinline)) int fun_s(int x)
    {
      return x + 1;
    }
    
    int main()
    {
      return fun_s(10);
    }

..., where I do observe that, for example, main's
NODE_FREQUENCY_EXECUTED_ONCE is propagated to fun_s.  (For main, it is
initialized in gcc/predict.c:compute_function_frequency, and the
propagation to fun_s then happens in
gcc/ipa-profile.c:ipa_propagate_frequency.)  See
test.c.067i.profile_estimate: "Node fun_s promoted to executed once".
However, this node frequency propagation does not currently happen for
main_offloaded.  Looking into this for a bit, I think I have identified
two issues.

IPA/cgraph does not consider main_offloaded to be "local" (as in struct
cgraph_local_info; handled/set in
gcc/ipa-visibility.c:cgraph_node::local_p), so it does not even attempt
to do such optimizations.  (But, per my comment above, we should actually
be able to treat main_offloaded at least similarly to a "static"
function.)

If I hack gcc/ipa-visibility.c:cgraph_node::local_p to forcefully mark
main_offloaded as "local", the node frequency propagation still doesn't
work correctly, because no callers (cgraph_edge) have been registered for
the respective cgraph_node, so gcc/ipa-profile.c:ipa_propagate_frequency
doesn't have the data available that it needs in order to set the
ipa_propagate_frequency_data, which later will be used to set the node
frequency.

As gcc/cgraphbuild.c already has some support for gomp_parallel and
gomp_task (which I understand to have similar semantics in that regard,
compared to gomp_target), maybe extending it will help with the latter
issue?  (That bit of code handling gomp_* looks like its purpose is to
"pass through" function addresses, such as main_offloaded, which are
called through intermediary functions, such as GOMP_target, but I have
not yet confirmed whether that's exactly what's being done there.)
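
To illustrate how I think the two issues interact, here is a toy model
in plain C.  This is not the code in gcc/ipa-profile.c or
gcc/cgraphbuild.c; all names (toy_node, toy_add_edge,
toy_propagate_frequency, toy_scenario) are hypothetical.  It rests on a
simplifying assumption of mine: a node is promoted to "executed once"
only if it is local, has at least one registered caller, and all of its
registered callers are executed once.  What GCC really does for a node
without any callers may well differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_CALLERS = 4 };

/* Toy call graph node; loosely modeled on the idea of a cgraph node,
   but not GCC's data structure.  */
struct toy_node
{
  const char *name;
  bool local;           /* Treated like a "static" function.  */
  bool executed_once;
  struct toy_node *callers[MAX_CALLERS];
  size_t n_callers;
};

/* Register CALLER as a caller of CALLEE.  */
static void
toy_add_edge (struct toy_node *caller, struct toy_node *callee)
{
  assert (callee->n_callers < MAX_CALLERS);
  callee->callers[callee->n_callers++] = caller;
}

/* Promote NODE to "executed once" if it is local, has at least one
   registered caller, and all registered callers are executed once.
   Return true if NODE was promoted.  */
static bool
toy_propagate_frequency (struct toy_node *node)
{
  if (!node->local || node->n_callers == 0 || node->executed_once)
    return false;
  for (size_t i = 0; i < node->n_callers; i++)
    if (!node->callers[i]->executed_once)
      return false;
  node->executed_once = true;
  return true;
}

/* The scenario from this mail.  FORCE_LOCAL models the local_p hack;
   PASS_THROUGH models recording the call through the intermediary as
   an edge from main to main_offloaded.  */
bool
toy_scenario (bool force_local, bool pass_through)
{
  struct toy_node main_node = { "main", false, true, { NULL }, 0 };
  struct toy_node offloaded
    = { "main_offloaded", force_local, false, { NULL }, 0 };

  if (pass_through)
    toy_add_edge (&main_node, &offloaded);
  return toy_propagate_frequency (&offloaded);
}
```

In this model, toy_scenario returns true only when both arguments are
true: marking the node "local" alone is not enough, and neither is the
"pass through" edge alone.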

What to do about the cgraph_node "local" bit for main_offloaded?

Am I generally on the right track?


Regards
 Thomas
