Hi!

As I'm touching areas of GCC here that I have no noteworthy experience with (IPA optimizations, cgraph), I'm asking for your help. Thanks!
This is primarily to implement a better "avoid offloading" policy for un-parallelized OpenACC kernels constructs with nvptx offloading, <http://news.gmane.org/find-root.php?message_id=%3C87twlf7ego.fsf%40hertz.schwinge.homeip.net%3E>, but I imagine that any non-nvptx offloading could potentially also benefit from this, through better parameterization of GCC's optimization passes.

For this consideration, simplified, "offloading" means that:

    int main()
    {
      [block 1]
    #pragma omp target
      {
        [block 2]
      }
      [block 3]
    }

... is re-written into something like:

    void main_offloaded()
    {
      [block 2]
    }

    int main()
    {
      [block 1]
      GOMP_target(&main_offloaded);
      [block 3]
    }

..., and the code of main_offloaded is then not run on the CPU, but is launched from within GOMP_target to execute on an offloading device (not actually important for this consideration).

GOMP_target, described by gcc/omp-builtins.def:BUILT_IN_GOMP_TARGET, is implemented externally to the compiler (in libgomp), but we "control" its implementation, and so we're free to have the compiler make certain assumptions about its behavior.

(I guess) because it is passed to the intermediary GOMP_target call, the address of main_offloaded "escapes". But, as we know how and what for GOMP_target is using it, we should be able to (teach the compiler to) handle it like a "static" function -- I hope.

For reference, I'm assuming the example above should look/work a bit like the following example:

    static __attribute__((noinline)) int fun_s(int x)
    {
      return x + 1;
    }

    int main()
    {
      return fun_s(10);
    }

..., where I do observe that, for example, main's NODE_FREQUENCY_EXECUTED_ONCE is propagated to fun_s. (For main, it is initialized in gcc/predict.c:compute_function_frequency, and the propagation to fun_s then happens in gcc/ipa-profile.c:ipa_propagate_frequency.) See test.c.067i.profile_estimate: "Node fun_s promoted to executed once".

However, this node frequency propagation does not currently happen for main_offloaded.
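To make the rewritten form above concrete, here is a minimal host-only sketch in plain C. It rests on assumptions: mock_GOMP_target, offload_fn, launches, and run_example are hypothetical names made up for illustration, and the mock simply calls the function on the host. The real GOMP_target in libgomp takes more arguments and launches the code on an offloading device. The point is only to show the indirection: main_offloaded is never called directly, its address is handed to an intermediary, which is presumably why it "escapes".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for libgomp's GOMP_target (an assumption for
   illustration only): the real entry point takes more arguments and runs
   fn on an offloading device.  This mock just calls fn on the host.  */
typedef void (*offload_fn) (void *);

static int launches;  /* number of mock launches, for illustration */

__attribute__((noinline))
void mock_GOMP_target (offload_fn fn, void *data)
{
  assert (fn != NULL);
  ++launches;
  fn (data);  /* the only call site of the outlined function */
}

/* Outlined body of the target region, i.e. [block 2].  */
static void main_offloaded (void *data)
{
  int *x = data;
  *x += 1;
}

/* Plays the role of main in the example above.  */
int run_example (void)
{
  int x = 10;                             /* [block 1] */
  mock_GOMP_target (main_offloaded, &x);  /* address is passed on here */
  return x;                               /* [block 3] */
}
```

Calling run_example() once returns 11 and leaves launches at 1. Compared to the fun_s example, the difference is that the compiler sees no direct call from run_example to main_offloaded, only an address being passed to another function.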
Looking into this for a bit, I think I have identified two issues:

  - IPA/cgraph does not consider main_offloaded to be "local" (as in struct cgraph_local_info; handled/set in gcc/ipa-visibility.c:cgraph_node::local_p), so it does not even attempt to do such optimizations. (But, per my comment above, we should actually be able to treat main_offloaded at least similarly to a "static" function.)

  - If I hack gcc/ipa-visibility.c:cgraph_node::local_p to forcefully mark main_offloaded as "local", the node frequency propagation still doesn't work correctly, because no callers (cgraph_edge) have been registered for the respective cgraph_node. So gcc/ipa-profile.c:ipa_propagate_frequency doesn't have the data available that it needs in order to set the ipa_propagate_frequency_data, which will later be used to set the node frequency.

As it already has some support for gomp_parallel and gomp_task (which I understand to have similar semantics in that regard, compared to gomp_target), maybe extending gcc/cgraphbuild.c will help with the latter issue? (That bit of code handling gomp_* looks like its purpose is to "pass through" function addresses, such as main_offloaded, which are called through intermediary functions, such as GOMP_target, but I have not yet confirmed whether that's exactly what's being done there.)

What to do about the cgraph_node "local" bit for main_offloaded?

Am I generally on the right track?

Regards,
 Thomas