Hi,
On 10/14/25 18:28, Tobias Burnus wrote:
Josef Melcr wrote:
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed",
                    BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
-                   ATTR_NOTHROW_LIST)
+                   ATTR_CALLBACK_OACC_LIST)
Thus, I wonder whether this should be skipped - and handled
in lock step with the OpenMP/omp target support.
Oh, I totally missed that, thank you. Since the kernel has noclone,
the extra edges shouldn't really disrupt it, but I agree it should be
handled at once. Sorry about that. Should I exclude it and/or resend?
For me, just excluding it is enough. (You might want to send an email
once you have committed this patch - and you could attach the final commit
to that email.)
Okay. I was going to send that email anyway, I am just a bit anxious
since it's my first real commit, so I am just making sure. :)
* * *
From the other email thread:
The propagation is not considered profitable enough
OK, missed that. I guess for real-world code, it will work (and
possibly some tuning will make it work also without).
* * *
It might not; I've had issues with getting the pass to propagate before,
but I thought those were already fixed. Jakub's idea from the other
thread might make it work though.
could we bring the device code into LTO to optimize it further?
If you talk about optimizations on the whole program: the host LTO is
already run with the to-be-offloaded functions, but the
device-function stream out happens rather early.
If you are talking about device-side LTO: This requires some
reorganization of how it is handled and compiling the device-side
libraries as thick libraries (non-LTO and LTO code). Additionally, it
requires that the device-side linker supports the linker plugin. —
There are rather explicit plans on our side (BayLibre) to get this
working (at least for AMD GPUs/gcn) – as mentioned during the
offloading BoF, but it will take a while.
Otherwise, the device side already always sees all offload
functions – as all (host) TUs with offload LTO data are fed into the
same device-side lto1 compiler; "just" the libraries (libgfortran,
libgomp, libstdc++, …) are missing the LTO data (and some data we
might hide from LTO). — We also need to get rid of the force_output
flag; not because the functions shouldn't be output – but because when it is
set, some legit optimizations are disabled.
Oh I see. So till then, delaying the output does seem like the best
option. Thank you for clearing that up :)
Tobias
Best regards,
Josef