On Tue, 2013-05-07 at 12:46 +0200, Richard Biener wrote:
> On Tue, May 7, 2013 at 12:42 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
> > On Tue, May 7, 2013 at 11:02 AM, Tobias Burnus <bur...@net-b.de> wrote:
> >> Richard Biener wrote:
> >>> We're going to look at supporting HSA from GCC (which would make it
> >>> more or less trivial to also target OpenCL, I think)
> >>
> >> For the friends of link-time optimization (LTO):
> >>
> >> Unless I missed some fine point in OpenACC and OpenMP's target, they only
> >> work with directives which are locally visible. Thus, if one does a
> >> function call in the device/target section, the call can only be placed
> >> on the accelerator if the function can be inlined.
> >>
> >> Thus, it would be useful if LTO could be used to inline such a function
> >> into device code. I know one OpenACC code which calls functions in
> >> different translation units (TUs) - and the Cray compiler handles this
> >> via LTO. Thus, it would be great if the HSA/OpenMP target/OpenACC
> >> middle-end infrastructure could do likewise, which also means deferring
> >> the error that an external function cannot be used to the middle end/LTO
> >> FE rather than emitting it in the FE. - In the mentioned code, the called
> >> function does not have any OpenACC annotation but only consists of
> >> constructs which are permitted on the accelerator - thus, no automatic
> >> generation of accelerator code happens for that TU.
> >>
> >> (I just want to mention this to ensure that this kind of LTO/accelerator
> >> inlining is kept in mind when implementing the infrastructure for
> >> HSA/OpenACC/OpenMP target/OpenCL - even if cross-TU inlining is not
> >> supported initially.)
> >
> > In my view we'd get the "regular" OpenMP processing done during omp
> > lowering/expansion (which happens before LTO), which should mark the
> > generated worker functions appropriately. Emitting accelerator code should
> > then happen at LTRANS time, thus after all IPA inlining has taken place.
> > The interesting bits we can borrow from OMP are basically the marking of
> > functions that are a) interesting and b) possible to transform. Unmarked
> > functions/loops will have to go the autopar way, thus we have to prove via
> > dependence analysis that executing iterations in parallel is possible.
>
> Btw, we plan to re-use the GOMP runtime as otherwise any synchronisation
> between accelerator code and regular thread code is impossible.
I can't follow this line of reasoning. Can you elaborate? Which kind of
synchronization are you referring to?

As far as parallel execution and resource management are concerned, libgomp
provides just the kinds of schedulers needed for the OpenMP rule set.
Work-stealing schedulers such as Cilk's are another kind, and might actually
become the more common approach. And there are other thread pools that
programs might use; e.g., there's lots of discussion about all of this in ISO
C++ study group 1 on parallelism and concurrency, and several different
proposals. With that in mind, I'm wondering whether the cooperative
scheduling that we likely need should sit at a lower level than libgomp or
the Cilk runtime. Otherwise, libgomp needs to become the scheduler that runs
them all (that is, if you want it to work well when combined with other
abstractions for parallelism), and I'm not sure whether that's the right
approach.

> Which means changing the GOMP runtime in a way to be able to pass a
> descriptor which eventually has accelerator code (and a fallback regular
> function so you can disable accelerator usage at runtime).

It probably should be a list of different codes -- you might have more than
one suitable accelerator available. (I'll sketch at the end of this mail what
I have in mind.)

BTW: What about putting this topic on the Cauldron agenda? Is there still
time available to discuss what GCC might do regarding accelerators and HW
heterogeneity?

Torvald
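
P.S.: Here's the rough sketch I mentioned above. To be clear, none of the
type or function names below exist in libgomp; this is only a toy
illustration, under the assumption that each target region gets a descriptor
carrying a host fallback plus a list of per-accelerator code variants, and
that the runtime picks one of them (or the fallback) at run time.

/* Toy illustration only -- not an existing libgomp interface.  */

#include <stddef.h>
#include <stdio.h>

/* One code variant for a particular accelerator kind.  */
struct target_variant
{
  const char *isa;   /* e.g. "hsa", "ptx", "opencl" (hypothetical tags) */
  void *code;        /* handle to the device code blob */
};

/* Descriptor the compiler would emit for one target region.  */
struct target_region_desc
{
  void (*host_fallback) (void *data);  /* always present */
  size_t n_variants;
  const struct target_variant *variants;
};

/* Stand-in for a runtime query; pretend no accelerator is present.  */
static int
device_available (const char *isa)
{
  (void) isa;
  return 0;
}

/* Pick the first usable variant, otherwise run the host fallback.  */
static void
run_target_region (const struct target_region_desc *desc, void *data)
{
  for (size_t i = 0; i < desc->n_variants; i++)
    if (device_available (desc->variants[i].isa))
      {
        /* A real runtime would launch desc->variants[i].code here.  */
        printf ("offloading via %s\n", desc->variants[i].isa);
        return;
      }
  desc->host_fallback (data);  /* accelerator disabled or unavailable */
}

/* Host fallback for one example region.  */
static void
region1_host (void *data)
{
  int *n = data;
  printf ("host fallback, n = %d\n", *n);
}

int
main (void)
{
  static const struct target_variant region1_variants[] = {
    { "hsa", NULL },
    { "ptx", NULL },
  };
  struct target_region_desc region1 = { region1_host, 2, region1_variants };
  int n = 42;

  run_target_region (&region1, &n);
  return 0;
}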