On Mon, Aug 21, 2023 at 6:23 PM Tobias Burnus <tob...@codesourcery.com> wrote:
>
> RFC – and idea how to handle this best in GCC? See the two examples
> below for what we would like to support.
>
> * * *
>
> In GCC, we handle OpenMP (and OpenACC) by parsing the input file once,
> producing an internal representation (in LTO format) for offloading code
> and only at link time processing it by passing it via the LTO wrapper to
> the offloading-device compilers (mkoffload / device lto1).
> See https://gcc.gnu.org/wiki/Offloading
>
> This works okayish - even though it causes some issues like with
> metadirectives (they are implemented on the OG13 branch, however).
> And with declare variant or a nohost version, where getting rid of
> the host version is not that easy as it has to be in there until
> omp-offload.cc's functions are run, which comes rather late.
>
> There are currently already some issues like with -ffast-math
> and GLIBC's finite math functions, which are not available
> on the device side when using newlib's libm.
> (However, GLIBC has removed those.)
>
> Likewise, it would be nice to do like Clang+LLVM does: Auto-enable
> some device-specific math functions. (Albeit that won't work well
> with Fortran.)
>
>
> However, with OpenMP 5.1, there is a real issue. In 5.1, Appendix B
> it reads as:
> "For C/C++, the declare variant directive was extended to support elision
> of preprocessed code and to allow enclosed function definitions to be
> interpreted as variant functions (see Section 7.5)."
>
> The problem is the "elision of preprocessed" as it permits code like the
> following:
>
> #ifdef _OPENMP
> #pragma omp begin declare variant match(device={arch=NVPTX})
> #include "cuda/math.h"
> #pragma omp begin declare variant match(device={isa=sm70})
> #include "cuda/sm70/math.h"
> #pragma omp end declare variant
> #pragma omp end declare variant
> #pragma omp begin declare variant match(arch=AMD)
> #include "amdgpu/math.h"
> #pragma omp end declare variant
> #endif
>
> And such code needs to keep working if there is a '#define ABC ...' in
> one file and an '#ifndef ABC / #define ABC ...' in the other file.
>
> Additionally, it would be neat if it would handle target-specific defines
> like '#if __PTX_SM__ == 350' for the relevant parts (here: arch=nvptx).
> (We already do support context selectors via the gcc/config/*/t-omp-device
> files;
> see also https://gcc.gnu.org/onlinedocs/libgomp/OpenMP-Context-Selectors.html
> )
>
> Thoughts?

Err, so the OMP standard doesn't put any constraints on what to allow
inside the variants?  Is declare variant always at the top level?

> * * *
>
> The question is also what to support – "just" function declarations which
> are specific to a device or some generic replacement of the kind:
>
> #pragma omp begin declare variant match(device={arch=NVPTX})
> #define NUM_THREADS 128
> #pragma omp end declare variant
> #pragma omp begin declare variant match(device={arch=AMDGCN})
> #define NUM_THREADS 64
> #pragma omp end declare variant
>
> #ifndef NUM_THREADS
> #define NUM_THREADS 16
> #endif
>
> ...
> printf ("Running with %d threads\n", NUM_THREADS);
> #pragma omp parallel for num_threads(NUM_THREADS)
>
> * * *
>
> If we only handle 'begin/end declare variant', the following
> works in principle:
> - Parse the file once with only host-code parsing but
> - keep track of delimited 'omp begin declare variant'
>   where the context selector matches one of the supported
>   offload targets.
> - parse the file n-times again but this time set the
>   target #defines (extended version of gcc/config/*/t-omp-device
>   to make them available?)
> - When doing so, ignore all non-offloading bits (issue: implicit
>   'declare target' + have the data available for variant resolution).
> - Store this in some way.
>
> But it is not really clear to me how to do this in actual code.

But does that really help?  Consider

#ifdef _OPENMP
#pragma omp begin declare variant match(device={arch=NVPTX})
#include "cuda/math.h"
...
#pragma omp begin declare variant match(device={arch=NVPTX})
#include "conflicting with cuda/math.h"
...

or is there a constraint that "un-varianting" same-match variants
needs to produce a valid translation unit?  That is, don't you get
combinatorial explosion with sequenced variants?  Does the OMP
standard at all think of how the resulting C/C++ translation unit
is formed or does it simply take each variant as "finishing" a
TU after omp end declare variant?
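
To make the question concrete for the NUM_THREADS example quoted above,
here is a rough emulation with ordinary conditionals.  This is only a
sketch under assumptions: VARIANT_NVPTX and VARIANT_AMDGCN are made-up
stand-ins for "this pass has the nvptx / amdgcn variant active", nothing
GCC or OpenMP defines, and plain #ifdef is not the same thing as a
declare variant region.  It only shows that, with conventional
preprocessing, a #define made inside the region stays visible after it.

```c
#include <assert.h>

/* VARIANT_NVPTX / VARIANT_AMDGCN are hypothetical macros standing in
   for "the matching declare variant region is active in this pass".  */
#ifdef VARIANT_NVPTX
#define NUM_THREADS 128
#endif

#ifdef VARIANT_AMDGCN
#define NUM_THREADS 64
#endif

/* Host fallback: only reached when no region above defined the macro,
   i.e. the definition from an active region "leaks" down to here.  */
#ifndef NUM_THREADS
#define NUM_THREADS 16
#endif

static_assert(NUM_THREADS > 0, "NUM_THREADS must be positive");

/* Value that the rest of the TU would see in this particular pass.  */
int num_threads_for_this_pass(void)
{
    return NUM_THREADS;
}
```

Built without either macro this yields the host value; building the same
file with -DVARIANT_NVPTX or -DVARIANT_AMDGCN gives one result per
target, and defining both at once redefines NUM_THREADS, which is the
sequenced-variant conflict in miniature.
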
Thus do declarations leak out of the "active" variant into the
following parts of the C/C++ TU?

To me it really looks like a very badly designed feature, not to
mention that it involves the preprocessor ...

> Any suggestion?

Something like you propose.  I'd even do it "harder", inventing a new
omppd (openmp preprocessor driver) which will pre-parse a TU and
invoke several compiler instances (GCC drivers) with -fomp-variant=X
making only variants "X" active.  That doesn't really solve the issue
with sequenced variants unless there are constraints in the OMP spec
making that work.

It should be possible to have the separate compilers produce LTO bytecode
(for the offload target then) from the "same" C TU and combine them at
WPA time.  All the offload table handling might need to improve here
of course, but the omppd might produce enough meta data to help here.

That said, I really wouldn't try to fiddle "omppd" into the host
compiler parts, that doesn't sound fun for maintenance purposes.

Richard.

> Tobias
>
> PS: I would like to have some input before the Cauldron, but we might want
> to additionally discuss this in detail during the cauldron, possibly some
> brainstorming before the BoF and then surely also in the BoF.
>
> -----------------
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634
> München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas
> Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht
> München, HRB 106955