https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859
--- Comment #24 from Thorsten Kurth <thorstenkurth at me dot com> ---
Hello Jakub,

I know that the section you mean is racy and that the way it gets the number of threads is not right, but I put it in in order to see whether I get the correct numbers on a CPU (I am not working on a GPU yet, that will be next). Most of the defines for setting the number of teams and threads in the outer loop are there for experimenting with what works best; in the end they will be removed. This code is not finished by any means, it is a moving target and under active development. Only the OpenMP 3 version is considered done and works well.

You said that SIMD pragmas are missing, and that is for a reason. First of all, the code is memory-bandwidth bound and has a rather low arithmetic intensity, so vectorization does not help a lot. Of course vectorization helps in the sense that the loads and stores are vectorized and the prefetcher works more efficiently, but we made sure that the (Intel) compiler auto-vectorizes the inner loops nicely. Putting in explicit SIMD pragmas made the performance worse, because in that case the (Intel) compiler generates worse code in some cases (according to some Intel compiler people, this is because when the compiler sees a SIMD directive it will not try to partially unroll loops etc. and may generate more masks than necessary). So auto-vectorization works fine here and we have not revisited this issue. The GNU compiler might behave differently; I have not looked at what its auto-vectorizer does.

The more important questions I have are the following:

1) As you can see, the code has two levels of parallelism. On the CPU it is most efficient to tile the boxes (this is the loop with the target distribute) and then let one thread work on a box. I added another level of parallelism inside the box, because on the GPU you have more threads and might want to exploit more parallelism; at least this is what folks from IBM suggested at an OpenMP 4.5 hackathon. So my question is: with target teams distribute, will one team be equal to a CUDA warp, or will it be something bigger? In that case I would like to have one warp work on a box and not have different PTX threads work on individual boxes. To summarize: on the CPU the OpenMP threading should be such that one thread gets a box and the vectorization works on the inner loop (which is fine, that works), and in the CUDA case one team/warp should work on a box and then SIMT-parallelize the work inside the box (see the sketch after question 2).

2) Related to this: how does PTX behave when it sees a SIMD directive in a target region? Is it ignored or somehow interpreted? In any case, how does OpenMP map a CUDA warp to an OpenMP CPU thread, since that is the closest equivalence I would say? My guess is that it ignores SIMD pragmas and just acts on the thread level, where in the CUDA world one thread acts more or less like a SIMD lane on the CPU.
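To make question 1 concrete, here is roughly the structure I have in mind (just a sketch; nboxes, box_size, data and the loop body are placeholders, not the actual code):

    // One team (ideally one warp on the GPU) per box; the threads of the
    // team share the cells of the box.  On the CPU one thread takes a box
    // and the inner loop is left to the auto-vectorizer.
    void smooth(double *data, int nboxes, int box_size)
    {
      #pragma omp target teams distribute map(tofrom: data[0:nboxes*box_size])
      for (int b = 0; b < nboxes; ++b) {
        #pragma omp parallel for
        for (int i = 0; i < box_size; ++i)
          data[b*box_size + i] += 1.0;   // stand-in for the real stencil work
      }
    }

The question is whether, written like this, one team really maps to a warp on the PTX target or to something larger.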
3) This device mapping business is extremely verbose for C++ classes. For example, the MFIter instances (amfi, comfy, solnLmfi, whatever) are not correctly mapped yet and would cause trouble on the GPU (the Intel compiler complains that they are not bitwise copyable; GNU compiles it though). These are classes containing pointers to other classes, so in order to map them properly I would technically need to map the dereferenced data members of those member classes, correct? As an example, take a class with a std::vector<XXX>* data member. You technically need to map the vector's data() to the device, right? That, however, means you need to be able to access it, i.e. it should not be a protected class member. So what happens when you have a class which you cannot change but whose private/protected members you need to map? The example at hand is the MFIter class, which has this:

    protected:
      const FabArrayBase& fabArray;
      IntVect tile_size;
      unsigned char flags;
      int currentIndex;
      int beginIndex;
      int endIndex;
      IndexType typ;
      const Array<int>* index_map;
      const Array<int>* local_index_map;
      const Array<Box>* tile_array;
      void Initialize ();

It has these array pointers. To my knowledge (I do not know the code fully), these hold the indices that determine which global indices the iterator is in fact iterating over. This data can be shared among the threads; it is only read and never written. Nevertheless, the device needs to know the indices, so index_map etc. needs to be mapped. Now, Array is just a class with a public std::vector member, but in order to map the index_map member I would need to have access to it so that I can map the underlying std::vector's data. Do you know what I mean? How is this done in the most elegant way in OpenMP?
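To illustrate the kind of mapping I mean in 3), here is a minimal sketch of the two-step mapping I believe is needed for a pointer-to-vector member (Holder, Box and map_boxes are made-up names, not the real AMReX classes):

    #include <vector>
    #include <cstddef>

    struct Box { int lo[3], hi[3]; };      // stand-in for the real Box

    class Holder {                         // stand-in for Array/MFIter
    public:
      const std::vector<Box>* boxes;       // pointer-to-vector member
    };

    void map_boxes(Holder& h)
    {
      const Box*  buf = h.boxes->data();
      std::size_t n   = h.boxes->size();

      // 1) map the object itself (a bitwise copy; the pointer member is
      //    not usable on the device as-is)
      #pragma omp target enter data map(to: h)

      // 2) map the buffer that the vector owns, and use buf (not
      //    h.boxes->...) inside the target regions, because the vector
      //    object itself still lives on the host
      #pragma omp target enter data map(to: buf[0:n])
    }

This already requires access to the data() pointer, which is exactly the access problem with the protected members of MFIter described above.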