https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #24 from Thorsten Kurth <thorstenkurth at me dot com> ---
Hello Jakub,

I know that the section you mean is racy and that getting the wrong number of
threads there is not right, but I put it in to check whether I get the correct
numbers on a CPU (I am not working on a GPU yet, that will be next). Most of
the defines for setting the number of teams and threads in the outer loop are
there for experimenting with what works best; in the end they will be removed.
This code is not finished by any means, it is a moving target and under active
development. Only the OpenMP 3 version is considered done and works well.

You said that the SIMD pragmas are missing, and that is for a reason. First of
all, the code is memory-bandwidth bound, so it has a rather low arithmetic
intensity and vectorization does not help a lot. Of course vectorization helps
in the sense that the loads and stores are vectorized and the prefetcher works
more efficiently. But we made sure that the (Intel) compiler vectorizes the
inner loops nicely on its own. Putting in explicit SIMD pragmas made the code
perform worse, because in that case the (Intel) compiler generates worse code
in some cases (according to some Intel compiler folks, this is because when the
compiler sees a SIMD statement it will not try to partially unroll loops etc.
and might generate more masks than necessary). So auto-vectorization works fine
here and we have not revisited this issue. The GNU compiler might be different,
and I have not looked at what its auto-vectorizer does.
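
To make that concrete, here is a stripped-down sketch of the kind of inner loop
I mean (the kernel and all names are made up, not the actual code); the
commented-out pragma is the explicit SIMD hint we tried and then removed in
favor of auto-vectorization:

    // hypothetical 1D smoothing kernel, a stand-in for the real inner loops
    void smooth_line(int lo, int hi, const double* __restrict__ in,
                     double* __restrict__ out)
    {
        // #pragma omp simd   // explicit hint; hurt performance with icc in our tests
        for (int i = lo; i < hi; ++i) {
            out[i] = 0.5 * (in[i-1] + in[i+1]);   // auto-vectorizes fine as is
        }
    }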

The more important questions I have are the following:
1) As you can see, the code has two levels of parallelism. On the CPU it is
most efficient to tile the boxes (this is the loop with the target distribute)
and then let one thread work on a box. I added another level of parallelism
inside each box because on the GPU you have more threads and might want to
exploit more parallelism; at least that is what folks from IBM suggested at an
OpenMP 4.5 hackathon.
So my question is: when you have a target teams distribute, will one team be
equal to a CUDA warp, or will it be something bigger? In that case I would like
to have one warp work on a box rather than letting different PTX threads work
on individual boxes. To summarize: on the CPU the OpenMP threading should be
such that one thread gets a box and the vectorization works on the inner loop
(which is fine, that works), and in the CUDA case one team/warp should work on
a box and then SIMT-parallelize the work inside the box. A rough sketch of the
structure I mean follows below.
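
Something like this (nboxes, npoints and the per-cell update are placeholders,
not the real AMReX kernels): the outer teams distribute hands whole boxes to
teams, and the inner parallel for spreads the cells of one box over the threads
of that team:

    void smooth_boxes(double* data, int nboxes, int npoints)
    {
        #pragma omp target teams distribute map(tofrom: data[0:nboxes*npoints])
        for (int b = 0; b < nboxes; ++b) {
            // CPU: effectively one thread per box; GPU: ideally one team/warp per box
            #pragma omp parallel for
            for (int p = 0; p < npoints; ++p) {
                data[b*npoints + p] *= 0.5;   // stand-in for the real per-cell work
            }
        }
    }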

2) Related to this: how does PTX behave when it sees a SIMD statement inside a
target region? Is it ignored or somehow interpreted? In any case, how does
OpenMP map a CUDA warp to an OpenMP CPU thread, because that is the closest
equivalence I would say. I would guess it ignores SIMD pragmas and just acts on
the thread level, where in the CUDA world one thread more or less acts like a
SIMD lane on the CPU. A sketch of the situation I am asking about is below.
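
Again a made-up kernel, only to show the construct in question: an explicit
simd pragma on the inner loop of a target region. Does the PTX backend simply
drop the simd hint, or does each simd lane become a CUDA thread?

    void axpy_boxes(double* y, const double* x, double a, int nboxes, int npoints)
    {
        #pragma omp target teams distribute \
            map(tofrom: y[0:nboxes*npoints]) map(to: x[0:nboxes*npoints])
        for (int b = 0; b < nboxes; ++b) {
            #pragma omp parallel for simd
            for (int p = 0; p < npoints; ++p) {
                y[b*npoints + p] += a * x[b*npoints + p];
            }
        }
    }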

3) This device-mapping business is extremely verbose for C++ classes. For
example, the MFIter instances (amfi, comfy, solnLmfi and so on) are not
correctly mapped yet and would cause trouble on the GPU (the Intel compiler
complains that they are not bitwise copyable; GNU compiles it, though). These
are classes containing pointers to other classes. So in order to map one of
them properly I would technically need to map the dereferenced data members of
the member classes of the first class, correct? As an example, say you have a
class with a std::vector<XXX>* data member. You technically need to map the
vector's data() buffer to the device, right? That however means you need to be
able to access that member, i.e. it should not be a protected class member. So
what happens when you have a class which you cannot change but need to map
private/protected members of it? The example at hand is the MFIter class, which
has this:

protected:

    const FabArrayBase& fabArray;

    IntVect tile_size;

    unsigned char flags;
    int           currentIndex;
    int           beginIndex;
    int           endIndex;
    IndexType     typ;

    const Array<int>* index_map;
    const Array<int>* local_index_map;
    const Array<Box>* tile_array;

    void Initialize ();

It has these array pointers. So technically this is (to my knowledge, I do not
know the code fully) an array of indices which determines which global indices
the iterator is actually iterating over. This data can be shared among the
threads; it is only read and never written. Nevertheless, the device needs to
know the indices, so index_map etc. have to be mapped. Now, Array is just a
class with a public std::vector member. But in order to map the index_map class
member I would need access to it, so that I can map the underlying
std::vector's data. Do you know what I mean? What is the most elegant way to do
this in OpenMP? Below is a sketch of the kind of manual deep copy I have in
mind.
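
Payload and Holder here are made-up stand-ins for Array<int> and MFIter (with
the pointer member made public so the example compiles); the point is that to
use index_map on the device I have to reach through to the raw buffer behind
the std::vector myself, which only works if that member is accessible:

    #include <cstddef>
    #include <vector>

    struct Payload {               // plays the role of Array<int>
        std::vector<int> v;        // public std::vector member, as in Array
    };

    struct Holder {                // plays the role of MFIter, member made public
        const Payload* index_map;
    };

    int sum_on_device(const Holder& h)
    {
        const int*  buf = h.index_map->v.data();   // needs access to the member
        std::size_t n   = h.index_map->v.size();

        int s = 0;
        // map only the underlying buffer; the objects themselves are not
        // bitwise copyable, so mapping them directly would not help anyway
        #pragma omp target teams distribute parallel for reduction(+:s) \
            map(to: buf[0:n]) map(tofrom: s)
        for (std::size_t i = 0; i < n; ++i) {
            s += buf[i];
        }
        return s;
    }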
