[Bug libgomp/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #27 from Thorsten Kurth --- Hello Jakub, I wanted to follow up on this. Is there any progress on this issue? Best Regards Thorsten Kurth
[Bug c++/81850] New: OpenMP target enter data compilation issues
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81850 Bug ID: 81850 Summary: OpenMP target enter data compilation issues Product: gcc Version: 7.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: thorstenkurth at me dot com Target Milestone: --- Created attachment 41990 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41990&action=edit Test case Dear Sir/Madam, g++ 7.1.1 cannot compile correct OpenMP 4.5 code. I have attached a small example program that I initially developed to demonstrate a compiler bug in XLC. GCC emits the following error messages on compilation:

g++ -O2 -std=c++11 -fopenmp -foffload=nvptx-none -c aclass.cpp -o aclass.o
In file included from aclass.h:2:0, from aclass.cpp:1:
masterclass.h: In member function 'void master::allocate(const unsigned int&)':
masterclass.h:10:50: error: 'master::data' is not a variable in 'map' clause
 #pragma omp target enter data map(alloc: data[0:size*sizeof(double)])
                                           ^~~~
masterclass.h:10:9: error: '#pragma omp target enter data' must contain at least one 'map' clause
 #pragma omp target enter data map(alloc: data[0:size*sizeof(double)])
         ^~~
masterclass.h: In member function 'void master::deallocate()':
masterclass.h:15:51: error: 'master::data' is not a variable in 'map' clause
 #pragma omp target exit data map(release: data[:0])
                                            ^~~~
masterclass.h:15:9: error: '#pragma omp target exit data' must contain at least one 'map' clause
 #pragma omp target exit data map(always, release: data[:0])

To me it seems that it cannot recognize the "alloc" map type. Best Regards Thorsten Kurth
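For reference, the shape of the code behind these diagnostics is roughly the following. This is a sketch reconstructed from the error messages above, not the attached file; the size member and the new/delete bookkeeping are assumptions.

class master {
protected:
    double* data;
    unsigned int size;
public:
    void allocate(const unsigned int& n) {
        size = n;
        data = new double[n];
        // GCC 7.1.1 rejects the class member 'data' in the map clause here:
        #pragma omp target enter data map(alloc: data[0:size*sizeof(double)])
    }
    void deallocate() {
        #pragma omp target exit data map(release: data[:0])
        delete [] data;
    }
};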
[Bug c++/81896] New: omp target enter data not recognized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81896 Bug ID: 81896 Summary: omp target enter data not recognized Product: gcc Version: 7.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: thorstenkurth at me dot com Target Milestone: --- Created attachment 42005 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42005&action=edit small test case Dear Sir/Madam, I am not sure whether my report got posted the first time, because I cannot find it any more (I did not receive a notification about it and it is not marked as invalid anywhere). Therefore, I will post it again. It seems that gcc has problems with the omp target enter/exit data constructs. When I compile the attached code I get:

g++ -O2 -std=c++11 -fopenmp -foffload=nvptx-none -c aclass.cpp -o aclass.o
In file included from aclass.h:2:0, from aclass.cpp:1:
masterclass.h: In member function 'void master::allocate(const unsigned int&)':
masterclass.h:10:50: error: 'master::data' is not a variable in 'map' clause
 #pragma omp target enter data map(alloc: data[0:size*sizeof(double)])
                                           ^~~~
masterclass.h:10:9: error: '#pragma omp target enter data' must contain at least one 'map' clause
 #pragma omp target enter data map(alloc: data[0:size*sizeof(double)])
         ^~~
masterclass.h: In member function 'void master::deallocate()':
masterclass.h:15:51: error: 'master::data' is not a variable in 'map' clause
 #pragma omp target exit data map(release: data[:0])
                                            ^~~~
masterclass.h:15:9: error: '#pragma omp target exit data' must contain at least one 'map' clause
 #pragma omp target exit data map(release: data[:0])
         ^~~
make: *** [aclass.o] Error 1

The same code compiles fine with XLC. Best Regards Thorsten Kurth
[Bug c++/80859] New: Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 Bug ID: 80859 Summary: Performance Problems with OpenMP 4.5 support Product: gcc Version: 6.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: thorstenkurth at me dot com Target Milestone: --- Dear Sir/Madam, I am working on the Cori HPC system, a Cray XC40 with Intel Xeon Phi 7250. I have probably found a performance "bug" when using the OpenMP 4.5 target directives. It seems to me that the GNU compiler generates unnecessary move and push instructions when a #pragma omp target region is present but no offloading is used. I have attached a test case to illustrate the problem. Please compile the nested_test_omp_4dot5.x in the directory (don't be confused by the name, I am not using nested OpenMP here). Then go into the corresponding .cpp file, comment out the target-related directives (target teams and distribute), compile again, and compare the assembly code. The code with the target directives has more pushes and moves than the one without. I have also placed the output of that process in the directory; see the files ending in .as. The performance overhead is marginal here, but I am currently working on a Department of Energy performance portability project and I am exploring the usefulness of OpenMP 4.5. The code we are retargeting is a geometric multigrid solver in the BoxLib/AMReX framework, and there the overhead is significant. I could observe as much as a 10x slowdown accumulated throughout the app. That code is bigger, so I do not want to demonstrate it here, but I could send you an invitation to the github repo if requested. In my opinion, if no offloading is used, the compiler should simply ignore the target region statements and default to plain OpenMP. Please let me know what you think. Best Regards Thorsten Kurth National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory
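To make the comparison concrete, here is an illustrative sketch (not the attached test case; function and variable names are made up) of the two variants being compared: the same streaming loop once with plain OpenMP and once wrapped in target directives, with no offloading intended.

// Plain OpenMP version.
void scale_plain(double* x, long n, double a) {
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        x[i] *= a;
}

// Target-annotated version. The report's expectation is that a host-only
// build should lower this to essentially the same code as scale_plain.
void scale_target(double* x, long n, double a) {
    #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
    for (long i = 0; i < n; ++i)
        x[i] *= a;
}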
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #3 from Thorsten Kurth --- Created attachment 41414 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41414&action=edit OpenMP 4.5 Testcase This is the source code.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #4 from Thorsten Kurth --- Created attachment 41415 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41415&action=edit Testcase This is the test case. The files ending in .as contain the assembly code with and without the target region.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #5 from Thorsten Kurth --- To clarify the problem: I think that the additional movq, pushq and other instructions generated when using the target directive can cause a big performance hit. I understand that these instructions are necessary when offloading is used, but when I compile for the native architecture they should not be there. So maybe I am just missing a GNU compiler flag that disables offloading and lets the compiler ignore the target, teams and distribute directives at compile time while still honoring all the other OpenMP constructs. Is there a way to do that right now, and if not, could such a flag be added?
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #7 from Thorsten Kurth --- Hello Jakub, thanks for your comment, but I think the parallel for is not racy. Every thread works on a block of i-indices, so that is fine. The dotprod kernel is actually a kernel from the OpenMP standard documentation and I am sure that it is not racy. I do not see a problem with the example with the regions you mentioned either: by default everything is shared, so the variable is updated by all the threads/teams with the same value. The issue is that num_teams=1 is only true for the CPU; for a GPU it depends on the OS, driver, architecture and so on. Concerning splitting distribute and parallel: I tried both combinations and found that they behave the same. In the end I split them so that I could comment out the distribute part to see whether that makes a performance difference (and it does). I believe that the overhead instructions are responsible for the bad performance, because that is the only thing which distinguishes the target-annotated code from the plain OpenMP code. I used VTune to look at thread utilization and they look similar; the L1 and L2 hit rates are very close (100% vs 99% and 92% vs 89%) for the plain OpenMP and the target-annotated code. BUT the performance of the target-annotated code can be up to 10x worse. So I think there might be register spilling due to copying a large number of variables. If you like I can point you to the github repo code (BoxLib) which clearly exhibits this issue. This small test case only shows minor overhead of OpenMP 4.5 vs, say, OpenMP 3, but it clearly generates some additional overhead.
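The dotprod kernel referred to above comes from the OpenMP examples document; a sketch in that spirit (types and names here are illustrative, not the test case verbatim) looks like this:

double dotprod(const double* B, const double* C, long N) {
    double sum = 0.0;
    // Reduction over teams and threads; sum is mapped back explicitly.
    #pragma omp target teams map(to: B[0:N], C[0:N]) map(tofrom: sum) \
            reduction(+: sum)
    #pragma omp distribute parallel for reduction(+: sum)
    for (long i = 0; i < N; ++i)
        sum += B[i] * C[i];
    return sum;
}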
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #8 from Thorsten Kurth --- Here is the output of the get_num_threads section:

[tkurth@cori02 omp_3_vs_45_test]$ export OMP_NUM_THREADS=32
[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 32 threads.

and:

[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 12 threads.

I think the code is OK.
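For context, the query section that produces this output presumably looks something like the sketch below (my reconstruction, not the attached code; unlike the real test case, which a later comment admits is racy, this version guards the writes):

#include <cstdio>
#include <omp.h>

int main() {
    int teams = 0, threads = 0;
    #pragma omp target teams map(tofrom: teams, threads)
    #pragma omp parallel
    {
        // Only one thread of one team records the counts.
        if (omp_get_team_num() == 0 && omp_get_thread_num() == 0) {
            teams   = omp_get_num_teams();
            threads = omp_get_num_threads();
        }
    }
    std::printf("We got %d teams and %d threads.\n", teams, threads);
    return 0;
}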
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #9 from Thorsten Kurth --- Sorry, in the second run I set the number of threads to 12. I think the code works as expected.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #11 from Thorsten Kurth --- Hello Jakub, yes, you are right. I thought that map(tofrom:) is the default mapping, but I might be wrong. In any case, the number of teams is always 1. This code is basically just data streaming, so there is no need for a detailed performance analysis. When I timed the code (rather than profiling it), the OpenMP 4.5 version had a tiny bit more overhead, but nothing significant. However, we might nevertheless learn something from it. Best Thorsten
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #13 from Thorsten Kurth --- Hello Jakub, the compiler options are just -fopenmp. I am sure it does not have anything to do with vectorization, as I compare the code runtime with and without the target directives, so vectorization should be the same in both cases. The remaining OpenMP sections are the same. In our work we have not seen 10x slowdowns because of insufficient vectorization; it is usually because of cache locality, but that is the same for OpenMP 4.5 and OpenMP 3 because the loops are not touched. I do not specify an ISA, but I will try specifying KNL now and report what the compiler does. Best Thorsten
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #15 from Thorsten Kurth --- The code I care about definitely has optimization enabled. For the fortran stuff it does (for example): ftn -g -O3 -ffree-line-length-none -fno-range-check -fno-second-underscore -Jo/3d.gnu.MPI.OMP.EXE -I o/3d.gnu.MPI.OMP.EXE -fimplicit-none -fopenmp -I. -I../../Src/C_BoundaryLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I../../Src/C_BaseLib -I../../Src/C_BoundaryLib -I../../Src/C_BaseLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I/opt/intel/vtune_amplifier_xe_2017.2.0.499904/include -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/F_BaseLib -I../../Src/F_BaseLib -c ../../Src/LinearSolvers/F_MG/itsol.f90 -o o/3d.gnu.MPI.OMP.EXE/itsol.o Compiling cc_mg_tower_smoother.f90 ... and for the C++ stuff it does CC -g -O3 -std=c++14 -fopenmp -g -DCG_USE_OLD_CONVERGENCE_CRITERIA -DBL_OMP_FABS -DDEVID=0 -DNUM_TEAMS=1 -DNUM_THREADS_PER_BOX=1 -march=knl -DNDEBUG -DBL_USE_MPI -DBL_USE_OMP -DBL_GCC_VERSION='6.3.0' -DBL_GCC_MAJOR_VERSION=6 -DBL_GCC_MINOR_VERSION=3 -DBL_SPACEDIM=3 -DBL_FORT_USE_UNDERSCORE -DBL_Linux -DMG_USE_FBOXLIB -DBL_USE_F_BASELIB -DBL_USE_FORTRAN_MPI -DUSE_F90_SOLVERS -I. -I../../Src/C_BoundaryLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I../../Src/C_BaseLib -I../../Src/C_BoundaryLib -I../../Src/C_BaseLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I/opt/intel/vtune_amplifier_xe_2017.2.0.499904/include -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/F_BaseLib -I../../Src/F_BaseLib -c ../../Src/C_BaseLib/FPC.cpp -o o/3d.gnu.MPI.OMP.EXE/FPC.o Compiling Box.cpp ... But the kernels I care about are in C++.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #16 from Thorsten Kurth --- FYI, the code is: https://github.com/zronaghi/BoxLib.git in branch cpp_kernels_openmp4dot5, and then in Src/LinearSolvers/C_CellMG the file ABecLaplacian.cpp. For example, lines 542 and 543 can be commented in or out; when the test case is run, you get a significant slowdown with that code commented in. I did not map all the scalar variables, so that might be a problem. But in any case, in my opinion it should not create copies of them at all. Please don't look at that code in detail right now because it is a bit convoluted; I just wanted to show that this issue appears. So when the target section I mentioned above is commented in and I run the following script:

#!/bin/bash
export OMP_NESTED=false
export OMP_NUM_THREADS=64
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_MAX_ACTIVE_LEVELS=1
execpath="/project/projectdirs/mpccc/tkurth/Portability/BoxLib/Tutorials/MultiGrid_C"
exec=`ls -latr ${execpath}/main3d.*.MPI.OMP.ex | awk '{print $9}'`
#execute
${exec} inputs

I get this:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size : 1 (length unit)
Max_grid_size : 32 (cells)
Number of grids : 64
Sum of RHS : -2.68882138776405e-17
Solving with BoxLib C++ solver
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs= 135.516568492921
MultiGrid: Initial residual = 135.516568492921
MultiGrid: Iteration 1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration 2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration 3 resid/bnorm = 0.000551321916982188
MultiGrid: Iteration 4 resid/bnorm = 3.55014555643671e-05
MultiGrid: Iteration 5 resid/bnorm = 2.57082340920002e-06
MultiGrid: Iteration 6 resid/bnorm = 1.90970439886018e-07
MultiGrid: Iteration 7 resid/bnorm = 1.44525222814178e-08
MultiGrid: Iteration 8 resid/bnorm = 1.10675190626368e-09
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11 , Solve time: 5.84898591041565, CG time: 0.162226438522339
Converged res < eps_rel*max(bnorm,res_norm)
Run time : 5.98936820030212
Unused ParmParse Variables:
[TOP]::hypre.solver_flag(nvals = 1) :: [1]
[TOP]::hypre.pfmg_rap_type(nvals = 1) :: [1]
[TOP]::hypre.pfmg_relax_type(nvals = 1) :: [2]
[TOP]::hypre.num_pre_relax(nvals = 1) :: [2]
[TOP]::hypre.num_post_relax(nvals = 1) :: [2]
[TOP]::hypre.skip_relax(nvals = 1) :: [1]
[TOP]::hypre.print_level(nvals = 1) :: [1]
done.

When I comment it out and recompile, I get:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size : 1 (length unit)
Max_grid_size : 32 (cells)
Number of grids : 64
Sum of RHS : -2.68882138776405e-17
Solving with BoxLib C++ solver
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs= 135.516568492921
MultiGrid: Initial residual = 135.516568492921
MultiGrid: Iteration 1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration 2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration 3 resid/bnorm = 0.000551321916981978
MultiGrid: Iteration 4 resid/bnorm = 3.5501455563633e-05
MultiGrid: Iteration 5 resid/bnorm = 2.5708234090034e-06
MultiGrid: Iteration 6 resid/bnorm = 1.90970439781153e-07
MultiGrid: Iteration 7 resid/bnorm = 1.44525225042545e-08
MultiGrid: Iteration 8 resid/bnorm = 1.10675108045705e-09
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11 , Solve time: 0.759385108947754, CG time: 0.14183521270752
Converged res < eps_rel*max(bnorm,res_norm)
Run time : 0.879786014556885
Unused ParmParse Variables:
[TOP]::hypre.solver_flag(nvals = 1) :: [1]
[TOP]::hypre.pfmg_rap_type(nvals = 1) :: [1]
[TOP]::hypre.pfmg_relax_type(nvals = 1) :: [2]
[TOP]::hypre.num_pre_relax(nvals = 1) :: [2]
[TOP]::hypre.num_post_relax(nvals = 1) :: [2]
[TOP]::hypre.skip_relax(nvals = 1) :: [1]
[TOP]::hypre.print_level(nvals = 1) :: [1]
done.

That is roughly a 7.3x slowdown. The smoothing kernel (Gauss-Seidel red-black) is the most expensive kernel in the multigrid code, so I see the biggest effect here. But the other kernels (prolongation, restriction, dot products, etc.) have slowdowns as well, amounting to a total of more than 10x for the whole app.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #17 from Thorsten Kurth --- The result, though, is correct; I verified that both codes generate the correct output.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #19 from Thorsten Kurth --- Thank you very much. I am sorry that I do not have a simpler test case. The kernel which is executed is in the same directory as ABecLaplacian and is called MG_3D_cpp.cpp. We have seen similar problems with the Fortran kernels (they are scattered across multiple files), but the Fortran kernels and our C++ ports give the same performance with the original OpenMP parallelization. In any case, I wonder why the compiler honors the target region even when -march=knl is specified. However, please let me know if you have further questions. I can guide you through the code. The code base is big, but only two or three files are really relevant, and the relevant lines of code are not many either.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #20 from Thorsten Kurth --- To compile the code, edit the GNUmakefile to suit your needs (feel free to ask any questions). To run it, invoke the generated executable, which is called something like main3d.XXX..., where the XXX indicates whether you compiled with MPI, OpenMP, etc. There is an inputs file that you simply pass to it: ./main3d.. inputs That's it. Tell me if you need more info.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #22 from Thorsten Kurth --- Hello Jakub, that is code for Intel VTune. I have commented it out and added the NUM_TEAMS defines in the GNUmakefile. Please pull the latest changes. Best and thanks Thorsten
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #24 from Thorsten Kurth --- Hello Jakub, I know that the section you mean is racy and that reporting the wrong number of threads is not right, but I put it in to see whether I get the correct numbers on a CPU (I am not working on a GPU yet; that will be next). Most of the defines for setting the number of teams and threads in the outer loop are there for experimenting with what works best; in the end they will be removed. This code is not finished by any means; it is a moving target and under active development. Only the OpenMP 3 version is considered done and works well. You said that SIMD pragmas are missing, and that is for a reason. First of all, the code is memory bandwidth bound, so it has a rather low arithmetic intensity and vectorization does not help a lot. Of course vectorization helps in the sense that the loads and stores are vectorized and the prefetcher works more efficiently. But we made sure that the (Intel) compiler auto-vectorizes the inner loops nicely. Putting in explicit SIMD pragmas made the performance worse, because the (Intel) compiler then generates worse code in some cases (according to some Intel compiler engineers, this is because if the compiler sees a SIMD statement it will not try to partially unroll loops etc. and might generate more masks than necessary). So auto-vectorization works fine here and we have not revisited this issue. The GNU compiler might be different; I did not look at what its auto-vectorizer did. The more important questions I have are the following: 1) As you see, the code has two levels of parallelism. On the CPU, it is most efficient to tile the boxes (this is the loop with the target distribute) and then let one thread work on a box. I added another level of parallelism inside the box, because on the GPU you have more threads and might want to exploit more parallelism. At least, talking to folks from IBM at an OpenMP 4.5 hackathon, this is what they suggested. So my question is: with a target teams distribute, will one team be equal to a CUDA warp or will it be something bigger? In that case, I would like to have one warp working on a box and not have different PTX threads working on individual boxes. To summarize: on the CPU the OpenMP threading should be such that one thread gets a box and the vectorization works on the inner loop (which is fine, that works), and in the CUDA case one team/warp should work on a box and then SIMT-parallelize the work within the box. 2) Related to this: how does the PTX backend behave when it sees a SIMD statement in a target region? Is it ignored or somehow interpreted? In any case, how does OpenMP map between a CUDA warp and an OpenMP CPU thread, because that is the closest equivalence I would say. I would guess it ignores SIMD pragmas and just acts on the thread level, where in the CUDA world one thread more or less acts like a SIMD lane on the CPU. 3) This device mapping business is extremely verbose for C++ classes. For example, the MFIter instances (amfi, comfy, solnLmfi and so on) are not correctly mapped yet and would cause trouble on the GPU (the Intel compiler complains that they are not bitwise copyable; GNU compiles it, though). These are classes containing pointers to other classes. So in order to map that properly I would technically need to map the dereferenced data member of the member class of the first class, correct? As an example, say you have a class with a std::vector* data member.
You technically need to map the vector's data() buffer to the device, right? That, however, means you need to be able to access it, i.e. it should not be a protected class member. So what happens when you have a class which you cannot change but whose private/protected members you need to map? The example at hand is the MFIter class, which has this:

protected:
    const FabArrayBase& fabArray;
    IntVect tile_size;
    unsigned char flags;
    int currentIndex;
    int beginIndex;
    int endIndex;
    IndexType typ;
    const Array* index_map;
    const Array* local_index_map;
    const Array* tile_array;
    void Initialize ();

It has these array pointers. Technically this is (to my knowledge; I do not know the code fully) an array of indices which determines which global indices the iterator is in fact iterating over. This data can be shared among the threads; it is only read and never written. Nevertheless, the device needs to know the indices, so index_map etc. needs to be mapped. Now, Array is just a class with a public std::vector member. But in order to map the index_map class member I would need to have access to it, so that I can map the underlying std::vector data member. Do you know what I mean? How is this done in the most elegant way in OpenMP?
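For what it's worth, the verbose manual pattern being asked about looks roughly like this (my own illustration, not BoxLib code; the Array stand-in assumes a public std::vector member as described above). OpenMP 4.5 has no automatic deep copy, so one pulls out the raw buffer and maps an array section of it explicitly:

#include <vector>
#include <cstddef>

class Array {                       // stand-in for the BoxLib Array wrapper
public:
    std::vector<int> vec;           // the storage actually needed on the device
};

void use_on_device(const Array* index_map, double* out, long n) {
    const int*  idx = index_map->vec.data();   // local aliases: map clauses want
    std::size_t len = index_map->vec.size();   // plain pointers, not class internals

    #pragma omp target teams distribute parallel for \
            map(to: idx[0:len]) map(tofrom: out[0:n])
    for (long i = 0; i < n; ++i)
        out[i] += idx[i % (long)len];          // illustrative use of the mapped index table
}

If vec (or index_map itself) were protected, this would only work through an accessor or a friend declaration, which is exactly the verbosity being complained about.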
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #26 from Thorsten Kurth --- Hello Jakub, thanks for the clarification. So a team maps to a CTA, which is somewhat equivalent to a block in CUDA language, correct? And it is good to have some categorical equivalence between GPU and CPU code (SIMD units <-> warps) instead of mapping SIMT threads to OpenMP threads; that makes it easier to keep the code portable. About my mapping "problem": is there an elegant way of doing this, or does only brute force work, i.e. writing additional member functions that return pointers, etc.? In general, the OpenMP mapping business is very verbose (not your fault, I know); it makes the code very annoying to read. Best Thorsten
[Bug c++/82629] New: OpenMP 4.5 Target Region mangling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82629 Bug ID: 82629 Summary: OpenMP 4.5 Target Region mangling problem Product: gcc Version: 7.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: thorstenkurth at me dot com Target Milestone: --- Dear Sir/Madam, I run into linking issues with gcc (GCC) 7.1.1 20170718 and OpenMP 4.5 target offloading. I am compiling a mixed fortran/C++ code where target regions can be in both source files. The final linking stage fails with the following error message: mpic++ -g -O3 -std=c++11 -fopenmp -foffload=nvptx-none -DCG_USE_OLD_CONVERGENCE_CRITERIA -DBL_OMP_FABS -DNDEBUG -DBL_USE_MPI -DBL_USE_OMP -DBL_GCC_VERSION='7.1.1' -DBL_GCC_MAJOR_VERSION=7 -DBL_GCC_MINOR_VERSION=1 -DBL_SPACEDIM=3 -DBL_FORT_USE_UNDERSCORE -DBL_Linux -DMG_USE_FBOXLIB -DBL_USE_F_BASELIB -DBL_USE_FORTRAN_MPI -DUSE_F90_SOLVERS -I. -I../../Src/C_BoundaryLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I../../Src/C_BaseLib -I../../Src/C_BoundaryLib -I../../Src/C_BaseLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/F_BaseLib -I../../Src/F_BaseLib -L. -o main3d.gnu.MPI.OMP.ex o/3d.gnu.MPI.OMP.EXE/main.o o/3d.gnu.MPI.OMP.EXE/writePlotFile.o o/3d.gnu.MPI.OMP.EXE/FabSet.o o/3d.gnu.MPI.OMP.EXE/BndryRegister.o o/3d.gnu.MPI.OMP.EXE/Mask.o o/3d.gnu.MPI.OMP.EXE/MultiMask.o o/3d.gnu.MPI.OMP.EXE/BndryData.o o/3d.gnu.MPI.OMP.EXE/InterpBndryData.o o/3d.gnu.MPI.OMP.EXE/MacBndry.o o/3d.gnu.MPI.OMP.EXE/ABecLaplacian.o o/3d.gnu.MPI.OMP.EXE/CGSolver.o o/3d.gnu.MPI.OMP.EXE/LinOp.o o/3d.gnu.MPI.OMP.EXE/Laplacian.o o/3d.gnu.MPI.OMP.EXE/MultiGrid.o o/3d.gnu.MPI.OMP.EXE/ABec2.o o/3d.gnu.MPI.OMP.EXE/ABec4.o o/3d.gnu.MPI.OMP.EXE/BoxLib.o o/3d.gnu.MPI.OMP.EXE/ParmParse.o o/3d.gnu.MPI.OMP.EXE/Utility.o o/3d.gnu.MPI.OMP.EXE/UseCount.o o/3d.gnu.MPI.OMP.EXE/DistributionMapping.o o/3d.gnu.MPI.OMP.EXE/ParallelDescriptor.o o/3d.gnu.MPI.OMP.EXE/VisMF.o o/3d.gnu.MPI.OMP.EXE/Arena.o o/3d.gnu.MPI.OMP.EXE/BArena.o o/3d.gnu.MPI.OMP.EXE/CArena.o o/3d.gnu.MPI.OMP.EXE/OMPArena.o o/3d.gnu.MPI.OMP.EXE/NFiles.o o/3d.gnu.MPI.OMP.EXE/FabConv.o o/3d.gnu.MPI.OMP.EXE/FPC.o o/3d.gnu.MPI.OMP.EXE/Box.o o/3d.gnu.MPI.OMP.EXE/IntVect.o o/3d.gnu.MPI.OMP.EXE/IndexType.o o/3d.gnu.MPI.OMP.EXE/Orientation.o o/3d.gnu.MPI.OMP.EXE/Periodicity.o o/3d.gnu.MPI.OMP.EXE/RealBox.o o/3d.gnu.MPI.OMP.EXE/BoxList.o o/3d.gnu.MPI.OMP.EXE/BoxArray.o o/3d.gnu.MPI.OMP.EXE/BoxDomain.o o/3d.gnu.MPI.OMP.EXE/FArrayBox.o o/3d.gnu.MPI.OMP.EXE/IArrayBox.o o/3d.gnu.MPI.OMP.EXE/BaseFab.o o/3d.gnu.MPI.OMP.EXE/MultiFab.o o/3d.gnu.MPI.OMP.EXE/iMultiFab.o o/3d.gnu.MPI.OMP.EXE/FabArray.o o/3d.gnu.MPI.OMP.EXE/CoordSys.o o/3d.gnu.MPI.OMP.EXE/Geometry.o o/3d.gnu.MPI.OMP.EXE/MultiFabUtil.o o/3d.gnu.MPI.OMP.EXE/BCRec.o o/3d.gnu.MPI.OMP.EXE/PhysBCFunct.o o/3d.gnu.MPI.OMP.EXE/PlotFileUtil.o o/3d.gnu.MPI.OMP.EXE/BLProfiler.o o/3d.gnu.MPI.OMP.EXE/BLBackTrace.o o/3d.gnu.MPI.OMP.EXE/MemPool.o o/3d.gnu.MPI.OMP.EXE/MGT_Solver.o o/3d.gnu.MPI.OMP.EXE/FMultiGrid.o o/3d.gnu.MPI.OMP.EXE/MultiFab_C_F.o o/3d.gnu.MPI.OMP.EXE/backtrace_c.o o/3d.gnu.MPI.OMP.EXE/fabio_c.o o/3d.gnu.MPI.OMP.EXE/timer_c.o o/3d.gnu.MPI.OMP.EXE/BLutil_F.o o/3d.gnu.MPI.OMP.EXE/BLParmParse_F.o o/3d.gnu.MPI.OMP.EXE/BLBoxLib_F.o o/3d.gnu.MPI.OMP.EXE/BLProfiler_F.o o/3d.gnu.MPI.OMP.EXE/INTERPBNDRYDATA_3D.o o/3d.gnu.MPI.OMP.EXE/LO_UTIL.o o/3d.gnu.MPI.OMP.EXE/ABec_3D.o 
o/3d.gnu.MPI.OMP.EXE/ABec_UTIL.o o/3d.gnu.MPI.OMP.EXE/LO_3D.o o/3d.gnu.MPI.OMP.EXE/LP_3D.o o/3d.gnu.MPI.OMP.EXE/MG_3D.o o/3d.gnu.MPI.OMP.EXE/ABec2_3D.o o/3d.gnu.MPI.OMP.EXE/ABec4_3D.o o/3d.gnu.MPI.OMP.EXE/COORDSYS_3D.o o/3d.gnu.MPI.OMP.EXE/FILCC_3D.o o/3d.gnu.MPI.OMP.EXE/BaseFab_nd.o o/3d.gnu.MPI.OMP.EXE/threadbox.o o/3d.gnu.MPI.OMP.EXE/MultiFabUtil_3d.o o/3d.gnu.MPI.OMP.EXE/mempool_f.o o/3d.gnu.MPI.OMP.EXE/compute_defect.o o/3d.gnu.MPI.OMP.EXE/coarsen_coeffs.o o/3d.gnu.MPI.OMP.EXE/mg_prolongation.o o/3d.gnu.MPI.OMP.EXE/ml_prolongation.o o/3d.gnu.MPI.OMP.EXE/cc_mg_cpp.o o/3d.gnu.MPI.OMP.EXE/cc_applyop.o o/3d.gnu.MPI.OMP.EXE/cc_ml_resid.o o/3d.gnu.MPI.OMP.EXE/cc_smoothers.o o/3d.gnu.MPI.OMP.EXE/cc_stencil.o o/3d.gnu.MPI.OMP.EXE/cc_stencil_apply.o o/3d.gnu.MPI.OMP.EXE/cc_stencil_fill.o o/3d.gnu.MPI.OMP.EXE/cc_interface_stencil.o o/3d.gnu.MPI.OMP.EXE/cc_mg_tower_smoother.o o/3d.gnu.MPI.OMP.EXE/itsol.o o/3d.gnu.MPI.OMP.EXE/mg.o o/3d.gnu.MPI.OMP.EXE/mg_tower.o o/3d.gnu.MPI.OMP.EXE/ml_cc.o o/3d.gnu.MPI.OMP.EXE/ml_nd.o o/3d.gnu.MPI.OMP.EXE/ml_norm.o o/3d.gnu.MPI.OMP.EXE/tridiag.o o/3d.gnu.MPI.OMP.EXE/nodal_mg_cpp.o o/3d.gnu.MPI.OMP.EXE/nodal_mask.o o/3d.gnu.MPI.OMP.EXE/nodal_divu.o o/3d.gnu.MPI.OMP.EXE/nodal_interface_stencil.o o/3d.gnu.MPI.OMP.EXE/nodal_newu.o o/3d.gnu.MPI.OMP.EXE/nodal_s
[Bug c++/81896] omp target enter data not recognized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81896 --- Comment #1 from Thorsten Kurth --- Hello, is this report actually being worked on? It has been in UNCONFIRMED state for quite a while now. Best Regards Thorsten Kurth
[Bug libgomp/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #28 from Thorsten Kurth --- Hello, can someone please give me an update on this bug? Best Regards Thorsten Kurth
[Bug c++/81896] omp target enter data not recognized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81896 --- Comment #2 from Thorsten Kurth --- Hello, another data point: when I introduce a dummy variable, it works; for example, aliasing data to tmp and then using tmp in the map clause. I think the original form fails for the same reason that one cannot arbitrarily put class member variables into OpenMP clauses. Best Regards Thorsten Kurth
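Applied to the allocate() sketch shown after the first report of this issue above, the workaround described here would look roughly like this (still a sketch, not the attached file):

void master::allocate(const unsigned int& n) {
    size = n;
    data = new double[n];
    double* tmp = data;   // plain local alias instead of the class member
    #pragma omp target enter data map(alloc: tmp[0:size*sizeof(double)])
}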
[Bug c++/82629] OpenMP 4.5 Target Region mangling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82629 --- Comment #2 from Thorsten Kurth --- Created attachment 42420 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42420&action=edit This is the test case demonstrating the problem. Linking this code will produce:

-bash-4.2$ make main.x
g++ -O2 -std=c++11 -fopenmp -foffload=nvptx-none -c aclass.cpp -o aclass.o
g++ -O2 -std=c++11 -fopenmp -foffload=nvptx-none -c bclass.cpp -o bclass.o
g++ aclass.o bclass.o -o main.x
lto1: fatal error: aclass.o: section _ZN6master4copyERKS_$_omp_fn$1 is missing
compilation terminated.
mkoffload: fatal error: powerpc64le-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /autofs/nccs-svm1_sw/summitdev/gcc/7.1.1-20170802/bin/../libexec/gcc/powerpc64le-unknown-linux-gnu/7.1.1//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: lto-wrapper failed
/usr/bin/sha1sum: main.x: No such file or directory
collect2: error: ld returned 1 exit status
make: *** [main.x] Error 1

But looking at the object in question shows:

-bash-4.2$ nm aclass.o
U .TOC.
d .offload_func_table
d .offload_var_table
U GOMP_parallel
U GOMP_target_enter_exit_data
U GOMP_target_ext
U GOMP_teams
0350 T _ZN6aclass4copyERKS_
0250 T _ZN6aclass8allocateERKj
0130 t _ZN6master4copyERKS_._omp_fn.0
t _ZN6master4copyERKS_._omp_fn.1
d _ZZN6master10deallocateEvE18.omp_data_kinds.20
b _ZZN6master10deallocateEvE18.omp_data_sizes.19
0002 d _ZZN6master4copyERKS_E18.omp_data_kinds.11
0008 d _ZZN6master4copyERKS_E18.omp_data_sizes.10
U _ZdaPv
U _Znam
U __cxa_throw_bad_array_new_length
0001 C __gnu_lto_v1
U omp_get_num_teams
U omp_get_num_threads
U omp_get_team_num
U omp_get_thread_num

The function is actually there. Best Regards Thorsten Kurth
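From the mangled names in the nm listing, the structure of the test case is presumably something like the sketch below (a reconstruction; the real attachment may differ in detail): a master base class whose member functions containing target regions live in a header that is included from both aclass.cpp and bclass.cpp.

// masterclass.h (sketch)
class master {
protected:
    double* data;
    unsigned int size;
public:
    void copy(const master& other) {           // mangles to _ZN6master4copyERKS_
        double*       dst = data;
        const double* src = other.data;
        #pragma omp target teams distribute parallel for \
                map(to: src[0:size]) map(tofrom: dst[0:size])
        for (unsigned int i = 0; i < size; ++i)
            dst[i] = src[i];
    }
};

// aclass.h / bclass.h (sketch): both translation units include masterclass.h,
// so both object files carry copies of master::copy and its ._omp_fn.* helpers,
// as visible in the nm listing above.
class aclass : public master { /* ... */ };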
[Bug c++/82629] OpenMP 4.5 Target Region mangling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82629 --- Comment #3 from Thorsten Kurth --- One more thing: in the test case I sent, please change the $(XPPFLAGS) in the main.x target compilation to $(CXXFLAGS), so that -fopenmp is also used at link time. That does not solve the problem, but it makes the Makefile more correct (the XPPFLAGS was a remnant of something I tried out earlier). Sorry for that. Best Regards Thorsten Kurth
[Bug c++/82629] OpenMP 4.5 Target Region mangling problem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82629 --- Comment #4 from Thorsten Kurth --- Hello Richard, Was the test case received? Best Regards Thorsten Kurth
[Bug c/60101] New: Long compile times when mixed complex floating point datatypes are used in lengthy expressions
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60101 Bug ID: 60101 Summary: Long compile times when mixed complex floating point datatypes are used in lengthy expressions Product: gcc Version: 4.8.2 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: thorstenkurth at me dot com Created attachment 32071 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32071&action=edit Archive which includes test case. In the attached example, the double.c file compiles instantly whereas the float.c file takes very long. This is a truncated version of an even longer file (more lines of code in the loop), and the compile time for the float.c file grows like N^3, where N is the number of lines. Here is the output of the long version for 4.8.2:

0x40ae17 do_spec_1 ../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5269
0x40ae17 do_spec_1 ../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5269
0x40c875 process_brace_body ../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5872
0x40c875 process_brace_body ../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5872
0x40c875 handle_braces ../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5786
0x40c875 handle_braces ../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5786
0x40ae17 do_spec_1 ../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5269
0x40c875 process_brace_body ../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5872

and more messages like that. Both attached files compile, but float.c takes significantly longer. The only difference between the two files is that the temporary variable sum is double complex in the fast version and float complex in the slow version. So I guess the compiler tries to reorganize the complex multiplications and additions so that intermediate floating-point results can be reused (this is just a guess). Both files compile almost instantly with icc (>= 11.0) and clang/LLVM. It also works with gcc <= 4.4.
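For illustration, the pattern described reads roughly like the following sketch (these are not the attached double.c/float.c, which remain the authoritative test case; in the real files the chain of multiply-add lines is much longer):

/* Mixed complex floating-point types in a long expression chain. */
#include <complex.h>

void sum_in_double(const float complex *a, const float complex *b,
                   float complex *out, int n)
{
    for (int i = 0; i < n; i++) {
        double complex sum = 0.0;   /* double complex accumulator: compiles quickly */
        sum += a[i] * b[i];
        sum += a[i] * conjf(b[i]);
        /* ... many more lines of the same shape ... */
        out[i] = sum;
    }
}

void sum_in_float(const float complex *a, const float complex *b,
                  float complex *out, int n)
{
    for (int i = 0; i < n; i++) {
        float complex sum = 0.0f;   /* float complex accumulator: reported to compile very slowly with gcc 4.8.2 */
        sum += a[i] * b[i];
        sum += a[i] * conjf(b[i]);
        /* ... many more lines of the same shape ... */
        out[i] = sum;
    }
}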