https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93591
Bug ID: 93591
Summary: Bad number of threads and place management on Power-9 (with OpenBLAS)
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: jeromerichard111 at msn dot com
CC: jakub at gcc dot gnu.org
Target Milestone: ---

Created attachment 47781
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47781&action=edit
Code used to reproduce the bug

Hello,

I benchmarked the following simple dgemm call using OpenBLAS (commit 8d2a796) with 4096x4096 matrices (thus n=4096, and a, b and c are matrices) on an IBM LC922 machine with 2 POWER-9 processors (each with 22 cores and 88 hardware threads) with GCC-8.3.0:

    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, a, n, b, n, 1.0, c, n);

Performance is very bad in some cases: the number of threads actually created is always one with GCC-8.3.0 when OMP_PLACES is not set to "cores(...)". This is not the case with Clang-9.0, where the number of threads created is correct. The problem can also be reproduced with GCC-9.2.1.

Looking at the OMP_DISPLAY_ENV output when OMP_PLACES="cores(8)" (a configuration that creates multiple threads and not just one), we can see that:

    OMP_PLACES = '{0:4},{4:4},{8:4},{12:4},{16:4},{20:4},{24:4},{28:4}'

This configuration gives good performance, while the following one does not, as only one thread is created:

    OMP_PLACES = '{0},{4},{8},{12},{16},{20},{24},{28}'

And surprisingly this one is fine (multiple threads are created):

    OMP_PLACES = '{0:2},{4},{8},{12},{16},{20},{24},{28}'

Thus, the place of the first thread matters to libgomp and, strangely, determines whether only one thread is created. I think this is most probably an issue in libgomp and not in GCC itself.

All tests were run on an Ubuntu 18.04.1 system.
Here is the command used to compile the basic example code:

    g++ -O3 -mcpu=native -ffast-math main.cpp -I./OpenBLAS -L./OpenBLAS -lopenblas -fopenmp

Here is an example of results (with only 8 threads placed on 8 different cores):

    $ OMP_NUM_THREADS=8 OMP_PLACES="{0:2},{4},{8},{12},{16},{20},{24},{28}" OMP_PROC_BIND=TRUE ./a.out
    167.602 Gflops (time: 0.820032 s)

    $ OMP_NUM_THREADS=8 OMP_PLACES="{0},{4},{8},{12},{16},{20},{24},{28}" OMP_PROC_BIND=TRUE ./a.out
    22.4853 Gflops (time: 6.11239 s)

Without the issue, performance should reach 550~600 Gflops on this machine. When the issue occurs, only about 23 Gflops is obtained.

More details can be seen here: https://github.com/xianyi/OpenBLAS/issues/2380