https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71781
Bug ID: 71781
Summary: Severe performance degradation of short parallel for loop on hardware with lots of cores
Product: gcc
Version: 5.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: vflashm at gmail dot com
CC: jakub at gcc dot gnu.org
Target Milestone: ---

Created attachment 38842
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38842&action=edit
  Test source

Description:

My code contains a short, fixed-size parallel for loop. I am running it on 40-thread hardware. With default settings it runs millions of times slower than the non-parallel version. It works just fine if I explicitly specify the number of threads.

Test hardware:
  Xeon E7-4870 server: 2 CPUs, 20 cores, 40 threads.
  Linux 2.6.32-642.1.1.el6.x86_64 #1 SMP

Test source:

  int main() {
      for (int i = 0; i < 1000; ++i) {
          #pragma omp parallel for
          for (int j = 0; j < 4; ++j) {
          }
      }
  }

Command lines:

  g++ -fopenmp -Wall test.cpp
  for i in $(seq 50); do echo -n "threads=$i "; OMP_NUM_THREADS=$i /usr/bin/time -f'time=%e' ./a.out; done

Results:

  threads=1 time=0.00
  threads=2 time=0.00
  threads=3 time=0.00
  threads=4 time=0.00
  threads=5 time=0.00
  threads=6 time=0.00
  threads=7 time=0.00
  threads=8 time=0.00
  threads=9 time=0.00
  threads=10 time=0.00
  threads=11 time=0.01
  threads=12 time=0.01
  threads=13 time=0.01
  threads=14 time=0.01
  threads=15 time=0.01
  threads=16 time=0.01
  threads=17 time=0.01
  threads=18 time=0.01
  threads=19 time=0.01
  threads=20 time=0.01
  threads=21 time=0.01
  threads=22 time=0.01
  threads=23 time=0.02
  threads=24 time=0.01
  threads=25 time=0.01
  threads=26 time=0.01
  threads=27 time=0.01
  threads=28 time=0.01
  threads=29 time=0.01
  threads=30 time=0.02
  threads=31 time=0.02
  threads=32 time=0.02
  threads=33 time=0.02
  threads=34 time=0.02
  threads=35 time=0.02
  threads=36 time=0.02
  threads=37 time=0.03
  threads=38 time=2.49  <== HERE
  threads=39 time=3.84  <== HERE
  threads=40 time=3.85  <== HERE
  threads=41 time=0.14
  threads=42 time=0.14
  threads=43 time=0.16
  threads=44 time=0.15
  threads=45 time=0.14
  threads=46 time=0.16
  threads=47 time=0.16
  threads=48 time=0.16
  threads=49 time=0.18
  threads=50 time=0.17

Note that when the thread count gets close to the hardware capacity, the 1000 iterations start to take about 4 seconds, which is millions of times slower than they should.

Reproduced with:
  g++ 4.8.1
  g++ 4.9.1
  g++ 4.9.3
  g++ 5.3.0

Apparent reasons:

In short: barrier thrashing in the spinlock.

Profiling shows that CPU time is split roughly 50/50 between:
  gomp_barrier_wait_end
  gomp_team_barrier_wait_end

These appear to be the barriers before/after task execution and after gomp_finish_task. Most of the threads appear to be spinlocking. I am not quite sure why it is SO MUCH slower. My guess is that 36 spinlocking threads waiting for 4 working threads starve them on hardware with only 20 real (non-hyper-threaded) cores.

I checked strace too. With 4 threads it consists of 2000 FUTEX_WAKE calls, which is expected. With 40 threads it is usually something like 350 FUTEX_WAIT and 850 FUTEX_WAKE calls. I do not really understand how there can be fewer than 1000 FUTEX_WAKE calls (at least gomp_barrier_wait_end seems to be unconditional), so I might be missing something about how barriers are implemented in gomp.

Note that performance gets MUCH better at 41 threads. That is because having more threads than the hardware thread count changes the spinlock behavior. Manually setting GOMP_SPINCOUNT=10 fixes performance as well.
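For reference, this is roughly what the two workarounds look like (the num_threads clause is just an equivalent in-source way of fixing the team size; the measurements above were taken with the environment variables):

  // Cap the team size at the iteration count in the source ...
  #pragma omp parallel for num_threads(4)
  for (int j = 0; j < 4; ++j) {
  }

  # ... or work around it from the environment, without recompiling:
  OMP_NUM_THREADS=4 ./a.out
  GOMP_SPINCOUNT=10 ./a.out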
Solutions:

1) Quick and easy: in gomp_parallel_loop_start, calculate the number of iterations and pass it to gomp_resolve_num_threads (a sketch of the iteration-count calculation follows this list).

2) Better, but changes internal gomp interfaces: calculate the number of iterations at compile time when possible, then pass it to gomp_parallel_loop_start and on to gomp_resolve_num_threads. If that is not possible, calculate it at run time.

3) Verify the spinlock behavior: it might be better to reduce the spinlock iteration count when the number of managed threads is higher than the number of real cores, and reduce it even further if it is higher than the number of hardware threads (hyper-threaded). Not sure about that, though.
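To make solutions 1/2 concrete, here is a minimal standalone sketch of the iteration-count calculation only; the helper name is hypothetical, and the real gomp_parallel_loop_start / gomp_resolve_num_threads interfaces may take different arguments than assumed here.

  /* Sketch only: compute how many iterations a canonical OpenMP loop
     (start; end; incr) will execute.  In libgomp this value would be
     computed in (or passed into) gomp_parallel_loop_start and handed to
     gomp_resolve_num_threads instead of leaving the count unspecified.  */
  #include <stdio.h>

  static long
  loop_iteration_count (long start, long end, long incr)
  {
    long count = 0;
    if (incr > 0 && start < end)
      count = (end - start + incr - 1) / incr;
    else if (incr < 0 && start > end)
      count = (start - end - incr - 1) / -incr;
    return count;   /* 0 means "no iterations" (or unknown if incr == 0) */
  }

  int main (void)
  {
    /* The loop from the test case: for (int j = 0; j < 4; ++j)  */
    printf ("%ld\n", loop_iteration_count (0, 4, 1));    /* prints 4  */
    /* A descending loop: for (long j = 100; j > 0; j -= 7)  */
    printf ("%ld\n", loop_iteration_count (100, 0, -7)); /* prints 15 */
    return 0;
  }

With the count known, gomp_resolve_num_threads could clamp the team size (to 4 in the test case above), so the remaining hardware threads never enter the barrier in the first place.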