https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71781

            Bug ID: 71781
           Summary: Severe performance degradation of short parallel for
                    loop on hardware with lots of cores
           Product: gcc
           Version: 5.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vflashm at gmail dot com
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Created attachment 38842
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38842&action=edit
Test source

Description:

My code contains a short fixed-size parallel for loop.
I'm running it on 40-thread hardware.
With default settings it runs millions of times slower than the non-parallel version.
It works just fine if I explicitly specify the number of threads.


Test hardware:

Xeon E7-4870 server.
2 CPUs, 20 cores, 40 threads.
Linux 2.6.32-642.1.1.el6.x86_64 #1 SMP


Test source:

int main() {
    for (int i = 0; i < 1000; ++i) {
        #pragma omp parallel for
        for (int j = 0; j < 4; ++j) {
        }
    }
}
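
The same clamping can presumably also be done in the source with a num_threads
clause instead of OMP_NUM_THREADS (a variant of the test above; the value 4 is
only an illustrative choice):

int main() {
    for (int i = 0; i < 1000; ++i) {
        // Clamping the team size avoids spinning up a full 40-thread team
        // for a 4-iteration loop.
        #pragma omp parallel for num_threads(4)
        for (int j = 0; j < 4; ++j) {
        }
    }
}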


Command lines:

g++ -fopenmp -Wall test.cpp
for i in $(seq 50); do
    echo -n "threads=$i "
    OMP_NUM_THREADS=$i /usr/bin/time -f'time=%e' ./a.out
done

Results:
threads=1 time=0.00
threads=2 time=0.00
threads=3 time=0.00
threads=4 time=0.00
threads=5 time=0.00
threads=6 time=0.00
threads=7 time=0.00
threads=8 time=0.00
threads=9 time=0.00
threads=10 time=0.00
threads=11 time=0.01
threads=12 time=0.01
threads=13 time=0.01
threads=14 time=0.01
threads=15 time=0.01
threads=16 time=0.01
threads=17 time=0.01
threads=18 time=0.01
threads=19 time=0.01
threads=20 time=0.01
threads=21 time=0.01
threads=22 time=0.01
threads=23 time=0.02
threads=24 time=0.01
threads=25 time=0.01
threads=26 time=0.01
threads=27 time=0.01
threads=28 time=0.01
threads=29 time=0.01
threads=30 time=0.02
threads=31 time=0.02
threads=32 time=0.02
threads=33 time=0.02
threads=34 time=0.02
threads=35 time=0.02
threads=36 time=0.02
threads=37 time=0.03
threads=38 time=2.49   <== HERE
threads=39 time=3.84   <== HERE
threads=40 time=3.85   <== HERE
threads=41 time=0.14
threads=42 time=0.14
threads=43 time=0.16
threads=44 time=0.15
threads=45 time=0.14
threads=46 time=0.16
threads=47 time=0.16
threads=48 time=0.16
threads=49 time=0.18
threads=50 time=0.17

Note that when the thread count gets close to the hardware capacity, doing 1000
iterations starts to take about 4 seconds, which is millions of times slower than
it should be.


Reproduced with:
g++ 4.8.1
g++ 4.9.1
g++ 4.9.3
g++ 5.3.0


Apparent reasons:

In short: barrier thrashing in spinlock.

Profiling shows that CPU time is split roughly 50/50 between:

gomp_barrier_wait_end
gomp_team_barrier_wait_end

These appear to be barriers before/after task execution and after
gomp_finish_task.

Most of the threads appear to be spinning in the barrier spin-wait.
I'm not quite sure why it is SO MUCH slower.
My guess is that the 36 spinning threads waiting for the 4 working threads starve
them on hardware with only 20 physical cores (not counting hyper-threads).

I checked strace too.
With 4 threads it shows 2000 FUTEX_WAKE calls, which is expected.
With 40 threads it's usually something like 350 FUTEX_WAIT and 850 FUTEX_WAKE
calls. I don't really understand how there can be fewer than 1000 FUTEX_WAKE
calls (at least gomp_barrier_wait_end seems to be unconditional), so I might be
missing something about how barriers are implemented in gomp.

Note that performance gets MUCH better at 41 threads.
That's because having more threads than the hardware thread count changes the
spin-wait behavior. Manually setting GOMP_SPINCOUNT=10 fixes performance as well.
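
For context, here is a minimal sketch of the spin-then-block waiting pattern
described above (a generic illustration only, not libgomp's actual barrier code;
the names wait_on and spin_count are assumptions):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Spin on the flag for up to spin_count iterations, then fall back to a
// futex wait.  With many waiters on oversubscribed cores, the spinning
// phase itself can starve the few threads doing real work, which is the
// effect a small GOMP_SPINCOUNT avoids.
static void wait_on(std::atomic<int>* flag, int expected, long spin_count) {
    for (long i = 0; i < spin_count; ++i) {
        if (flag->load() != expected)
            return;                      // value changed while spinning
    }
    // Block in the kernel until another thread issues FUTEX_WAKE on *flag.
    while (flag->load() == expected)
        syscall(SYS_futex, flag, FUTEX_WAIT, expected, nullptr, nullptr, 0);
}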



Solutions:

1) quick and easy: 
In gomp_parallel_loop_start, calculate the iteration count and pass it to
gomp_resolve_num_threads (see the sketch after this list).

2) better, but changes internal gomp interfaces:
Calculate the iteration count at compile time when possible, then pass it to
gomp_parallel_loop_start and on to gomp_resolve_num_threads. If that is not
possible, calculate it at runtime.

3) verify spinlock behavior:
It might be better to reduce the spin iteration count when the number of managed
threads is higher than the number of physical cores, and reduce it even further
when it is higher than the number of hardware threads (hyper-threads). I'm not
sure about that, though.
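
To illustrate solution 1, here is a minimal sketch of the clamping idea in
simplified form (not the actual libgomp interfaces; the helper name
clamp_team_size and its parameters are assumptions):

// Compute the iteration count of a canonical loop and never request more
// threads than there are iterations.  In libgomp the real change would be
// in gomp_parallel_loop_start / gomp_resolve_num_threads; this only models
// the arithmetic.
static unsigned clamp_team_size(unsigned requested_threads,
                                long start, long end, long incr)
{
    // Iterations of: for (i = start; i < end; i += incr)   (incr > 0)
    //            or: for (i = start; i > end; i += incr)   (incr < 0)
    long count = (incr > 0) ? (end - start + incr - 1) / incr
                            : (start - end - incr - 1) / -incr;
    if (count <= 0)
        return 1;                        // empty loop: one thread is plenty
    if ((long) requested_threads > count)
        return (unsigned) count;         // e.g. 4 iterations -> at most 4 threads
    return requested_threads;
}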
