https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71781
Jonathan Hogg <jhogg41 at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jhogg41 at gmail dot com

--- Comment #1 from Jonathan Hogg <jhogg41 at gmail dot com> ---
We also see similar behaviour when running task-based (as opposed to
parallel for) code. When the number of tasks is much smaller than the
number of cores, most of the time is spent spinning in libgomp.
Presumably there is too much contention on the work-queue lock.

We are running on 28 real cores (2x14-core Intel Haswell-EP chips). If we
look at our task profile, we see that very little of the time is spent
inside our task code, and this is confirmed by profile data from perf:

    27.60%  spral_ssids  libgomp.so.1.0.0   [.] gomp_mutex_lock_slow
     6.96%  spral_ssids  libgomp.so.1.0.0   [.] gomp_team_barrier_wait_end
     3.78%  spral_ssids  [kernel.kallsyms]  [k] _spin_lock_irq
     2.91%  spral_ssids  [kernel.kallsyms]  [k] smp_invalidate_interrupt
     2.21%  spral_ssids  spral_ssids        [.] __CreateCoarseGraphNoMask
     2.18%  spral_ssids  [kernel.kallsyms]  [k] _spin_lock
     2.05%  spral_ssids  libmkl_avx2.so     [.] mkl_blas_avx2_dgemm_kernel_0
     1.99%  spral_ssids  spral_ssids        [.] __FM_2WayNodeRefine_OneSided
     1.78%  spral_ssids  libgomp.so.1.0.0   [.] gomp_sem_wait_slow
     1.64%  spral_ssids  libc-2.12.so       [.] __GI_____strtod_l_internal

Here is an example of what we are seeing (times in seconds):

Small problems (much less work than cores):
    4 cores / 28 cores
       0.02 /     0.17
       0.20 /     0.60
       0.20 /     0.58
       0.14 /     0.63
       0.75 /     2.37

Bigger problems (sufficient work exists):
    4 cores / 28 cores
      48.52 /    22.16
     153.49 /    61.77
     140.89 /    54.51
     189.75 /    71.43
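For reference, a minimal sketch of the kind of pattern we mean (the task
body do_work, the task count, and the build/run commands below are
stand-ins for illustration, not our actual SSIDS code): a handful of
OpenMP tasks launched in a team of many threads, so most threads have no
work and end up spinning in the runtime.

/* Illustrative sketch only: do_work and NTASKS are stand-ins, not the
 * real solver code.  Build with e.g.: gcc -O2 -fopenmp few_tasks.c
 * Run with many threads (e.g. OMP_NUM_THREADS=28); with far fewer
 * tasks than threads, the idle threads spin inside libgomp. */
#include <stdio.h>

#define NTASKS 4                  /* much smaller than the core count */

/* stand-in for a real task body */
static double do_work(int i)
{
    double s = 0.0;
    for (int j = 0; j < 50000000; j++)
        s += (double)(i + j) * 1e-9;
    return s;
}

int main(void)
{
    double result[NTASKS] = {0};

    #pragma omp parallel          /* e.g. 28 threads on 2x14-core Haswell-EP */
    #pragma omp single
    {
        for (int i = 0; i < NTASKS; i++) {
            #pragma omp task firstprivate(i) shared(result)
            result[i] = do_work(i);
        }
    }   /* threads with no task to run wait/spin at this barrier */

    for (int i = 0; i < NTASKS; i++)
        printf("result[%d] = %f\n", i, result[i]);
    return 0;
}

With a setup like this, a perf profile on our machine is dominated by the
libgomp wait/lock routines rather than the task bodies, matching the
numbers shown above.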