https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71781

Jonathan Hogg <jhogg41 at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jhogg41 at gmail dot com

--- Comment #1 from Jonathan Hogg <jhogg41 at gmail dot com> ---
We also see similar behaviour when running task-based (as opposed to parallel
for) code. When the number of tasks is much smaller than the number of cores,
most time is spent in libgomp spinning. Presumably there is too much contention
on the work-queue lock. We're running on 28 real cores (2x14-core Intel
Haswell-EP chips).
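
For reference, a minimal sketch of the kind of pattern involved (hypothetical
reproducer, not our actual code; NTASKS/WORK and the loop bodies are purely
illustrative): a handful of small tasks on a team of many threads, so most
threads have nothing to dequeue and sit in the runtime's wait paths.

/* few-tasks-on-many-cores sketch; compile with -fopenmp */
#include <omp.h>
#include <stdio.h>

#define NTASKS 4        /* much smaller than the 28 cores we run on */
#define WORK   1000000  /* small amount of work per task */

int main(void)
{
    double sums[NTASKS] = {0.0};

    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    #pragma omp single
    {
        int i;
        for (i = 0; i < NTASKS; i++) {
            #pragma omp task firstprivate(i) shared(sums)
            {
                /* trivial per-task work, far less than the team can absorb */
                int j;
                for (j = 0; j < WORK; j++)
                    sums[i] += (double)j * 1e-9;
            }
        }
        #pragma omp taskwait
    }

    printf("sums[0] = %f\n", sums[0]);
    return 0;
}

Running such a case with OMP_WAIT_POLICY=passive (or a low GOMP_SPINCOUNT)
should give an idea of how much of the overhead is spin-waiting as opposed to
genuine contention on the task queue.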

If we look at our task profile, we see that very little of the time is spent
inside our task code, and this is confirmed by profile data from perf:
  27.60%  spral_ssids  libgomp.so.1.0.0      [.] gomp_mutex_lock_slow
   6.96%  spral_ssids  libgomp.so.1.0.0      [.] gomp_team_barrier_wait_end
   3.78%  spral_ssids  [kernel.kallsyms]     [k] _spin_lock_irq
   2.91%  spral_ssids  [kernel.kallsyms]     [k] smp_invalidate_interrupt
   2.21%  spral_ssids  spral_ssids           [.] __CreateCoarseGraphNoMask
   2.18%  spral_ssids  [kernel.kallsyms]     [k] _spin_lock
   2.05%  spral_ssids  libmkl_avx2.so        [.] mkl_blas_avx2_dgemm_kernel_0
   1.99%  spral_ssids  spral_ssids           [.] __FM_2WayNodeRefine_OneSided
   1.78%  spral_ssids  libgomp.so.1.0.0      [.] gomp_sem_wait_slow
   1.64%  spral_ssids  libc-2.12.so          [.] __GI_____strtod_l_internal

Here's an example of what we're seeing:

Small problems (much less work than there are cores):
4 cores / 28 cores (times in seconds)
0.02 / 0.17
0.20 / 0.60
0.20 / 0.58
0.14 / 0.63
0.75 / 2.37

Bigger problems (sufficient work exists):
4 cores / 28 cores (times in seconds)
48.52 / 22.16
153.49 / 61.77
140.89 / 54.51
189.75 / 71.43
