Just to add to my earlier comment below: I think numad is really meant
for non-HPC environments where latency-hiding is more important than
all-out performance. It's kind of like hyperthreading - on HPC
workloads it provides marginal improvement at best, but it's very
helpful on non-HPC workloads (or so I've been told - I have no
firsthand professional experience with hyperthreading).
Prentice
On 1/18/22 2:56 PM, Prentice Bisbal wrote:
Mike,
I turn it off. When I had it on, it would cause performance to tank.
Some basic analysis suggested that numad was moving all the work to a
single core, leaving all the others idle. Without knowing the inner
workings of numad, my guess is that it saw the processes accessing the
same region of memory and moved them all to the core "closest" to that
memory.
I didn't do any in-depth analysis, but turning off numad definitely
fixed that problem. The problem first appeared with a user code, and I
was able to reproduce it with HPL. It took 10 - 20 minutes for numad
to start migrating processes to the same core, so smaller "test" jobs
didn't trigger the behavior, and my first attempts at reproducing it
were unsuccessful. It wasn't until I ran "full" HPL tests on a node
that I was able to reproduce the problem.
I think I used turbostat or something like that to watch the load
and/or processor frequencies on the individual cores.
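If you want to check for the same symptom, something like this should
show it (xhpl here is just an example process name, and turbostat's
output columns vary a bit by version):

   # per-core busy % and frequency, sampled every 5 seconds
   turbostat --interval 5
   # which core (PSR column) each rank is currently running on
   ps -C xhpl -o pid,psr,pcpu,comm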
Prentice
On 1/18/22 1:18 PM, Michael Di Domenico wrote:
Does anyone turn numad on/off on their clusters? I'm running RHEL 7.9
on Intel CPUs and seeing a heavy performance impact on MPI jobs when
running numad.
The diagnosis is pretty preliminary right now, so I'm light on details.
When running numad I'm seeing MPI jobs stall while numad pokes at the
job. The stall is notable, like 10-12 seconds.
It's particularly interesting because if one rank stalls while numad
runs, the others wait. Once it's freed they all continue, but then
another rank gets hit, so I end up seeing this cyclic stall.
Like I said, I'm still looking into things, but I'm curious what
everyone's take on numad is. My sense is we probably don't even
really need it, since Slurm/OpenMPI should be handling process
placement anyhow.
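For example, explicit binding at launch should make numad's migrations
unnecessary - something like this, though the exact flags depend on
your OpenMPI/Slurm versions:

   # OpenMPI: pin one rank per core
   mpirun --bind-to core --map-by core ./a.out
   # Slurm: equivalent binding when launching with srun
   srun --cpu-bind=cores ./a.out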
Thoughts?
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf