At 08:39 27.06.2008, Patrick Geoffray wrote:
>Hi Hakon,
>
>Håkon Bugge wrote:
>>This is information we're using to optimize how
>>point-to-point communication is implemented. The
>>code base involved is fairly complicated, and I
>>do not expect resource management systems to cope with it.
>
>Why not? It's its job to know the resources it
>has to manage. The resource manager has more
>information than you, it does not have to detect
>the topology at runtime for each job, and it can
>manage core allocation across jobs. You cannot
>expect the granularity of allocation to stay at
>the node level as core counts increase.
This raises two questions: a) Which job
schedulers are able to optimize placement on
cores, thereby _improving_ application
performance? b) Which job schedulers are able to
deduce which cores share an L3 cache and sit on the same socket?
... and a clarification: systems using Scali MPI
Connect _can_ have finer granularity than the
node level; the job scheduler just must not
oversubscribe. Assignment of cores to processes
is done _dynamically_ by Scali MPI Connect.
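On question (b), the cache/socket deduction can in principle be made from
the Linux sysfs topology files, without help from the scheduler. A minimal
sketch (the helper names are my own, and this assumes the Linux sysfs layout
under /sys/devices/system/cpu/; it is not how any particular scheduler or
MPI does it):

```python
from collections import defaultdict

def group_by_package(package_ids):
    """Group logical CPU ids by physical package (socket) id.

    package_ids maps cpu id -> physical_package_id, as read on Linux from
    /sys/devices/system/cpu/cpuN/topology/physical_package_id.
    Cache sharing can be read analogously from
    .../cpuN/cache/indexM/shared_cpu_list.
    """
    groups = defaultdict(list)
    for cpu in sorted(package_ids):
        groups[package_ids[cpu]].append(cpu)
    return dict(groups)

def read_package_ids():
    """Read socket ids from sysfs (Linux only); returns {} elsewhere."""
    import glob, re
    ids = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/physical_package_id"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)", path).group(1))
        with open(path) as f:
            ids[cpu] = int(f.read())
    return ids
```

For example, on a two-socket quad-core box, group_by_package(read_package_ids())
would yield two groups of four CPU ids each.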
>If the MPI implementation does the spawning, it
>should definitely have support to enforce core
>affinity (most do, AFAIK). However, core affinity
>should be dictated by the scheduler. Heck, the
>MPI implementation should not do the spawning in the first place.
>
>Historically, resource managers have been pretty
>dumb. These days, there is enough competition in this domain to expect better.
I am fine with the schedulers dictating it, but not if it hurts performance.
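For concreteness, enforcing core affinity from inside a launcher can be
sketched as below. This is a generic illustration, not the Scali MPI Connect
mechanism; the round-robin policy and the function names are my own, and
os.sched_setaffinity is Linux-only:

```python
import os

def rank_to_core(local_rank, cores_per_node):
    """Illustrative placement policy: map a node-local rank to a core
    round-robin. A topology-aware launcher would instead pick cores that
    share (or avoid sharing) a cache, depending on the communication
    pattern."""
    return local_rank % cores_per_node

def pin_self(core):
    """Pin the calling process to a single core (Linux; no-op elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})

# e.g. in the spawned process: pin_self(rank_to_core(local_rank, 8))
```

The point of the thread stands either way: whichever component calls the
affinity syscall, the *policy* (which core for which rank) is where the
performance lives.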
Håkon
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf