At 08:39 27.06.2008, Patrick Geoffray wrote:
Hi Hakon,

Håkon Bugge wrote:
This is information we use to optimize how point-to-point communication is implemented. The code base involved is fairly complicated, and I do not expect resource management systems to cope with it.

Why not? It is its job to know the resources it has to manage. The resource manager has more information than you do, it does not have to detect the topology at runtime for each job, and it can manage core allocation across jobs. You cannot expect the granularity of allocation to stay at the node level as core counts increase.

This raises two questions: a) which job schedulers are able to optimize placement on cores, thereby _improving_ application performance? b) which job schedulers are able to deduce which cores share an L3 cache and sit on the same socket?
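(For illustration only: on Linux, the socket and L3 layout asked about in b) can in principle be read from sysfs. The sketch below is hypothetical and not taken from any particular scheduler or from Scali MPI Connect; note that the cache index directories vary per CPU model, and older kernels expose shared_cpu_map rather than shared_cpu_list.)

/* Hypothetical sketch, not Scali code: read the socket id and L3 sharing
 * from Linux sysfs for each online CPU. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void read_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    buf[0] = '\0';
    if (f != NULL) {
        if (fgets(buf, (int)len, f) == NULL)
            buf[0] = '\0';
        else
            buf[strcspn(buf, "\n")] = '\0';   /* drop trailing newline */
        fclose(f);
    }
}

int main(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    for (long cpu = 0; cpu < ncpu; cpu++) {
        char path[256], pkg[64], shared[256];

        /* Physical socket this logical CPU belongs to. */
        snprintf(path, sizeof(path),
            "/sys/devices/system/cpu/cpu%ld/topology/physical_package_id", cpu);
        read_line(path, pkg, sizeof(pkg));

        /* CPUs sharing the cache at index3 (usually the L3 where one exists). */
        snprintf(path, sizeof(path),
            "/sys/devices/system/cpu/cpu%ld/cache/index3/shared_cpu_list", cpu);
        read_line(path, shared, sizeof(shared));

        printf("cpu%ld: socket %s, L3 shared with cpus %s\n", cpu, pkg, shared);
    }
    return 0;
}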

... and a clarification: systems using Scali MPI Connect _can_ have finer granularity than the node level, as long as the job scheduler does not oversubscribe the nodes. Assignment of cores to processes is done _dynamically_ by Scali MPI Connect.
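(Again purely as an illustration of the mechanism, not Scali's actual implementation: on Linux a launcher, or the MPI library itself, can pin a rank to a chosen core with sched_setaffinity(2), roughly as below. The core number would come from whatever placement policy is in effect.)

/* Hypothetical sketch: pin the calling process to one core. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int core = (argc > 1) ? atoi(argv[1]) : 0;  /* chosen by the launcher/policy */
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* pid 0 means "the calling process". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core %d\n", core);
    /* ... continue as (or exec) the MPI rank here ... */
    return 0;
}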


If the MPI implementation does the spawning, it should definitely have support to enforce core affinity (most do AFAIK). However, core affinity should be dictated by the scheduler. Heck, the MPI implementation should not do the spawning in the first place.

Historically, resource managers have been pretty dumb. These days, there is enough competition in this domain to expect better.

I am fine with the schedulers dictating core affinity, but not if it hurts performance.


Håkon


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
