At 08:39 27.06.2008, Patrick Geoffray wrote:
>Hi Hakon,
>
>Håkon Bugge wrote:
>>This is information we're using to optimize how
>>point-to-point communication is implemented. The
>>code base involved is fairly complicated, and I
>>do not expect resource management systems to cope with it.
>
>Why not? It's its job to know the resources it
>has to manage. The resource manager has more
>information than you, it does not have to detect
>the topology at runtime for each job, and it can
>manage core allocation across jobs. You cannot
>expect the granularity of allocation to stay at
>the node level as core counts increase.
This raises two questions: a) Which job
schedulers are able to optimize placement on
cores, thereby _improving_ application
performance? b) Which job schedulers are able to
deduce which cores share an L3 cache and sit on the same socket?
... and a clarification: systems using Scali MPI
Connect _can_ have finer granularity than the
node level; the job scheduler just must not
oversubscribe. Assignment of cores to processes
is done _dynamically_ by Scali MPI Connect.
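On question (b), the cache/socket deduction can in principle be made from
the Linux sysfs topology files, without help from the scheduler. A minimal
sketch (the helper names are my own, and this assumes the Linux sysfs layout
under /sys/devices/system/cpu/; it is not how any particular scheduler or
MPI does it):

```python
from collections import defaultdict

def group_by_package(package_ids):
    """Group logical CPU ids by physical package (socket) id.

    package_ids maps cpu id -> physical_package_id, as read on Linux from
    /sys/devices/system/cpu/cpuN/topology/physical_package_id.
    Cache sharing can be read analogously from
    .../cpuN/cache/indexM/shared_cpu_list.
    """
    groups = defaultdict(list)
    for cpu in sorted(package_ids):
        groups[package_ids[cpu]].append(cpu)
    return dict(groups)

def read_package_ids():
    """Read socket ids from sysfs (Linux only); returns {} elsewhere."""
    import glob, re
    ids = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/physical_package_id"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)", path).group(1))
        with open(path) as f:
            ids[cpu] = int(f.read())
    return ids
```

For example, on a two-socket quad-core box, group_by_package(read_package_ids())
would yield two groups of four CPU ids each.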
>If the MPI implementation does the spawning, it
>should definitely have support to enforce core
>affinity (most do, AFAIK). However, core affinity
>should be dictated by the scheduler. Heck, the
>MPI implementation should not do the spawning in the first place.
>
>Historically, resource managers have been pretty
>dumb. These days, there is enough competition in this domain to expect better.
I am fine with the schedulers dictating it, but not if it hurts performance.
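For concreteness, enforcing core affinity from inside a launcher can be
sketched as below. This is a generic illustration, not the Scali MPI Connect
mechanism; the round-robin policy and the function names are my own, and
os.sched_setaffinity is Linux-only:

```python
import os

def rank_to_core(local_rank, cores_per_node):
    """Illustrative placement policy: map a node-local rank to a core
    round-robin. A topology-aware launcher would instead pick cores that
    share (or avoid sharing) a cache, depending on the communication
    pattern."""
    return local_rank % cores_per_node

def pin_self(core):
    """Pin the calling process to a single core (Linux; no-op elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})

# e.g. in the spawned process: pin_self(rank_to_core(local_rank, 8))
```

The point of the thread stands either way: whichever component calls the
affinity syscall, the *policy* (which core for which rank) is where the
performance lives.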
Håkon
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf