On Mon, Jul 11, 2011 at 11:39 PM, Mark Hahn <h...@mcmaster.ca> wrote:
>> http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html
>
> since this isn't an SGE list, I don't want to pursue an off-topic too far,
Hi Mark,

I think a lot of this will apply to non-SGE batch schedulers -- in fact, Torque will support hwloc in a future release. All mature batch systems (e.g. LSF, SGE, SLURM) have had some form of CPU set support for years, but the feature matters more now: as more cores are added per socket, the interaction between the different hardware layers has a larger impact on performance.

> but out of curiosity, does this make the scheduler topology aware?
> that is, not just topo-aware binding, but topo-aware resource allocation?
> you know, avoid unnecessary resource contention among the threads belonging
> to multiple jobs that happen to be on the same node.

You can tell SGE (now: Grid Scheduler) how you want hardware resources allocated, but different hardware architectures and program behaviors can introduce interactions with very different performance impacts. For example, a few years ago, while I was still working for a large UNIX system vendor, I found that a few SPEC OMP benchmarks ran faster when the threads were closer to each other (even when sharing the same core in SMT mode), while most benchmarks benefited from more L2/L3 cache and memory bandwidth (I'm talking about the same thread count in both cases).

But even for a compiler developer it is hard to choose the optimal thread placement -- even with high-level array access pattern information and memory bandwidth models available at compile time. A batch system has far less information than the compiler does. While we could profile systems on the fly with PAPI, I doubt we will go that route in the near future.
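The "packed vs. spread out" placement tradeoff can be sketched as a simple core-selection function. This is a hypothetical illustration of the idea behind strided binding, not SGE code; `pick_cores` and the 16-core node are assumptions for the example:

```python
# Hypothetical sketch (not SGE/OGS code): choose logical core IDs for a
# job's threads, either packed onto adjacent cores (sharing L2/L3, less
# bandwidth per thread) or strided apart (more cache and memory
# bandwidth per thread).

def pick_cores(nthreads, total_cores, step):
    """Return nthreads core IDs starting at core 0 with the given stride.

    step == 1 packs threads onto adjacent cores; step >= 2 spreads them
    out, e.g. one thread per cache domain or per socket.
    """
    cores = [i * step for i in range(nthreads)]
    if cores[-1] >= total_cores:
        raise ValueError("not enough cores for this thread count and stride")
    return cores

# 4 threads on a hypothetical 16-core node:
packed = pick_cores(4, 16, step=1)   # [0, 1, 2, 3]  -- threads share caches
spread = pick_cores(4, 16, step=4)   # [0, 4, 8, 12] -- one per 4-core domain
```

On Linux, the chosen set could then be applied to the current process with `os.sched_setaffinity(0, set(spread))`; a batch system does the equivalent on the job's behalf.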
So that means we need the job submitter to tell us what he wants. In SGE/OGS we have "qsub -binding striding:<amount>:<step-size>", which means you will need to benchmark the code, see how it interacts with the hardware, and decide whether it runs better with more L2/L3 cache and memory bandwidth (meaning step-size >= 2); or "qsub -binding linear", which means the job gets the cores to itself.

http://wikis.sun.com/display/gridengine62u5/Using+Job+to+Core+Binding

> large-memory processes
> not getting bound to a single memory node. packing both small and
> large-memory processes within a node. etc?

For memory nodes, a call to numactl should be able to handle most use-cases.

Rayson

>
> thanks, mark hahn.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf