Hello list,

I want to benchmark the on-chip performance of message passing from one process running on one core to another process running on another core. The test setup would be OpenMPI 1.3.2 on a 4-socket Dunnington, with each process pinned to a specific core. I do know there are other, more suitable programming models for such a shared-memory system; I really just want to have a look at MPI.
But I'm a beginner when it comes to benchmarking at that level, and wanted to ask if you could point me to some "first steps" docs: for example, how to prevent hardware prefetching from getting in the way of measuring worst-case performance when sending big arrays (force fetching random locations?), how to recognize TLB hits/misses in the results, etc. Currently I'm looking over the source code of the SM BTL in OpenMPI and will try to find a schematic of the Dunnington to better understand its architecture (still searching ;-) ).

Thank you very much,
Marcel
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
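For illustration, the "force fetching random locations" idea could be sketched roughly as below. This is only a minimal sketch under assumptions, not a validated benchmark: the helper names build_chain and chase are hypothetical, and the approach shown (a pointer chase through a random single-cycle permutation built with Sattolo's algorithm) is one common way to make every load depend on the previous one so that a stride-based hardware prefetcher has nothing to predict. It runs inside a single process and does not involve MPI or the SM BTL at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: fill next[0..n-1] with a random permutation that
 * forms ONE cycle (Sattolo's algorithm). Following i = next[i] then visits
 * every slot exactly once before returning to the start.
 * rand() is used for brevity; its modulo bias and limited range make the
 * permutation less uniform, but it is still always a single cycle. */
static void build_chain(size_t *next, size_t n, unsigned seed)
{
    assert(next != NULL);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    if (n < 2)
        return;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i-1], never i itself */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Hypothetical helper: perform `steps` dependent loads starting at `start`.
 * Each address depends on the previous load, so the accesses cannot be
 * overlapped or predicted from a fixed stride. The final index is returned
 * so the compiler cannot discard the loop. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t i = start;
    while (steps--)
        i = next[i];
    return i;
}
```

To get a number out of this, one would time a call to chase() over an array much larger than the last-level cache (for example with clock_gettime) and divide by the step count. Two caveats: several size_t entries share one cache line, so a stricter version would pad each entry to a full line, and with arrays spanning many pages the measured latency will presumably also include TLB misses, which would have to be separated out with hardware performance counters.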