Charlie

Not so. Jeff Squyres told me the same thing, but in my experience it is not the TCP implementation in OpenMPI that is so bad; it is the algorithms for the collectives that make the biggest difference. LAM is about the same as OpenMPI for point-to-point. See below for a 2-node bidirectional edge exchange; results are 1-way throughput in MBytes/sec. There is a difference at 64 KBytes because OMPI switches to the rendezvous protocol there, while LAM was set to switch at 128 KBytes. Other than that the performance is similar.
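For concreteness, here is a minimal sketch of this kind of 2-node bidirectional exchange test. It is not the exact harness used for the numbers below; the message sizes, repetition count, and lack of warm-up are placeholders.

/* Minimal sketch of a 2-node bidirectional exchange benchmark.
 * Not the exact harness used for the table below; sizes and
 * repetition count are placeholders. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    if (np != 2) {
        if (rank == 0) fprintf(stderr, "run on exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int other = 1 - rank;
    const int reps = 100;
    for (size_t bytes = 1024; bytes <= 1024 * 1024; bytes *= 2) {
        char *sbuf = malloc(bytes), *rbuf = malloc(bytes);
        memset(sbuf, 0, bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            /* Both ranks send and receive at the same time, so the
             * link carries traffic in both directions. */
            MPI_Sendrecv(sbuf, (int)bytes, MPI_BYTE, other, 0,
                         rbuf, (int)bytes, MPI_BYTE, other, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        double t = MPI_Wtime() - t0;
        if (rank == 0)
            printf("%8zu KB  %8.1f MBytes/sec (1-way)\n",
                   bytes / 1024, (double)bytes * reps / t / 1.0e6);
        free(sbuf);
        free(rbuf);
    }
    MPI_Finalize();
    return 0;
}

Since each rank sends and receives simultaneously, the figure reported is the one-way rate seen by each rank, which is what the table below quotes.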
MPICH has a worse TCP implementation than either of these, but its collective algorithms are the best I have tested, particularly All_reduce, which gets used a lot. So in some applications MPICH can edge out OpenMPI or LAM for large numbers of CPUs. The optimum All_reduce has an asymptotic time proportional to 2M, where M is the message length. LAM and OMPI typically use a binary tree, which costs 2M*log_2(Np); this makes a substantial difference for large numbers of processors (Np > 32). MPICH has a near-optimum algorithm that scales asymptotically as 4M, independent of Np (a rough sketch of that style of algorithm is appended at the end of this message). So MPICH + GAMMA lays waste to any TCP implementation, including LAM.

I have a lot of results for LAM as well, though not quite as complete as for OpenMPI. In general I found OpenMPI v1.2 gave similar application benchmarks to LAM, which is why I didn't bother to report them; MPI/GAMMA is much faster than either.

OpenMPI had horrible collectives in v1.0 and v1.1; I got dreadful All_reduce performance with TCP + OMPI (throughputs less than 0.1 MBytes/sec). v1.2 is much better than v1.1, but still poor in comparison with MPICH. The OpenMPI developers chose to make the optimization of the collectives very flexible, but there is no decent interface for handling that optimization yet, and the best algorithms (for instance for All_reduce) are not yet implemented as far as I can tell. My attempts at tuning the OpenMPI collectives were not very successful.

Bottom line: OpenMPI has improved its collectives significantly in v1.2, and with v1.2 I don't see significant differences between the OMPI and LAM benchmarks. But MPI/GAMMA is much better than any TCP implementation, both in network benchmarks and in applications.

Tony

2-node bidirectional edge exchange, 1-way throughput in MBytes/sec:

Size (KBytes)    LAM    OMPI
    1            8.4     8.2
    2           15.0    15.3
    4           21.6    21.8
    8           36.0    34.7
   16           54.4    53.7
   32           74.9    73.0
   64           90.8    45.0
  128           51.5    51.5
  256           55.9    55.7
  512           58.3    62.0
 1024           61.0    61.0

-----Original Message-----
From: Charlie Peck [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 29, 2006 7:31 AM
To: Tony Ladd
Subject: Re: [Beowulf] Parallel application performance tests

On Nov 28, 2006, at 1:27 PM, Tony Ladd wrote:

> I have recently completed a number of performance tests on a Beowulf
> cluster, using up to 48 dual-core P4D nodes, connected by an Extreme
> Networks Gigabit edge switch. The tests consist of single and multi-node
> application benchmarks, including DLPOLY, GROMACS, and VASP, as well as
> specific tests of network cards and switches. I used TCP sockets with
> OpenMPI v1.2 and MPI/GAMMA over Gigabit ethernet. MPI/GAMMA leads to
> significantly better scaling than OpenMPI/TCP in both network tests
> and in application benchmarks.

It turns out that the TCP binding for OpenMPI is known to have problems. They have been focusing on the proprietary high-speed interconnects and haven't had time to go back and improve the performance of the TCP binding yet. If you run LAM/TCP you will notice a significant difference by comparison.

charlie
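P.S. For anyone curious about the Np-independent scaling mentioned above, here is a rough sketch of a ring-style all-reduce (a reduce-scatter followed by an allgather). It is only an illustration of the idea, not MPICH's actual code, and it assumes the element count is divisible by the number of processes.

/* Rough sketch of a ring-style all-reduce (reduce-scatter + allgather).
 * Illustration only -- not MPICH's implementation.  Assumes count is
 * divisible by the number of processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void ring_allreduce_sum(double *data, int count, MPI_Comm comm)
{
    int rank, np;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &np);

    int chunk = count / np;              /* assumes count % np == 0 */
    int right = (rank + 1) % np;
    int left  = (rank - 1 + np) % np;
    double *tmp = malloc(chunk * sizeof(double));

    /* Reduce-scatter: after np-1 steps, rank r owns the fully reduced
     * chunk (r+1) mod np. */
    for (int s = 0; s < np - 1; s++) {
        int send_idx = (rank - s + np) % np;
        int recv_idx = (rank - s - 1 + np) % np;
        MPI_Sendrecv(data + send_idx * chunk, chunk, MPI_DOUBLE, right, 0,
                     tmp, chunk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < chunk; i++)
            data[recv_idx * chunk + i] += tmp[i];
    }

    /* Allgather: circulate the reduced chunks around the ring until
     * every rank has the complete result. */
    for (int s = 0; s < np - 1; s++) {
        int send_idx = (rank + 1 - s + np) % np;
        int recv_idx = (rank - s + np) % np;
        MPI_Sendrecv(data + send_idx * chunk, chunk, MPI_DOUBLE, right, 0,
                     data + recv_idx * chunk, chunk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    free(tmp);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    int count = 1024 * np;               /* placeholder problem size */
    double *data = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) data[i] = 1.0;

    ring_allreduce_sum(data, count, MPI_COMM_WORLD);
    if (rank == 0)
        printf("data[0] = %g (expect %d)\n", data[0], np);

    free(data);
    MPI_Finalize();
    return 0;
}

The point is that each process sends a volume proportional to M in each phase, independent of Np, which is why this family of algorithms beats the M*log_2(Np) growth of a binary tree once Np gets large.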
