Tony,

Interesting work, to say the least. A few comments.

The TCP implementation of OpenMPI is known to be sub-optimal (i.e. it can perform poorly in some situations). Indeed, using LAM over TCP usually provides much better numbers.
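For anyone who wants to poke at the TCP path before writing it off, this is the sort of knob I mean. Treat it as a rough sketch: the buffer sizes below are just guesses for GigE, "./myapp" stands in for whatever you actually run, and ompi_info will tell you which parameters your particular build really exposes.

    # list the TCP BTL parameters this build understands
    ompi_info --param btl tcp

    # run over TCP (plus shared memory and self), pin it to eth0,
    # and bump the per-socket buffers
    mpirun -np 16 --hostfile hosts \
        --mca btl tcp,sm,self \
        --mca btl_tcp_if_include eth0 \
        --mca btl_tcp_sndbuf 262144 \
        --mca btl_tcp_rcvbuf 262144 \
        ./myapp

No guarantee any of that closes the gap, but it is cheap to try before blaming the network.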
I have found that the single-socket Pentium D (now called the Xeon 3000 series) provides great performance. The big caches help quite a bit, plus it is a single socket (more sockets means more memory contention). That said, I believe that for the right applications GigE can be very cost effective. The TCP latency for the Intel NICs is actually quite good (~28 us) when the driver options are set properly, and GAMMA takes it to the next level. (There is a rough ping-pong sketch at the bottom of this mail showing how that kind of number is usually measured.)

I have not had time to read your report in its entirety, but I noticed your question about how GigE+GAMMA can do as well as Infiniband. Well, if the application does not need the extra throughput, then there will be no improvement. In the same way, the EP test in the NAS parallel suite gives about the same result for every interconnect (EP stands for Embarrassingly Parallel); IS (Integer Sort), on the other hand, is very sensitive to latency. Now, with multi-socket/multi-core becoming the norm, better throughput will become more important. I'll have some tests posted before too long to show the difference on dual-socket quad-core systems.

Finally, OpenMPI+GAMMA would be really nice. The good news is that OpenMPI is very modular.

Keep up the good work.

-- Doug

> I have recently completed a number of performance tests on a Beowulf
> cluster, using up to 48 dual-core P4D nodes, connected by an Extreme
> Networks Gigabit edge switch. The tests consist of single and multi-node
> application benchmarks, including DLPOLY, GROMACS, and VASP, as well as
> specific tests of network cards and switches. I used TCP sockets with
> OpenMPI v1.2 and MPI/GAMMA over Gigabit Ethernet. MPI/GAMMA leads to
> significantly better scaling than OpenMPI/TCP in both network tests and
> in application benchmarks. The overall performance of the MPI/GAMMA
> cluster on a per-CPU basis was found to be comparable to a dual-core
> Opteron cluster with an Infiniband interconnect. The DLPOLY benchmark
> showed similar scaling to that reported for an IBM p690. The performance
> using TCP was typically a factor of 2 less in these same tests. Here are
> a couple of examples from DLPOLY benchmark 1 (27,000 NaCl ions):
>
>  CPUs   OpenMPI/TCP (P4D)   MPI/GAMMA (P4D)   OpenMPI/Infiniband (Opteron 275)
>     1                1255              1276                               1095
>     2                 614               635                                773
>     4                 337               328                                411
>     8                 184               173                                158
>    16                 125                95                                 84
>    32                  82                56                                 50
>    64                  84                34                                 42
>
> A detailed write-up can be found at:
> http://ladd.che.ufl.edu/research/beoclus/beoclus.htm
>
> -------------------------------
> Tony Ladd
> Chemical Engineering
> University of Florida
> PO Box 116005
> Gainesville, FL 32611-6005
>
> Tel: 352-392-6509
> FAX: 352-392-9513
> Email: [EMAIL PROTECTED]
> Web: http://ladd.che.ufl.edu
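PS: By "driver options" I mostly mean interrupt throttling/coalescing. With Intel's e1000 driver that is the InterruptThrottleRate module parameter (0 disables throttling); other drivers expose something similar through "ethtool -C eth0 rx-usecs 0". The knob names vary, so check your driver's docs rather than taking these literally. A number like ~28 us is normally measured with a small-message ping-pong; here is a minimal sketch (nothing clever, iteration counts are arbitrary) -- compile with mpicc and run with exactly two ranks, one per node:

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Minimal 2-rank ping-pong: rank 0 bounces a 1-byte message off rank 1;
     * half the average round-trip time is the usual "latency" figure. */
    int main(int argc, char **argv)
    {
        int rank, size, i;
        const int warmup = 1000, iters = 10000, nbytes = 1;
        char buf[1];
        double t0 = 0.0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) {
            if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
            MPI_Finalize();
            return 1;
        }
        memset(buf, 0, sizeof(buf));

        for (i = 0; i < warmup + iters; i++) {
            if (i == warmup) {              /* start timing after warm-up */
                MPI_Barrier(MPI_COMM_WORLD);
                t0 = MPI_Wtime();
            }
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("half round trip: %.2f us\n",
                   (t1 - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }

Something like "mpicc -O2 pingpong.c -o pingpong && mpirun -np 2 --hostfile hosts ./pingpong" will do; over plain TCP expect a noticeably higher number if interrupt coalescing is still at its defaults.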
