Doug, I think the difference is mainly in the collective communications, which are improving and are roughly at LAM's level in v1.2. Overall I don't see a big difference between LAM and OpenMPI.
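For anyone who wants to see the collective difference directly, a bare-bones timing loop around MPI_Allreduce is enough. The message size and repeat count below are arbitrary illustrative choices, not the settings from my report, and the program is only a sketch of the idea:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Minimal allreduce timing loop -- sizes and counts are illustrative only. */
    int main(int argc, char **argv)
    {
        int rank, nprocs, i;
        const int n = 4096;          /* doubles per message (arbitrary) */
        const int reps = 1000;       /* timed iterations (arbitrary) */
        double *in, *out, t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        in  = malloc(n * sizeof(double));
        out = malloc(n * sizeof(double));
        for (i = 0; i < n; i++) in[i] = (double) rank;

        /* one warm-up call, then time the loop */
        MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++)
            MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d procs: %.1f us per allreduce\n",
                   nprocs, 1e6 * (t1 - t0) / reps);

        free(in); free(out);
        MPI_Finalize();
        return 0;
    }

Compiling with mpicc and running it under both OpenMPI 1.2 and LAM makes the collective differences (or lack of them) easy to see as the node count grows.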
Where can I find the driver settings for the Intel NICs to get the latency down? I would like to try benchmarking the Intel NICs with reduced-latency TCP; my guess at the relevant knobs is in the P.S. below.

That said, I think the very good scaling I see with GAMMA is due to its simple and highly efficient flow control. You have tested GAMMA, so you will have observed how regular it is; this becomes important with large numbers of nodes, and I think it is the source of the good scaling. These are scalable applications, but not of the "Embarrassingly Parallel" variety. Referring to my post of the DLPOLY benchmark: with 32 CPUs and TCP it takes 82 seconds, and with 64 CPUs it takes 84 seconds, so it is NOT an EP application. MPI/GAMMA, on the other hand, takes 56 and 34 seconds respectively. That is better scaling than our HPC center's Infiniband cluster, which took 50 and 42 seconds. There are a number of reasons why the IB cluster is not performing as well as it should, but this is not so atypical of supercomputer center environments, which impose a number of constraints that can affect performance. I discuss my interpretation of some of these issues on my website.

I was skimming over your website today. I also found that HPL is not faster with GAMMA than with TCP. An HPCC developer told me that HPL overlaps calculation and communication; GAMMA polls, so it always utilizes 100% of the CPU and cannot take advantage of this.

I totally agree with your comment about supercomputer utilization. A huge amount of money is being spent to make these things scale to enormous numbers of processors, just for a couple of HPL runs; then they get Balkanized down to groups of 16-64 CPUs most of the time. BTW, we get 385 Gflops on HPL from a rack of 48 dual-core P4s; the cost per rack was $51K including switches and miscellaneous hardware (with good discounts from vendors), so about $132 per Gflop.

Tony
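P.S. On the Intel NIC settings: the knobs I have in mind are the e1000 interrupt throttling/coalescing parameters. I am only guessing that these are the options Doug means, and I have not verified the right values, so treat the commands below as a starting point rather than a recipe (eth1 is just a stand-in for whichever interface is on the cluster network):

    # disable interrupt throttling when loading the e1000 driver (assumed driver)
    modprobe e1000 InterruptThrottleRate=0

    # or turn off receive interrupt coalescing on a running interface
    ethtool -C eth1 rx-usecs 0

    # show the current coalescing settings for comparison
    ethtool -c eth1

Turning coalescing down should cut small-message latency at the cost of a higher interrupt rate, so it is a trade-off worth measuring rather than assuming.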
-----Original Message-----
From: Douglas Eadline [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 29, 2006 9:30 AM
To: Tony Ladd
Cc: [email protected]
Subject: Re: [Beowulf] Parallel application performance tests

Tony,

Interesting work, to say the least. A few comments.

The TCP implementation of OpenMPI is known to be sub-optimal (i.e., it can perform poorly in some situations). Indeed, using LAM over TCP usually provides much better numbers.

I have found that the single-socket Pentium D (now called the Xeon 3000 series) provides great performance. The big caches help quite a bit, plus it is a single socket (more sockets means more memory contention). That said, I believe that for the right applications GigE can be very cost effective. The TCP latency for the Intel NICs is actually quite good (~28 us) when the driver options are set properly, and GAMMA takes it to the next level.

I have not had time to read your report in its entirety, but I noticed your question about how GigE+GAMMA can do as well as Infiniband. Well, if the application does not need the extra throughput, then there will be no improvement. In the same way, the EP test in the NAS parallel suite is about the same for every interconnect (EP stands for Embarrassingly Parallel), while IS (Integer Sort), on the other hand, is very sensitive to latency. Now, with multi-socket/multi-core becoming the norm, better throughput will become more important. I'll have some tests posted before too long to show the difference on dual-socket quad-core systems.

Finally, OpenMPI+GAMMA would be really nice. The good news is that OpenMPI is very modular.

Keep up the good work.

--
Doug

> I have recently completed a number of performance tests on a Beowulf
> cluster, using up to 48 dual-core P4D nodes, connected by an Extreme
> Networks Gigabit edge switch. The tests consist of single and
> multi-node application benchmarks, including DLPOLY, GROMACS, and
> VASP, as well as specific tests of network cards and switches. I used
> TCP sockets with OpenMPI v1.2 and MPI/GAMMA over Gigabit Ethernet.
> MPI/GAMMA leads to significantly better scaling than OpenMPI/TCP in
> both network tests and in application benchmarks. The overall
> performance of the MPI/GAMMA cluster on a per-CPU basis was found to
> be comparable to a dual-core Opteron cluster with an Infiniband
> interconnect. The DLPOLY benchmark showed scaling similar to that
> reported for an IBM p690. The performance using TCP was typically a
> factor of 2 less in these same tests. Here are a couple of examples
> from DLPOLY benchmark 1 (27,000 NaCl ions); times are in seconds:
>
> CPUs   OpenMPI/TCP (P4D)   MPI/GAMMA (P4D)   OpenMPI/Infiniband (Opteron 275)
>   1          1255               1276                1095
>   2           614                635                 773
>   4           337                328                 411
>   8           184                173                 158
>  16           125                 95                  84
>  32            82                 56                  50
>  64            84                 34                  42
>
> A detailed write-up can be found at:
> http://ladd.che.ufl.edu/research/beoclus/beoclus.htm
>
> -------------------------------
> Tony Ladd
> Chemical Engineering
> University of Florida
> PO Box 116005
> Gainesville, FL 32611-6005
>
> Tel: 352-392-6509
> FAX: 352-392-9513
> Email: [EMAIL PROTECTED]
> Web: http://ladd.che.ufl.edu

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
