On Sat, Feb 14, 2009 at 6:43 PM, David Mathog <mat...@caltech.edu> wrote:
> Tiago Marques <a28...@ua.pt>
> >
> > I've been trying to get the best performance on a small cluster we
> > have here at University of Aveiro, Portugal, but I've not been able
> > to get most software to scale to more than one node.
>
> <SNIP>
>
> > The problem with this setup is that even calculations that take more
> > than 15 days don't scale to more than 8 cores, or one node. Usually
> > performance is lower with 16 cores, 12 cores, than with just 8. From
> > what I've been reading, I should be able to scale fine at least till
> > 16 cores and 32 for some software.
>
> <SNIP>
>
> > I tried with Gromacs to have two nodes using one processor each, to
> > check if 8 cores were stressing the GbE too much, and the performance
> > dropped too much compared with running two CPUs on the same node.
>
> Lots of possibilities here. Most of them are probably coming down to the
> code not being written to make good use of a cluster environment, and/or
> there not being any way to do that (single threaded code with a lot of
> unpredictable branching).
>
> For Gromacs I suggest you ask on that mailing list. My recollection is
> that it was known to scale poorly, but that was a couple of years ago,
> and maybe they have improved it since then. If it doesn't scale you can
> always get more throughput by running one independent job on each of
> your nodes, using local storage to avoid network contention to the file
> server. It may take 15 days to finish a run, but at least you'll have N
> times more work completed. Running N independent jobs will give you at
> least as much throughput as running 1 job on N cores. Admittedly it is
> nice to have the results in 1/Nth the time.

I already did that, but the people on the Gromacs list weren't very
helpful. They just told me to wait for version 4.0, which I did. It
scales better, though still not as well as I had hoped.
We've already been running a single job per node for months, but it
would be good to have the chance to run jobs faster; sometimes it's
needed.

> Some of what you may be seeing with poorer performance on more cores on
> one node is probably related to the effect on memory access, especially
> through cache. Code that can go in and out of cache runs much faster
> than anything which has to go to main memory, and as soon as you run two
> competing (which depends on architecture) processes you may find that
> the two programs are throwing each other's data out of any shared cache,
> which can result in dramatic slowdowns.
>
> Give gprof a shot too. You want to see where your code is spending most
> of its time. If it spends 95% of its time in routines with no network
> IO, then the network is likely not your issue. And vice versa.

I have thought of that, but I didn't manage to do it on the more
important codes. It compiles but just doesn't produce the profiling
output.
I have used "iftop" to measure network usage and it's probably around
300-400 Mbit/s, so I was pointing at latency as the problem; throughput
seems fine. While copying files with "scp", I can get 93 MB/s.

> > unexpected for me, since the benchmarks I've seen on Gromacs website
> > state that I should be able to have 100% scaling on this case,
> > sometimes more.
>
> Contact the person who said that, get the exact conditions, and see if
> you can replicate them. You might have a network issue, but unless you
> are comparing apples to apples it may be hard to figure it out.

True. Thanks for the help.
I must ask: doesn't anybody on this list run something like 16 cores on
two nodes well, for a code and job that completes in about a week? Or
does most code that finishes in a week or two only scale with InfiniBand
and the like, in something like 99% of the cases?
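
To make the latency-versus-throughput point above a bit more concrete,
here is a rough sketch of a simple linear cost model (time = latency +
size / bandwidth). The 93 MB/s bandwidth is the scp figure mentioned
above; the 50 microsecond per-message latency is only an assumed
placeholder, not something measured on this cluster, and the function
names are made up for illustration.

```python
# Linear cost model for one message: time = latency + size / bandwidth.
# BANDWIDTH_BYTES_PER_S comes from the scp figure quoted above.
# ASSUMED_LATENCY_S is a hypothetical value, NOT measured on this cluster.

BANDWIDTH_BYTES_PER_S = 93e6   # ~93 MB/s, as seen with scp
ASSUMED_LATENCY_S = 50e-6      # assumption: per-message latency in seconds


def transfer_time(message_bytes, latency_s=ASSUMED_LATENCY_S,
                  bandwidth=BANDWIDTH_BYTES_PER_S):
    """Estimated time in seconds to deliver one message."""
    return latency_s + message_bytes / bandwidth


def latency_fraction(message_bytes, latency_s=ASSUMED_LATENCY_S,
                     bandwidth=BANDWIDTH_BYTES_PER_S):
    """Share of the transfer time spent on latency (0.0 to 1.0)."""
    return latency_s / transfer_time(message_bytes, latency_s, bandwidth)


def crossover_bytes(latency_s=ASSUMED_LATENCY_S,
                    bandwidth=BANDWIDTH_BYTES_PER_S):
    """Message size at which latency and payload time are equal."""
    return latency_s * bandwidth


if __name__ == "__main__":
    # Show how the latency share shrinks as messages get larger.
    for size in (64, 1024, 16 * 1024, 1024 * 1024):
        share = latency_fraction(size)
        print(f"{size:>8} bytes: {share:.0%} of the time is latency")
    print(f"crossover at about {crossover_bytes():.0f} bytes")
```

Under these assumed numbers the crossover sits at about 4650 bytes, so
a code that exchanges many messages smaller than that would be limited
mostly by latency even though the link looks far from saturated in
iftop. The real latency would have to be measured before drawing any
conclusion.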

Best regards,
Tiago Marques

> Regards,
>
> David Mathog
> mat...@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf