On Thu, 2007-01-18 at 23:12 -0500, Mark Hahn wrote:
> >> "The Case of the Missing Supercomputing Performance"
> >
> > I wondered if you were talking about that paper but it's from lanl not
> > sandia, it should be essential reading for everyone working with large
> > clusters.
>
> I love this paper. but it's critical to realize that it's all about
> very large, very tightly-coupled, frequent-global-collective-using
> applications. you could easily have a 2k-node cluster (I'd call it large)
> dedicated to 1-to-100-core jobs and gleefully ignore jitter. or be running
> an 8k-core montecarlo that never needs any global synchronization, etc.
>
> I'd actually love to see data on whether jitter affects apps
> other than ah, "stockpile stewardship" ;)
In my experience, yes. Clearly some apps are more susceptible than others. At one extreme, even embarrassingly parallel apps can suffer from noise if the job is only considered complete when the last result is returned. Any app that performs synchronisation between nodes (even implicitly, with point-to-point comms) will allow delays caused by noise to propagate across the cluster, and unfortunately, because of the way these delays combine, the effect gets quite pronounced at scale.

Consider for example a 64-node cluster with one CPU per node; on this cluster there is a daemon which wakes up once a minute, spins for a second and goes back to sleep. Running a single-process job you can expect 59 out of every 60 elapsed seconds to go to the job. You probably don't worry about this. Now assume that you have a 64-way job which performs a global barrier every two seconds. In that two-second timeframe, statistically *at least one* node will be affected by noise, so the time spent on that node is two seconds for the application plus one for the daemon, and every other node sits at the barrier waiting for it. Each timestep now takes three seconds to achieve two seconds' worth of compute time; that's 33% of your wall-clock time down the drain.

In reality the figures I've given here are pessimistic: Linux doesn't have *that much* jitter, so smaller clusters are by and large unaffected. However it's a fairly common problem on 1024+-way clusters.

In answer to a previous post about using extra CPUs/cores to alleviate this problem: it's not a new idea, IIRC PSC were doing this six or seven years ago. I'd be interested to see if hyperthreading helps the situation. It's almost always turned off on any cluster over 32 CPUs, but it might be advantageous to enable it and use something like cpusets to bind the application to the real CPUs whilst letting the resource manager/Ganglia/sendmail twiddle its thumbs on the other virtual (roughly 20%) CPU.

Ashley,
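For anyone who wants to put numbers on the barrier example above, here is a rough toy simulation (Python) of 64 one-CPU nodes, each with a daemon that spins for one second once a minute, running two-second timesteps separated by global barriers. The daemon phases, random seed and step count are purely illustrative assumptions, not measurements from a real machine.

import math
import random

def finish_time(phase, t0, work, period=60.0, spin=1.0):
    """Wall-clock time at which `work` seconds of application CPU have
    been delivered, starting at t0, on a one-CPU node whose daemon
    wakes at phase + k*period and monopolises the CPU for `spin` s."""
    t, remaining = t0, work
    k = math.floor((t - phase) / period)   # most recent wakeup index
    while True:
        wake = phase + k * period
        if wake + spin <= t:               # that spin is already over
            k += 1
            continue
        if wake <= t:                      # daemon is spinning right now
            t = wake + spin
            k += 1
            continue
        gap = wake - t                     # quiet time before next wakeup
        if gap >= remaining:
            return t + remaining
        remaining -= gap                   # do some work, lose `spin` seconds
        t = wake + spin
        k += 1

random.seed(1)
NODES, STEPS, WORK = 64, 1000, 2.0         # 2 s of compute per timestep

# each node's daemon has its own (random) phase within the minute
phases = [random.uniform(0.0, 60.0) for _ in range(NODES)]

t = 0.0                                    # wall clock at the last barrier
for _ in range(STEPS):
    # all nodes leave the barrier together; the step ends when the
    # slowest node has accumulated WORK seconds of real compute
    t = max(finish_time(p, t, WORK) for p in phases)

ideal = STEPS * WORK
print("ideal time    : %8.1f s" % ideal)
print("with OS noise : %8.1f s  (%.0f%% of wall clock lost)"
      % (t, 100.0 * (t - ideal) / t))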
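And on the hyperthreading/cpusets idea, the binding itself is easy enough to sketch. The core numbering below is made up, so treat it as an illustration of the approach rather than a recipe for any particular machine:

import os

# Which logical CPUs are real cores and which are hyperthread siblings
# is machine specific -- these sets are purely illustrative; check
# /sys/devices/system/cpu/cpu*/topology/thread_siblings_list first.
REAL_CORES = {0, 1}      # one logical CPU per physical core (assumed)
HT_SIBLINGS = {2, 3}     # the "20%" virtual CPUs (assumed)

# Pin this (compute) process to the real cores.  A resource manager or
# init script could equally pin Ganglia/sendmail/etc. to HT_SIBLINGS
# with os.sched_setaffinity(their_pid, HT_SIBLINGS); cpusets or taskset
# would do the same job from outside the process.
os.sched_setaffinity(0, REAL_CORES)
print("compute process now bound to CPUs:", sorted(os.sched_getaffinity(0)))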