Hi Bill

I've tested FFTs rather extensively and run other codes that require a transpose. In my experience, a well-tuned gig-e network is capable of giving speedup, though not necessarily scaling that well. The most important thing is that you have full bisection bandwidth. Anything less will reduce your scaling. That is, if you use gig-e you can't trunk switches; you will need to stay within a single switch. Typically, I've seen a 16 cpu job on gig-e give about a 10 times speedup. Of course, it is processor/memory/nic dependent.
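For what it's worth, the reason bisection bandwidth is the whole game is that the transpose boils down to an all-to-all exchange: every cpu ships a block of the array to every other cpu. Here is a minimal C/MPI sketch of that redistribution step for a slab decomposition. It's my own illustration, not code from any particular FFT library; it assumes the grid divides evenly over the ranks and that the send buffer is already packed per destination, and the local FFTs and reordering are elided.

    /* Transpose step of a slab-decomposed 3D FFT (complex data stored as
       2 doubles per element).  Each rank starts with nx/nprocs x-planes
       of an nx*ny*nz array and has already done its 2D FFTs in the y-z
       planes.  To transform along x, the data is redistributed so each
       rank owns ny/nprocs y-planes instead.  That redistribution is an
       MPI_Alltoall: every rank exchanges an equal-sized block with every
       other rank. */
    #include <mpi.h>

    void transpose_slabs(double *slab, double *work,
                         int nx, int ny, int nz, MPI_Comm comm)
    {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);

        /* doubles in each rank-to-rank block */
        int block = (nx / nprocs) * (ny / nprocs) * nz * 2;

        /* every rank sends one block to, and receives one block from,
           every other rank at the same time */
        MPI_Alltoall(slab, block, MPI_DOUBLE,
                     work, block, MPI_DOUBLE, comm);

        /* a local re-shuffle of 'work' into x-contiguous lines and the
           1D FFTs along x would follow here */
    }

With P cpus that's P*(P-1) blocks in flight at once, which is exactly the traffic pattern that punishes anything short of full bisection bandwidth.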

I've also run FFTs on Quadrics Elan 3/4, IBM HPS, and SGI Numalink 4. Since these are considerably higher-bandwidth networks, they perform much better. On a 16 cpu job I've seen around 14 times speedup on these higher bandwidth networks.

As the size increases (say 256 cpu's) the networks that maintain full bisection bandwidth scale the best. There are very few reasonably priced gig-e switches that maintain full bisection bandwidth at 256 cpu's, while Quadrics and HPS do (though their starting price is high, at larger system sizes they become a realistic proposition). Numalink falls away a little due to its weird network topology (dual plane quad bristle fat tree), in which network connectivity per cpu drops as the system gets larger.
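To put rough numbers on that for the smallest case Bill mentions below (64 elements on an edge) - back-of-envelope arithmetic of mine, so take it as indicative only:

    64^3 complex doubles                = 262,144 * 16 bytes  ~ 4 MB total
    per-rank slab on 16 cpus                                  ~ 256 KB
    per-pair alltoall block at 16 cpus  = 4 MB / 16^2         ~ 16 KB
    per-pair alltoall block at 256 cpus = 4 MB / 256^2        ~ 64 bytes

So at the small end the transpose is as much a latency problem as a bandwidth one, and throwing 256 cpu's at a single 64^3 transform probably won't buy you much on any network; the bisection bandwidth argument really bites on the larger transforms.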

If you want to go with gig-e a few things to be aware of:

* The nic matters (pro1000MT's give 10-15% better performance than pro1000T's)

* Go with single cpu nodes - higher per cpu network bandwidth

* If you get dual-core cpu's, treat each node as a single-core node (let the 2nd core do all the tcp stuff)

I've played around with multiply connected nodes (nodes that have dual-ported nics) and the 2nd nic doesn't give you much (10-15%) and requires a fair bit of stuffing around to get working well. I think you would be better off running your global fs and other services over 1 nic and your mpi traffic over the other. At least this way, your fs and services shouldn't be stealing your bandwidth.

You may even try running mpi-gamma on the 2nd nic, which should give you better bandwidth, hence better scaling (I haven't tried this).

If you want real measured numbers, drop me a personal email.

Stu.



On 01/03/2006, at 2:26, Bill Rankin wrote:

Hey gang,

I know that in the past, multidimensional FFTs (in my case, 3D) have posed a big challenge in getting them running well on clusters, mainly in the area of scalability. This is somewhat due to the need for an All2All communication step in the processing (although there seem to be some alternative approaches here).

There is a research group here at Duke doing some application development and they are looking at implementing their codes in a cluster environment. The main problem is that 95% of their processing time is taken up by medium- to large-sized 3D FFTs (minimum 64 elements on an edge, 256k total elements).

So I was wondering what the current "state of the art" is in clustered 3D FFTs? I've googled around a bit, but most of the results seem a little dated. If someone could point me to any recent papers or studies, I would be grateful.

One specific I am interested in is a good comparison of how different interconnects affect overall performance, as this will have a significant impact on the design of their cluster.

Thanks,

-bill

--
Dr Stuart Midgley
[EMAIL PROTECTED]

