Michael Di Domenico wrote:
> I would have to agree. I have Netgears in my lab now and for light
> use they seem to be okay, but once you run a communications heavy MPI
> job over them they seem to fall down

Please define "fall down". One test I have applied to a switch (only 100baseT) to see if it could handle "full traffic" was running the script below on all nodes:

    #!/bin/bash
    # topology_info describes a linear chain through all nodes
    TINFO=`topology_info`
    # NEXT is the next node in the chain, or "none" on the last node
    NEXT=`echo $TINFO | extract -mt -cols [3]`
    if [ "$NEXT" != "none" ]
    then
      TIME=`accudate -t0`
      # push 4096 * 1000000 bytes of zeros to the next node and discard them there
      dd if=/dev/zero bs=4096 count=1000000 | rsh $NEXT 'cat - >/dev/null'
      # record the elapsed time for this node's transfer
      accudate -ds $TIME >/tmp/elapsed_${HOSTNAME}.txt
    fi

Here topology_info defines a linear chain through all nodes, and what ends up in the elapsed_HOSTNAME.txt files is the transmission time from this node to the next node. extract and accudate are mine; the former is like "cut" and the latter is just used here to calculate an elapsed time.

This is slightly apples and oranges, because in the two node (reference) test the target node is only accepting packets, whereas when they are all running it is also sending packets, and those compete with the ACKs going back to the first node.

The D-Link switch held up quite well, I thought. One pair of nodes tested this way completed in 350 seconds (+/-), whereas it and the others took 370-380 seconds when they were all running at once (20 compute nodes, first only sends, last only receives). That is, 11.7 MB/sec for the pair, 10.8 MB/sec for all pairs. For GigE it should come out at 117 and 108 (or so), if the switch can keep up.

I'm curious what the Netgears and HP do in a test like this. If anybody would like to try this, all the pieces for this simple test (if you can run binaries for a 32 bit x86 environment) are here:

http://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/testswitch.tar.gz

(For other platforms obtain source for accudate and extract from here:

http://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/drm_tools.tar.gz )

Start the jobs simultaneously on all nodes using whichever queue system you have installed. Be sure to run it once first with a small count number to force anything coming over NFS into cache before doing the big test.

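For anyone who wants to check the MB/sec figures quoted above, the arithmetic can be sketched as follows. This is only a minimal sketch: throughput_mb_s is a hypothetical helper (it is not part of testswitch.tar.gz), it assumes 1 MB = 10^6 bytes, and it takes the elapsed time as a plain number of seconds typed in by hand rather than parsing whatever accudate writes.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original test kit.
# Converts a dd transfer (block size * block count) and an elapsed time
# in seconds into MB/sec, assuming 1 MB = 10^6 bytes.
throughput_mb_s() {
    local bs=$1 count=$2 seconds=$3
    awk -v bs="$bs" -v count="$count" -v s="$seconds" \
        'BEGIN { printf "%.1f\n", (bs * count) / s / 1000000 }'
}

throughput_mb_s 4096 1000000 350   # single pair (reference run)
throughput_mb_s 4096 1000000 380   # all pairs running at once
```

With the dd parameters from the script (bs=4096, count=1000000) this gives 11.7 for 350 seconds and 10.8 for 380 seconds. On a GigE switch that keeps up, the same transfer would need to finish in roughly a tenth of the time to reach the 117 and 108 figures.
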
(Or one could run netpipe on each pair of nodes, or anything else really that loads the network.)

Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech