Hello all. I run a small Linux cluster using gigabit Ethernet as the interconnect. There are two families of nodes:
(*) Dual-processor AMD Athlon MP 2600+ (K7) nodes with onboard e1000 network interfaces, single port, on a 66 MHz, 64-bit PCI bus according to dmesg.

(*) Dual-processor AMD Opteron 246 (K8) nodes with onboard tg3 (BCM95704A7) network interfaces, dual ports, on a 100 MHz, 64-bit PCI-X bus according to dmesg.

The previous cluster admins were using a junk commodity Netgear gigabit switch. I just upgraded to a vastly better switch with support for jumbo frames, link aggregation, etc. I'm aware of some private tests with a network load generator showing that the new switch meets its advertised specs.

I've been trying to quantify the performance difference between the cluster running on the previous switch and on the new one, using the Intel MPI Benchmarks (IMB) as well as IOzone in network mode and IOR. In the previous configuration, the 64-bit nodes had only a single connection to the switch and the MTU was 1500. Under the new configuration, all nodes run with an MTU of 9000, and the 64-bit nodes with the tg3s use the Linux bonding driver to form 802.3ad aggregated links, with both ports in each aggregate link. I have not adjusted any sysctls or driver settings. The e1000 driver is version 7.3.20-k2-NAPI as shipped with the Linux kernel. The kernel is 2.6.22 and the MPI distribution is OpenMPI 1.2.4, across the board.

I've noticed some interesting performance results. On benchmarks with large MPI datasets and a lot of cross-communication, the new switch beats the old one by anywhere from 20% up to about 70%. Not that surprising, since the greatest advantage of the new switch over the old one should be its higher-capacity switching fabric. However, for a lot of the benchmarks, and especially at smaller dataset sizes, the performance was surprisingly close or significantly in favor of the old switch. On the parallel I/O tests, which wrote and read an NFS volume on the head node (also using link aggregation), the IOR results were slightly lower with the new switch than with the old one. That surprised me, given that jumbo frames were now in use and that the head node (same motherboard/network configuration as the 64-bit compute nodes) was using link aggregation. With IOzone, as the stride sizes increased, the new switch dominated the old one, but for the backward read test, as well as tests with smaller stride sizes, performance was often a toss-up. For small-to-moderate datasets, there were several cases in the IMB results where the old switch was better than the new one.

In trying to understand this, I noticed that ifconfig listed something like 2000-2500 dropped packets for the bonded interfaces on each node after a pass of IMB-MPI1 and IMB-EXT. The dropped-packet counts seem to be split roughly equally across the two bonded slave interfaces. Am I correct in taking this to mean that the incoming load on the bonded interface was simply too high for the node to service all the packets? I can also note that I tried both "layer2" and "layer3+4" for the "xmit_hash_policy" bonding parameter, without any significant difference. The switch itself uses only a layer2-based hash.

I'm sure some of the reason the new switch did not beat the previous one as decisively and across the board comes down to differing switch hardware strategies for packet forwarding, buffering, etc. But I'm more concerned with how much of the lackluster performance is due to my not having tuned the Linux networking environment and drivers in any way.
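In case it helps with diagnosing the drops, this is how I've been reading the counters so far; the statistic names that ethtool -S reports differ between drivers, so the grep pattern is just a catch-all, and eth2/eth3 stand in for whatever the two tg3 slave ports are actually named:

  # totals per interface, including the bond and its slaves
  cat /proc/net/dev
  ifconfig bond0

  # per-NIC statistics straight from the driver; counter names vary,
  # so match anything that looks like a drop or overrun
  ethtool -S eth2 | grep -i -E 'drop|discard|err|fifo'
  ethtool -S eth3 | grep -i -E 'drop|discard|err|fifo'

  # current vs. maximum RX/TX ring sizes (a small RX ring is one
  # common cause of drops under bursty traffic)
  ethtool -g eth2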
With all that in mind, I'd really appreciate your input on the following questions:

1. What general network/TCP tuning parameters (buffer sizes, etc.) should I change or experiment with? For older kernels, and especially the 2.4 series, changing the socket buffer sizes was recommended. However, various pieces of documentation, such as http://www.netapp.com/library/tr/3183.pdf, indicate that the newer 2.6 series kernels "auto-tune" these buffers. Is there still any benefit to adjusting them manually?

2. For the e1000, using the Linux kernel version of the driver, what are the relevant tuning parameters, and what have your experiences been with various values? There are knobs for the interrupt throttling rate, etc., but I'm not sure where to start.

3. For the tg3, again, what are the relevant tuning parameters, and what have your experiences been? I've found it harder to track down discussions of the tg3 tunables than of the e1000's.

4. What has your recent experience been with the Linux kernel bonding driver for 802.3ad link aggregation? What kind of throughput scaling have you seen, and what about processor load?

5. What suggestions are there for reducing the number of dropped packets?

To make these more concrete, I've appended the specific knobs and values I've been looking at so far.

Thanks for your advice and input.
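For question 1, these are the sysctls I had in mind; the values are lifted from generic gigabit tuning guides rather than anything I've validated on this cluster, so treat them as starting points for experiments:

  # raise the ceilings that the 2.6 autotuning is allowed to reach
  sysctl -w net.core.rmem_max=16777216
  sysctl -w net.core.wmem_max=16777216

  # min / default / max for the autotuned TCP buffers
  sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"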
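For question 2, the e1000 knobs I've found so far are the InterruptThrottleRate module parameter and the ring sizes via ethtool; the numbers below are only examples, and eth0 stands in for the e1000 port on the K7 nodes:

  # /etc/modprobe.conf: cap the NIC at roughly 8000 interrupts/s,
  # or set 0 to disable throttling altogether
  options e1000 InterruptThrottleRate=8000

  # check the hardware limits, then grow the rings toward them
  ethtool -g eth0
  ethtool -G eth0 rx 2048 tx 2048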
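For question 3, tg3 doesn't appear to take tuning module parameters the way e1000 does, so the only knobs I've located are the ethtool interrupt-coalescing and ring settings; the numbers are placeholders, and eth2 again stands in for one of the tg3 ports:

  # current interrupt coalescing and ring settings
  ethtool -c eth2
  ethtool -g eth2

  # trade a little latency for fewer interrupts, and grow the RX ring
  # toward whatever maximum ethtool -g reports
  ethtool -C eth2 rx-usecs 75 rx-frames 25
  ethtool -G eth2 rx 511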
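For question 4, my current setup on the tg3 nodes is along these lines, in case it matters for comparing experiences (the interface names and address are illustrative, and the real init scripts differ a bit):

  # /etc/modprobe.conf
  alias bond0 bonding
  options bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4

  # bring up the aggregate with jumbo frames and enslave both ports
  modprobe bonding
  ifconfig bond0 192.168.1.10 netmask 255.255.255.0 mtu 9000 up
  ifenslave bond0 eth2 eth3

  # confirm LACP negotiated and which hash policy is in effect
  cat /proc/net/bonding/bond0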
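And for question 5, beyond bigger RX rings the only other lead I have is the per-CPU input backlog, though I'm not sure how much it matters for NAPI drivers like these two:

  # how many packets the kernel will queue ahead of the protocol
  # stack before dropping when it falls behind
  cat /proc/sys/net/core/netdev_max_backlog
  sysctl -w net.core.netdev_max_backlog=2500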