I wonder what the LV impact is here. MD is the fastest I have seen on these units, with LV losing quite a bit of performance (20 percent or so, as I recall).
Regards
Joe

---
joe landman
[EMAIL PROTECTED]
+1 734 612 4615
(sent from cell phone ... please excuse brevity and typos)

-----Original Message-----
From: "Glen Dosey" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: "Jeff Blasius" <[EMAIL PROTECTED]>; "Beowulf" <beowulf@beowulf.org>
Sent: 8/23/2007 6:09 PM
Subject: Re: [Beowulf] Network Filesystems performance

On Thu, 2007-08-23 at 15:53 -0400, Joe Landman wrote:
<snip>
> Since you indicated RHEL4, its possible that something in kernel is
> causing problems. RHEL4 is not known to be a speed demon.

All the current testing is actually on RHEL5, 64-bit. It offered better performance than RHEL4. Everything in here refers to GigE and not InfiniBand (since we want to keep that for MPI).

Modified entries in sysctl include:

net.ipv4.tcp_window_scaling = 1
sunrpc.tcp_slot_table_entries = 128
net.core.netdev_max_backlog = 2500
net.core.wmem_max = 83886080
net.core.rmem_max = 83886080
net.core.wmem_default = 6553600
net.core.rmem_default = 6553600
net.ipv4.tcp_rmem = 4096 6553600 83886080
net.ipv4.tcp_wmem = 4096 6553600 83886080

> What about the usual suspects
>
> cat /proc/interrupts

[EMAIL PROTECTED] ~]# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:  176708034  176467911  178788831  178782166    IO-APIC-edge  timer
  1:        167        112          0        246    IO-APIC-edge  i8042
  8:          0          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 12:        189        185         51         57    IO-APIC-edge  i8042
 50:      69688     963247    1227309     270953   IO-APIC-level  qla2xxx
 58:      15112      96722      96347       7613   IO-APIC-level  qla2xxx
 66:   47398161          0          0          0   IO-APIC-level  eth0
 74:          5          0      21502          0   IO-APIC-level  eth1
217:          0          0          0          0   IO-APIC-level  ohci_hcd:usb1, libata
225:          1          0          0          0   IO-APIC-level  ehci_hcd:usb2
233:      30917     188965     204419      94202   IO-APIC-level  libata
NMI:       2933       2685       2485       1792
LOC:  710654972  710659916  710660939  710658940
ERR:          0
MIS:          0

> blockdev --getra /dev/sda

We're using logical volumes, with an 8192 sector read ahead on the LV and disk.

> ...
> lspci -v
>
> Is your gigabit sharing a 100/133 MB/s old PCI bus with your RAID card?
> On older motherboards, the gigabit NICs were put on an old PCI branch,
> typically 100 MB/s max. If there is a PCI RAID card in the same slot,
> or, as also often happened on these older MB's, the SATA ports were
> hanging off the same old/slow PCI bus, well, it could explain your results.

We're running Altus 1300 systems. There is just a QLA242 in the system on the PCI-X bus. There is no RAID; the storage is handled externally via the FC. Here's the output from lspci -tv:

[EMAIL PROTECTED] rules.d]# lspci -tv
-+-[0000:06]-+-01.0-[0000:07]--
 |           +-01.1  Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
 |           +-02.0-[0000:08]--+-01.0  QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA
 |           |                 \-01.1  QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA
 |           \-02.1  Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
 \-[0000:00]-+-00.0  nVidia Corporation CK804 Memory Controller
             +-01.0  nVidia Corporation CK804 ISA Bridge
             +-01.1  nVidia Corporation CK804 SMBus
             +-02.0  nVidia Corporation CK804 USB Controller
             +-02.1  nVidia Corporation CK804 USB Controller
             +-06.0  nVidia Corporation CK804 IDE
             +-07.0  nVidia Corporation CK804 Serial ATA Controller
             +-08.0  nVidia Corporation CK804 Serial ATA Controller
             +-09.0-[0000:01]----07.0  ATI Technologies Inc Rage XL
             +-0b.0-[0000:02]----00.0  Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
             +-0c.0-[0000:03]--
             +-0d.0-[0000:04]----00.0  Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
             +-0e.0-[0000:05]--

The systems have 4GB RAM and are dual Opteron 285. The externally attached Xyratex 5200 storage is connected via 2Gbit fibre through a Qlogic switch to a 12 disk array. The array uses a hardware RAID controller configured for 10+1 RAID 5 with a hot spare and 128K chunks, for a total 1280K stripe. The ext3 filesystem was created with a stride of 32. The partition table and volume labels were each offset by 128MB to account for disk alignment with stripe writes. The disks are 500GB Seagate SATA drives, model ST3500641NS.
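The stripe and stride figures above can be cross-checked with a little arithmetic. The following is only a sketch: the 4K ext3 block size is an assumption (the message does not state it), and the function names are made up for illustration.

```python
# Sanity check of the RAID geometry quoted above.
# ASSUMPTION: ext3 uses 4 KiB blocks; the message does not state the block size.

CHUNK_KIB = 128      # per-disk chunk size, from the message
DATA_DISKS = 10      # 10+1 RAID 5: ten data chunks plus one parity chunk per stripe
FS_BLOCK_KIB = 4     # assumed ext3 block size (hypothetical, not in the message)


def full_stripe_kib(chunk_kib: int, data_disks: int) -> int:
    """Data carried by one full stripe, excluding the parity chunk."""
    return chunk_kib * data_disks


def ext3_stride(chunk_kib: int, fs_block_kib: int) -> int:
    """Number of filesystem blocks that fit in one RAID chunk."""
    if chunk_kib % fs_block_kib:
        raise ValueError("chunk size must be a multiple of the block size")
    return chunk_kib // fs_block_kib


stripe_kib = full_stripe_kib(CHUNK_KIB, DATA_DISKS)
stride = ext3_stride(CHUNK_KIB, FS_BLOCK_KIB)

print(f"full stripe: {stripe_kib}K")
print(f"stride: {stride} blocks")
```

Under that block-size assumption the numbers agree with the 1280K stripe and the stride of 32 given above; a different block size would change the stride but not the stripe.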
The array controller has the read ahead disabled and a 256MB writeback cache enabled. This is the only system utilizing the array/enclosure/controller. The filesystem is 4.9TB in size.

Here's a set of Bonnie++ numbers if it matters (sorry for the formatting, copied from an html file):

Ext3 8G 57304 90 92685 34 52007 12 66123 90 178088 19 401.9 0
16:786432:0/16 47 5 112 4 1782 19 49 6 41 1 378 5

Or the ever popular (but totally unrealistic) series of dd tests.

Read on NFS server:

[EMAIL PROTECTED] ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
510933+0 records in
510932+0 records out
2092777472 bytes (2.1 GB) copied, 12.6766 seconds, 165 MB/s

(disk was unmounted on server to clear cache)

Read from NFS client:

[EMAIL PROTECTED] ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
418341+0 records in
418340+0 records out
1713520640 bytes (1.7 GB) copied, 30.2718 seconds, 56.6 MB/s

Write on NFS client:

[EMAIL PROTECTED] ~]# dd if=/dev/zero of=/mnt/array3/file.dd bs=4k count=256000
256000+0 records in
256000+0 records out
1048576000 bytes (1.0 GB) copied, 10.1124 seconds, 104 MB/s

Now we unmount the NFS share, recreate the file on the server, and remount it to clear the client cache but leave it cached on the server:

[EMAIL PROTECTED] ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
524287+0 records in
524287+0 records out
2147479552 bytes (2.1 GB) copied, 18.5161 seconds, 116 MB/s

Since our NFS is over TCP, here are the iperf test results, which basically confirm the above dd results.

[EMAIL PROTECTED] ~]# ./iperf -c server
------------------------------------------------------------
Client connecting to server, TCP port 5001
TCP window size: 6.25 MByte (default)
------------------------------------------------------------
[  3] local client port 37325 connected with server port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   941 Mbits/sec

iftop confirms the basic numbers I've been talking about.
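To put the dd and iperf figures on the same scale, a rough conversion sketch follows. It assumes dd's "MB" means 10**6 bytes (which matches the byte counts and times quoted above) and treats the iperf rate as the usable TCP ceiling; the labels and function names are illustrative only.

```python
# Compare the dd results quoted above with the iperf line rate.
# ASSUMPTION: MB here means 10**6 bytes, as in the dd output.

IPERF_MBIT_PER_S = 941.0  # from the iperf output

# (label, bytes copied, seconds), taken from the dd runs in the message
DD_RUNS = [
    ("read on NFS server", 2092777472, 12.6766),
    ("first read from NFS client", 1713520640, 30.2718),
    ("write on NFS client", 1048576000, 10.1124),
    ("client read, file cached on server", 2147479552, 18.5161),
]


def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert megabits per second to megabytes per second."""
    return mbit_per_s / 8.0


def dd_rate_mbyte(nbytes: int, seconds: float) -> float:
    """Throughput of one dd run in MB/s (10**6 bytes per second)."""
    return nbytes / seconds / 1e6


wire_mb = mbit_to_mbyte(IPERF_MBIT_PER_S)
print(f"iperf: {wire_mb:.1f} MB/s")
for label, nbytes, seconds in DD_RUNS:
    rate = dd_rate_mbyte(nbytes, seconds)
    share = 100.0 * rate / wire_mb
    print(f"{label}: {rate:.1f} MB/s ({share:.0f}% of the iperf rate)")
```

The local server read is included only for reference; it never crosses the network, so its share of the iperf rate exceeds 100%.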
Additionally, I have been graphing per-port utilization on the Qlogic FC switch. It confirms the numbers I've been seeing on the disk side of things and helps determine whether the file is in cache or not (or partially). atop shows basically the same as iostat does: on the initial read the FC disk is about 85 percent utilized and the network is about 50 percent utilized. No other resource seems to be close to its limit. On subsequent reads the disk is not touched and the network is 100 percent utilized.

I have never used dstat before. I will read up on it and see if it reveals anything interesting.

>
> Which MB do you have? Which bios rev, ... Which raid card, how much
> ram, 32 or 64 bit, yadda yadda yadda (all the details you didnt give
> before).
>
> Joe
>

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf