RE: [Beowulf] Network Filesystems performance

joe landman Thu, 23 Aug 2007 15:36:24 -0700

I wonder what the LV impact is here.  MD is the fastest I have seen on these 
units, with LV losing quite a bit of performance (20 percent or so, as I recall).


Regards

Joe
---
joe landman
[EMAIL PROTECTED] 
+1 734 612 4615
(sent from cell phone ... please excuse brevity and typos)

-----Original Message-----
From: "Glen Dosey" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: "Jeff Blasius" <[EMAIL PROTECTED]>; "Beowulf" <beowulf@beowulf.org>
Sent: 8/23/2007 6:09 PM
Subject: Re: [Beowulf] Network Filesystems performance

On Thu, 2007-08-23 at 15:53 -0400, Joe Landman wrote:
<snip>
> Since you indicated RHEL4, it's possible that something in the kernel is
> causing problems.  RHEL4 is not known to be a speed demon.

All the current testing is actually on 64-bit RHEL5, which offered better
performance than RHEL4. Everything in here refers to GigE and not
InfiniBand (since we want to keep that for MPI).

Modified entries in sysctl include:
net.ipv4.tcp_window_scaling = 1
sunrpc.tcp_slot_table_entries = 128
net.core.netdev_max_backlog = 2500
net.core.wmem_max = 83886080
net.core.rmem_max = 83886080
net.core.wmem_default = 6553600
net.core.rmem_default = 6553600
net.ipv4.tcp_rmem = 4096 6553600 83886080
net.ipv4.tcp_wmem = 4096 6553600 83886080
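
For a rough sense of scale, the sketch below compares these buffer sizes
with the bandwidth-delay product of a GigE link. The 200 microsecond
round-trip time is an assumed LAN figure, not something measured in this
thread, and the helper names bdp_bytes and to_mib are made up for
illustration.

```python
# Rough comparison of the socket buffer settings above with the
# bandwidth-delay product (BDP) of a gigabit Ethernet link.
# ASSUMPTION: 200 microsecond round-trip time; not measured in this thread.

LINK_BITS_PER_SEC = 1_000_000_000   # GigE line rate
ASSUMED_RTT_US = 200                # hypothetical LAN round-trip time

TCP_RMEM_DEFAULT = 6553600          # middle value of net.ipv4.tcp_rmem above
TCP_RMEM_MAX = 83886080             # last value of net.ipv4.tcp_rmem above


def bdp_bytes(bits_per_sec: int, rtt_us: int) -> int:
    """Bytes in flight needed to keep the link full for one round trip."""
    return bits_per_sec * rtt_us // (8 * 1_000_000)


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    bdp = bdp_bytes(LINK_BITS_PER_SEC, ASSUMED_RTT_US)
    print(f"BDP at assumed RTT: {bdp} bytes")
    print(f"default buffer:     {to_mib(TCP_RMEM_DEFAULT):.2f} MiB")
    print(f"max buffer:         {to_mib(TCP_RMEM_MAX):.2f} MiB")
```

The 6.25 MiB default is the same figure iperf reports as its default TCP
window size further down.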

> What about the usual suspects
> 
>       cat /proc/interrupts

[EMAIL PROTECTED] ~]# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       
  0:  176708034  176467911  178788831  178782166    IO-APIC-edge  timer
  1:        167        112          0        246    IO-APIC-edge  i8042
  8:          0          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 12:        189        185         51         57    IO-APIC-edge  i8042
 50:      69688     963247    1227309     270953   IO-APIC-level  qla2xxx
 58:      15112      96722      96347       7613   IO-APIC-level  qla2xxx
 66:   47398161          0          0          0   IO-APIC-level  eth0
 74:          5          0      21502          0   IO-APIC-level  eth1
217:          0          0          0          0   IO-APIC-level  ohci_hcd:usb1, libata
225:          1          0          0          0   IO-APIC-level  ehci_hcd:usb2
233:      30917     188965     204419      94202   IO-APIC-level  libata
NMI:       2933       2685       2485       1792 
LOC:  710654972  710659916  710660939  710658940 
ERR:          0
MIS:          0


>       blockdev --getra /dev/sda

We're using logical volumes, with an 8192-sector read-ahead on both the LV
and the disk.


> ...
>       lspci -v
> 
> Is your gigabit sharing a 100/133 MB/s old PCI bus with your RAID card?
> On older motherboards, the gigabit NICs were put on an old PCI branch,
> typically 100 MB/s max.  If there is a PCI RAID card in the same slot,
> or, as also often happened on these older MB's, the SATA ports were
> hanging off the same old/slow PCI bus, well, it could explain your results.

We're running Altus 1300 systems. There is just a QLA242 in the system
on the PCI-X bus. There is no RAID card; the storage is handled externally
via FC.  Here's the output from lspci -tv:

[EMAIL PROTECTED] rules.d]# lspci -tv
-+-[0000:06]-+-01.0-[0000:07]--
 |           +-01.1  Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
 |           +-02.0-[0000:08]--+-01.0  QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA
 |           |                 \-01.1  QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA
 |           \-02.1  Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
 \-[0000:00]-+-00.0  nVidia Corporation CK804 Memory Controller
             +-01.0  nVidia Corporation CK804 ISA Bridge
             +-01.1  nVidia Corporation CK804 SMBus
             +-02.0  nVidia Corporation CK804 USB Controller
             +-02.1  nVidia Corporation CK804 USB Controller
             +-06.0  nVidia Corporation CK804 IDE
             +-07.0  nVidia Corporation CK804 Serial ATA Controller
             +-08.0  nVidia Corporation CK804 Serial ATA Controller
             +-09.0-[0000:01]----07.0  ATI Technologies Inc Rage XL
             +-0b.0-[0000:02]----00.0  Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
             +-0c.0-[0000:03]--
             +-0d.0-[0000:04]----00.0  Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
             +-0e.0-[0000:05]--

The systems have 4GB RAM and are dual Opteron 285.

The externally attached Xyratex 5200 storage is connected via 2Gbit
fibre through a Qlogic switch to a 12-disk array. The array uses a
hardware RAID controller configured for 10+1 RAID 5 with a hot spare
and 128K chunks, for a total stripe of 1280K. The ext3 filesystem was
created with a stride of 32. The partition table and volume labels were
each offset by 128MB to account for disk alignment with stripe writes.
The disks are 500GB Seagate SATA drives, model ST3500641NS. The array
controller has read-ahead disabled and a 256MB write-back cache enabled.
This is the only system utilizing the array/enclosure/controller. The
filesystem is 4.9TB in size.
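
The stripe and stride figures above can be reproduced with a short
sketch. It assumes a 4 KiB ext3 block size, which is not stated in this
message but is the value that makes a 128K chunk come out to a stride of
32. The function names are hypothetical.

```python
# Stripe geometry for the array described above.
# ASSUMPTION: ext3 uses 4 KiB blocks (the block size is not stated here).

CHUNK_KIB = 128         # per-disk chunk size from the RAID controller config
DATA_DISKS = 10         # 10+1 RAID 5: ten data chunks plus one parity per stripe
ASSUMED_BLOCK_KIB = 4   # hypothetical ext3 block size


def stride_blocks(chunk_kib: int, block_kib: int) -> int:
    """Filesystem blocks per RAID chunk (the ext3 'stride' value)."""
    return chunk_kib // block_kib


def full_stripe_kib(chunk_kib: int, data_disks: int) -> int:
    """Data carried by one full stripe, excluding parity."""
    return chunk_kib * data_disks


if __name__ == "__main__":
    print("stride (blocks):  ", stride_blocks(CHUNK_KIB, ASSUMED_BLOCK_KIB))
    print("full stripe (KiB):", full_stripe_kib(CHUNK_KIB, DATA_DISKS))
```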


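Here's a set of Bonnie++ numbers, if it matters (sorry for the formatting,
copied from an html file):

Ext3, 8G file size
                          K/sec   %CPU
Seq. output, per char     57304     90
Seq. output, block        92685     34
Seq. output, rewrite      52007     12
Seq. input, per char      66123     90
Seq. input, block        178088     19
Random seeks (/sec)       401.9      0

Files 16:786432:0/16
                           /sec   %CPU
Seq. create                  47      5
Seq. read                   112      4
Seq. delete                1782     19
Random create                49      6
Random read                  41      1
Random delete               378      5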

Or the ever-popular (but totally unrealistic) series of dd tests:

Read on NFS server
[EMAIL PROTECTED] ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
510933+0 records in
510932+0 records out
2092777472 bytes (2.1 GB) copied, 12.6766 seconds, 165 MB/s

(disk was unmounted on server to clear cache)

Read from NFS client
[EMAIL PROTECTED] ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
418341+0 records in
418340+0 records out
1713520640 bytes (1.7 GB) copied, 30.2718 seconds, 56.6 MB/s

Write on NFS client
[EMAIL PROTECTED] ~]# dd if=/dev/zero of=/mnt/array3/file.dd bs=4k count=256000
256000+0 records in
256000+0 records out
1048576000 bytes (1.0 GB) copied, 10.1124 seconds, 104 MB/s

Now we unmount the NFS share, recreate the file on the server, and remount it 
to clear the client cache but leave the file cached on the server:

[EMAIL PROTECTED] ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
524287+0 records in
524287+0 records out
2147479552 bytes (2.1 GB) copied, 18.5161 seconds, 116 MB/s



Since our NFS is over TCP, here are the iperf test results, which basically 
confirm the above dd results.

[EMAIL PROTECTED] ~]# ./iperf -c server
------------------------------------------------------------
Client connecting to server, TCP port 5001
TCP window size: 6.25 MByte (default)
------------------------------------------------------------
[  3] local client port 37325 connected with server port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   941 Mbits/sec
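
To put the dd and iperf figures on the same scale, the sketch below
recomputes each dd rate from its byte count and elapsed time and
expresses the NFS client runs as a fraction of the iperf rate. It assumes
dd and iperf both use decimal units here (1 MB = 1,000,000 bytes,
1 Mbit = 1,000,000 bits); the function names are hypothetical.

```python
# Recompute the dd rates above and compare them with the iperf result.
# ASSUMPTION: decimal units throughout (1 MB = 1,000,000 bytes).

IPERF_MBITS_PER_SEC = 941  # from the iperf run above

# (bytes copied, elapsed seconds), copied from the dd output above
NFS_CLIENT_RUNS = {
    "client read, cold server cache": (1713520640, 30.2718),
    "client write": (1048576000, 10.1124),
    "client read, warm server cache": (2147479552, 18.5161),
}


def mb_per_sec(n_bytes: int, seconds: float) -> float:
    """Throughput in decimal megabytes per second."""
    return n_bytes / seconds / 1_000_000


def wire_mb_per_sec(mbits_per_sec: float) -> float:
    """Convert a rate in Mbit/s to MB/s."""
    return mbits_per_sec / 8


if __name__ == "__main__":
    wire = wire_mb_per_sec(IPERF_MBITS_PER_SEC)
    print(f"iperf: {wire:.1f} MB/s")
    for label, (n_bytes, seconds) in NFS_CLIENT_RUNS.items():
        rate = mb_per_sec(n_bytes, seconds)
        print(f"{label}: {rate:.1f} MB/s ({rate / wire:.0%} of iperf rate)")
```

On these numbers the cold-cache client read runs at roughly half of what
iperf measured, while the read served from the server's cache comes close
to the full rate.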



iftop confirms the basic numbers I've been talking about. Additionally I
have been graphing per port utilization on the Qlogic FC switch and it
confirms the numbers I've been seeing on the disk side of things and
helps determine if the file is in cache or not (or partially).

atop shows basically the same thing iostat does, which is that on the
initial read the FC disk is about 85% utilized and the network is about
50% utilized. No other resource seems to be close to its limit. On
subsequent reads the disk is not touched and the network is 100%
utilized.

I have never used dstat before. I will read up on it and see if it
reveals anything interesting.


> 
> Which MB do you have?  Which BIOS rev, ...  Which RAID card, how much
> RAM, 32 or 64 bit, yadda yadda yadda (all the details you didn't give
> before).
> 
> Joe
> 


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
