[Beowulf] Slow RAID reads, no errors logged, why?

David Mathog Mon, 19 Mar 2018 13:58:55 -0700

On one of our CentOS 6.9 systems with a PERC H730 controller I just noticed that file system reads are quite slow. Like 30Mb/s slow. Anybody care to hazard a guess what might be causing this situation? We have another quite similar machine which is fast (A), compared to this (B) which is slow:

            A             B
RAM         512           512           GB
CPUs        48            56            (via /proc/cpuinfo, actually this is threads)
Adapter     H710P         H730
RAID Level  *             *             Primary-5, Secondary-0, RAID Level Qualifier-3
Size        7.275         9.093         TB
state       *             *             Optimal
Drives      5             6
read rate   540           30            Mb/s (dd if=largefile bs=8192 of=/dev/null & iotop)
sata disk   ST2000NM0033
sas disk                  ST2000NM0023
patrol      No            No            (megacli shows patrol read not going now)
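
For anyone who wants to repeat the read-rate measurement without dd and iotop, here is a minimal Python sketch of the same idea: read a file front to back in 8192-byte requests and report the rate. The function name sequential_read_rate and the temporary demo file are illustrative assumptions, not anything from the original test. Note that a file which was recently written or read is likely to come from the page cache, so for a number comparable to the table the target should be a large file that is not already cached.

```python
import os
import tempfile
import time


def sequential_read_rate(path, block_size=8192):
    """Read `path` front to back in block_size requests.

    Returns (bytes_read, rate in 10^6 bytes per second).
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 gives unbuffered reads, so each read() asks the OS
    # for at most block_size bytes, similar to dd with bs=8192.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, total / elapsed / 1e6


if __name__ == "__main__":
    # Hypothetical demo file. It was just written, so it is served from
    # the page cache and the printed rate says nothing about the disks.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 * 1024 * 1024))
    try:
        nbytes, rate = sequential_read_rate(tmp.name)
        print(f"read {nbytes} bytes at {rate:.1f} MB/s")
    finally:
        os.unlink(tmp.name)
```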


ulimit -a on both is:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2067196
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 60000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Nothing in the SMART values indicating a read problem, although on "B" one disk is slowly accumulating events in the write x rereads/rewrites measurement (it has 2346, accumulated at about 10 per week). The value is 0 there for reads x rereads/rewrites. For "B" the smartctl output columns are:


         Errors Corrected by            Total       Correction   Gigabytes     Total
         ECC                  rereads/  errors      algorithm    processed     uncorrected
         fast      | delayed  rewrites  corrected   invocations  [10^9 bytes]  errors

read:    934353848   0        0         934353848   0            48544.026     0
read:    2017672022  0        0         2017672022  0            48574.489     0
read:    2605398517  3        0         2605398520  3            48516.951     0
read:    3237457411  1        0         3237457412  1            48501.302     0
read:    2028103953  0        0         2028103953  0            14438.132     0
read:    197018276   0        0         197018276   0            48640.023     0

write:   0           0        0         0           0            26394.472     0
write:   0           0        2346      2346        2346         26541.534     0
write:   0           0        0         0           0            27549.205     0
write:   0           0        0         0           0            25779.557     0
write:   0           0        0         0           0            11266.293     0
write:   0           0        0         0           0            26465.227     0

verify:  341863005   0        0         341863005   0            241374.368    0
verify:  866033815   0        0         866033815   0            223849.660    0
verify:  2925377128  0        0         2925377128  0            221697.809    0
verify:  1911833396  6        0         1911833402  6            228054.383    0
verify:  192670736   0        0         192670736   0            66322.573     0
verify:  1181681503  0        0         1181681503  0            222556.693    0
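
Rows like these are easy to scan by script when there are many disks. The following is a rough Python sketch, assuming each counter row has exactly the eight whitespace-separated fields shown above; the helper names parse_counter_rows and flag_rereads are made up for illustration. The index it reports is just the position of the row within its read/write/verify group, since the post does not say which physical disk each row belongs to.

```python
FIELDS = ("fast", "delayed", "rereads_rewrites", "total_corrected",
          "invocations", "gigabytes", "uncorrected")


def parse_counter_rows(text):
    """Parse 'read:', 'write:' and 'verify:' rows into dicts keyed by column."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip headers, blank lines, and anything not shaped like a counter row.
        if len(parts) != 8 or parts[0] not in ("read:", "write:", "verify:"):
            continue
        row = {"op": parts[0].rstrip(":")}
        for name, value in zip(FIELDS, parts[1:]):
            row[name] = float(value) if name == "gigabytes" else int(value)
        rows.append(row)
    return rows


def flag_rereads(rows):
    """Return (op, index within op, count) for rows with nonzero rereads/rewrites."""
    seen = {}
    flagged = []
    for row in rows:
        idx = seen.get(row["op"], 0)
        seen[row["op"]] = idx + 1
        if row["rereads_rewrites"] > 0:
            flagged.append((row["op"], idx, row["rereads_rewrites"]))
    return flagged


# Three rows copied from the output above.
SAMPLE = """\
write: 0 0 0 0 0 26394.472 0
write: 0 0 2346 2346 2346 26541.534 0
read: 2605398517 3 0 2605398520 3 48516.951 0
"""

if __name__ == "__main__":
    print(flag_rereads(parse_counter_rows(SAMPLE)))
```

On the sample rows this flags only the second write row, the one with the 2346 count.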

If the process doing the IO is root it doesn't go any faster.

Oddly if on "B" a second dd process is started on another file it ALSO reads at 30Mb/s. So the disk system then does a total of 60Mb/s, but only 30Mb/s per process. Added a 3rd and a 4th process doing the same. At the 4th it seemed to hit some sort of limit, with each process now consistently less than 30Mb/s and the total at maybe 80Mb/s. Hard to say what the exact total was as it was jumping around like crazy. On "A" 2 processes each got 270Mb/s, and 3 180Mb/s.  Didn't try 4.
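
To make the comparison explicit, the aggregate rate is just the number of readers times the per-process rate. The small Python sketch below works that out from the per-process figures quoted above (in Mb/s, as written in this post). The 3 and 4 reader cases on "B" are left out because no exact per-process numbers were recorded for them. The names PER_PROCESS and aggregate are illustrative only.

```python
# Per-process read rates, keyed by number of concurrent dd readers.
PER_PROCESS = {
    "A": {1: 540, 2: 270, 3: 180},
    "B": {1: 30, 2: 30},  # no exact figures for 3 or 4 readers on B
}


def aggregate(rates):
    """Total rate for each reader count: readers * per-process rate."""
    return {n: n * rate for n, rate in rates.items()}


if __name__ == "__main__":
    for host, rates in PER_PROCESS.items():
        print(host, aggregate(rates))
```

On these numbers "A" stays at a flat 540 total however the load is split, while "B" goes from 30 to 60 when a second reader is added, so on "B" it is the per-process rate that is capped at low reader counts rather than the total.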

The only oddness of late on "B" is that a few days ago it loaded too many memory hungry processes so the OS killed some. I have had that happen before on other systems without them doing anything odd afterwards.


Any ideas what this slowdown might be?

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
