[Beowulf] LSI Megaraid stalls system on very high IO?

mathog Thu, 31 Jul 2014 10:21:03 -0700

Any pointers on why a system might appear to "stall" on very high IO through an LSI Megaraid adapter? (dm_raid45, on RHEL 5.10.)

I have been working on another group's big Dell server, which has 16 CPUs, 82 GB of memory, and five 1 TB disks which go through an LSI Megaraid (not sure of the exact configuration, and their system admin is out sick) and show up as /dev/sd[abc], where the first two are just under 2 TB and the third is /boot and is about 133 GB. sda and sdb are then combined through LVM into one big volume, and that is what is mounted.


Yesterday on this system when I ran 14 copies of this simultaneously:

  # X is 0-13
  gunzip -c bigfile${X}.gz > resultfile${X}

the first time, part way through, all of my terminals locked up for several minutes, and then recovered. Another similar command had the same issue about half an hour later, but others between and since did not stall. The size of the files unpacked is only about 0.5 GB, so even if the entire file was stored in memory in the pipes, all 14 should have fit in main memory. Nothing else was running (at least that I noticed before or after; something might have started up during the run and ended before I could look for it). During this period the system would still answer pings. Nothing showed up in /var/log/messages or dmesg, "last" showed nobody else had logged in, and overnight runs of "smartctl -t long" on the 5 disks were clean: nothing pending, no reallocation events.
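
For anyone wanting to reproduce the load, the 14 simultaneous copies described above could be started along these lines. This is only a sketch: the function name run_parallel_gunzip and the loop are assumptions, not the exact commands that were typed; only the bigfile${X}.gz and resultfile${X} names come from the message.

```shell
#!/bin/sh
# Hypothetical helper (not from the original message): start one gunzip per
# input file in the background, then wait for all of them to finish.
# Arguments: first and last value of X, inclusive.
run_parallel_gunzip() {
    first=$1
    last=$2
    X=$first
    while [ "$X" -le "$last" ]; do
        # Each job decompresses to stdout and redirects into its own result file.
        gunzip -c "bigfile${X}.gz" > "resultfile${X}" &
        X=$((X + 1))
    done
    # Block until every background gunzip has exited.
    wait
}
```

Calling "run_parallel_gunzip 0 13" in the directory holding the .gz files would give the 14 concurrent writers; prefixing the gunzip line with "nice -n 10" is one way to lower their CPU priority.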

Today I ran the first set of commands again with "nice 10" and had "top" going; nothing untoward was observed and there were no stalls. On that run iostat showed:


Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda            6034.00         0.00    529504.00          0     529504
sda5           6034.00         0.00    529504.00          0     529504
dm-0          68260.00      2056.00    546008.00       2056     546008
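
To put those figures in more familiar units, the Blk_wrtn/s column can be converted to MiB/s. This sketch assumes iostat's blocks are 512-byte sectors, as the sysstat man page describes for 2.4 and later kernels; the function name blocks_to_mib is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert an iostat blocks-per-second value to MiB/s.
# Assumption: one iostat block is a 512-byte sector.
blocks_to_mib() {
    # $1 = a Blk_read/s or Blk_wrtn/s value, e.g. 529504.00
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b * 512 / (1024 * 1024) }'
}
```

Under that assumption, "blocks_to_mib 529504.00" prints 258.5 for sda and "blocks_to_mib 546008.00" prints 266.6 for dm-0, so the array was absorbing roughly 260 MiB/s of writes during the run that did not stall.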

So why the apparent stalls yesterday? It felt like either my interactive processes were swapped out or they had a much lower priority than enough other processes so that they were not getting any CPU time. Is there some sort of housekeeping that the Megaraid, LVM, or anything normally installed with RHEL 5.10 might need to do, from time to time, that would account for these stalls?


Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
