Hi,

I hope your issue has been resolved in the meantime. I had a somewhat similar, mixed experience with Dell-branded LSI controllers. It would appear that some models are just not suited to particular workloads. I have put some information on our blog at http://www.gridpp.rl.ac.uk/blog/2013/06/14/lsi-1068e-issues-understood-and-resolved/
Cheers,
Dimitris

On Thu, Jul 31, 2014 at 7:37 PM, mathog <[email protected]> wrote:
> Any pointers on why a system might appear to "stall" on very high IO through
> an LSI megaraid adapter? (dm_raid45, on RHEL 5.10.)
>
> I have been working on another group's big Dell server, which has 16 CPUs,
> 82 GB of memory, and 5 1TB disks which go through an LSI Megaraid (not sure
> of the exact configuration and their system admin is out sick) and show up
> as /dev/sda[abc], where the first two are just under 2 TB and the third is
> /boot and is about 133 Gb. sda and sdb are then combined through lvm into
> one big volume and that is what is mounted.
>
> Yesterday on this system when I ran 14 copies of this simultaneously:
>
> # X is 0-13
> gunzip -c bigfile${X}.gz > resultfile${X}
>
> the first time, part way through, all of my terminals locked up for several
> minutes, and then recovered. Another similar command had the same issue
> about half an hour later, but others between and since did not stall. The
> size of the files unpacked is only about 0.5Gb, so even if the entire file
> was stored in memory in the pipes all 14 should have fit in main memory.
> Nothing else was running (at least that I noticed before or after, something
> might have started up during the run and ended before I could look for it.)
> During this period the system would still answer pings. Nothing showed up
> in /var/log/messages or dmesg, "last" showed nobody else had logged in, and
> overnight runs of "smartctl -t long" on the 5 disks were clean - nothing
> pending, no reallocation events.
>
> Today ran the first set of commands again with "nice 10" and had "top" going
> and nothing untoward was observed and there were no stalls. On that run
> iostat showed:
>
> Device:        tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda        6034.00         0.00    529504.00          0     529504
> sda5       6034.00         0.00    529504.00          0     529504
> dm-0      68260.00      2056.00    546008.00       2056     546008
>
> So why the apparent stalls yesterday? It felt like either my interactive
> processes were swapped out or they had a much lower priority than enough
> other processes so that they were not getting any CPU time. Is there some
> sort of housekeeping that the Megaraid, LVM, or anything normally installed
> with RHEL 5.10, might need to do, from time to time, that would account for
> these stalls?
>
> Thanks,
>
> David Mathog
> [email protected]
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, [email protected] sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
