On Wed, 2011-08-17 at 17:38 -0400, Micah Anderson wrote:
> Package: linux-image-2.6.32-5-686-bigmem
> Version: 2.6.32-35
> Severity: important
> Tags: squeeze
>
> I have two machines that I upgraded to squeeze and migrated their ext3
> filesystems to ext4 due to very high i/o and deep directory hierarchy. These
> two machines have been crashing regularly since the ext4 upgrade. The other
> machines that I have that are running the squeeze kernel and ext3 are not
> crashing at all.
>
> When I upgraded the two crashy machines to the backports kernel, the crashes
> stopped. The crashes were happening at least 2x a week, sometimes much more
> frequently. Since the upgrade to the BPO kernel, the machines haven't crashed
> once in two months.
>
> Both machines were showing console logs when they crashed that were similar:
> either they had nothing on them at all, or they had the following (in some
> cases magic-sysrq worked, sometimes it didn't).
>
> It seems pretty clear to me that there are some instability issues with ext4
> in the squeeze kernel. After discussion with Ted Tso on the subject, he
> indicated that there were a number of ext4 fixes that have been done that
> have not been backported to the squeeze kernel.
>
> What follows are a few of the different things we saw on the console when the
> machine hung:
None of these logs show crashes.

> 1.
> hoopoe login: [51589.926858] Uniform Multi-Platform E-IDE driver
> [51589.943819] ide-cd driver 5.00
> [51589.982978] ide-gd driver 1.18

I hope you're not actually using the ide-cd driver.

> [51590.039277] st: Version 20081215, fixed bufsize 32768, s/g segs 256
> [51590.262980] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
> [51590.269224] EDD information not available.
> [137993.853645] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
> [137993.860140] EDD information not available.
> [138361.949699] INFO: task rdiff-backup:28337 blocked for more than 120 seconds.
> [138361.957345] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [138361.965791] rdiff-backup D f6967700 0 28337 28335 0x00000000
> [138361.972772] e8df2640 00200086 c5808e20 f6967700 f696772c c143de20 c143de20 c1439354
> [138361.999573] e8df27fc c5808e20 00000000 c143de20 ea15f800 c5808e20 ea15f800 c127eb36
> [138362.008374] c5804354 e8df27fc 020e12de c143de20 c143de20 00000000 00000000 00000000
> [138362.034480] Call Trace:
> [138362.037276] [<c127eb36>] ? schedule+0x78f/0x7dc
> [138362.042407] [<c127f28f>] ? __mutex_lock_common+0xe8/0x13b
> [138362.048233] [<c127f2f1>] ? __mutex_lock_slowpath+0xf/0x11
> [138362.054239] [<c127f382>] ? mutex_lock+0x17/0x24
> [138362.059403] [<c127f382>] ? mutex_lock+0x17/0x24
> [138362.081061] [<c10d40c5>] ? sync_filesystems+0xf/0xbb
> [138362.104668] [<c10d41a3>] ? sys_sync+0xe/0x29
> [138362.109660] [<c100813b>] ? sysenter_do_call+0x12/0x28

Probably the disk is being thrashed so sync takes a very long time. As I
understand it, Linux 2.6.32 had some major changes to writeback (delayed
writes to disk) which made improvements to behaviour in some situations but
had regressions in others. Unfortunately there isn't a simple fix that can
be cherry-picked.

It could also be a locking bug but I kind of doubt it.
If you're able to see whether there is ongoing disk I/O then that could
confirm which is the case.

> 2.
> hoopoe login: [17163.173748] Uniform Multi-Platform E-IDE driver
> [17163.187023] ide-cd driver 5.00
> [17163.216390] ide-gd driver 1.18
> [17163.269327] st: Version 20081215, fixed bufsize 32768, s/g segs 256
> [17163.425212] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
> [17163.431412] EDD information not available.
> [32426.998664] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> [32427.005956] ata4.00: failed command: WRITE FPDMA QUEUED
> [32427.011225] ata4.00: cmd 61/08:00:57:a2:33/00:00:3a:00:00/40 tag 0 ncq 4096 out
> [32427.011227] res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> [32427.026096] ata4.00: status: { DRDY }
> [32427.029818] ata4: hard resetting link
> [32427.509533] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [32427.553582] ata4.00: configured for UDMA/133
> [32427.557906] ata4: EH complete

Likely hardware (or drive firmware) problem, though it could possibly be a
bug in libata.

> 3.
[...]

Same.

> 4.
[...]

Probably disk thrashing again.

If the problem in cases 1 and 4 really is disk thrashing, it may be worth
trying to tune writeback via sysctl vm.dirty_ratio, as explained in
https://lwn.net/Articles/399148/

Cases 2 and 3 are clearly different; you should open a separate bug report
if you think they are not hardware/firmware issues.

Ben.
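
For reference, a rough sketch of one way to check for ongoing disk I/O
during a hang (cases 1 and 4), should a shell still respond. It samples
/proc/diskstats twice and reports how many writes each device completed
in between, plus the number of requests still in flight. The column
positions assume the 14-field layout described in the kernel's
Documentation/iostats.txt. The function names (parse_diskstats,
write_activity, report) and the 5-second interval are made up for
illustration and are not part of any existing tool.

```python
import time


def parse_diskstats(text):
    """Map device name -> (writes_completed, sectors_written, ios_in_progress)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or malformed lines
        # fields: major minor name, then the I/O counters
        stats[fields[2]] = (int(fields[7]), int(fields[9]), int(fields[11]))
    return stats


def write_activity(before, after):
    """Return per-device write deltas between two parse_diskstats() snapshots."""
    activity = {}
    for name, (writes, sectors, in_flight) in after.items():
        if name not in before:
            continue  # device appeared between samples
        old_writes, old_sectors, _ = before[name]
        activity[name] = {
            "writes": writes - old_writes,
            "sectors_written": sectors - old_sectors,
            "in_flight": in_flight,  # current value, not a delta
        }
    return activity


def read_file(path):
    with open(path) as f:
        return f.read()


def report(interval=5):
    """Print write activity over `interval` seconds and the writeback settings."""
    first = parse_diskstats(read_file("/proc/diskstats"))
    time.sleep(interval)
    second = parse_diskstats(read_file("/proc/diskstats"))
    for name, delta in sorted(write_activity(first, second).items()):
        print(name, delta)
    # Current writeback thresholds (percent of memory that may be dirty).
    for knob in ("dirty_ratio", "dirty_background_ratio"):
        print("vm." + knob, "=", read_file("/proc/sys/vm/" + knob).strip())
```

Calling report() while rdiff-backup is blocked in sync would be the
intended use. Steadily rising write counts with a non-zero in_flight
value would point towards slow writeback; counters that stay frozen while
the task remains blocked would make a locking problem look more likely.
The vm.dirty_ratio and vm.dirty_background_ratio values it prints are the
ones that could then be adjusted with sysctl -w.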