Hi Ben--

Thanks for the quick followup!

On 08/07/2011 12:36 PM, Ben Hutchings wrote:
> On Fri, 2011-08-05 at 18:36 -0400, Daniel Kahn Gillmor wrote:
>> We've applied the attached patch (a simple workaround to ensure no
>> division-by-zero) to the debian packages for several weeks in production
>> (over a month on some machines) and haven't seen a recurrence of the
>> problem.
>
> This doesn't really fix the bug - division by zero is just a symptom of
> a more fundamental problem which has yet to be identified.

yep, that's why i called it a workaround :)

> As a result,
> it hasn't been accepted upstream and won't be accepted in Debian.
>
> That said, I would consider applying a variant that WARNs before 'fixing
> up' the zero divisor, as a *temporary* measure to aid in understanding
> the bug (more like
> <https://bugzilla.kernel.org/show_bug.cgi?id=16991#c13>).

That sounds reasonable to me. Are you up for preparing such a patch or
do you need me to do it?

> I notice your 'oops' messages show 'Tainted: G W' which indicates there
> was an earlier kernel warning. What was the previous warning?

hmm, we've seen this on multiple machines, and they didn't all have a
prior warning. in the referenced machine, though, it was 5 months
previously, a netdev watchdog timeout.
It doesn't seem related to me, but i'm happy to include the dump here in
case anyone else can extract meaning from it:

>> 2011-01-04_10:28:18.85061 [3129874.324489] ------------[ cut here
>> ]------------
>> 2011-01-04_10:28:18.89235 [3129874.329286] WARNING: at
>> /build/buildd-linux-2.6_2.6.32-28-amd64-EUJiNq/linux-2.6-2.6.32/debian/build/source_amd64_none/net/sched/sch_generic.c:261
>> dev_watchdog+0xe2/0x194()
>> 2011-01-04_10:28:18.89236 [3129874.344808] Hardware name: PowerEdge R410
>> 2011-01-04_10:28:18.89237 [3129874.348981] NETDEV WATCHDOG: eth0 (bnx2):
>> transmit queue 1 timed out
>> 2011-01-04_10:28:18.89238 [3129874.355561] Modules linked in: btrfs
>> zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat
>> jfs xfs exportfs reiserfs ext4 jbd2 crc16 ext2 bridge stp kvm_intel kvm tun
>> loop snd_pcm snd_timer snd soundcore snd_page_alloc dcdbas pcspkr psmouse
>> serio_raw evdev button power_meter processor ext3 jbd mbcache sha256_generic
>> aes_x86_64 aes_generic cbc dm_crypt dm_mod raid1 md_mod sd_mod crc_t10dif sg
>> sr_mod cdrom ata_generic uhci_hcd mpt2sas ehci_hcd thermal ata_piix
>> thermal_sys usbcore nls_base scsi_transport_sas libata scsi_mod bnx2 [last
>> unloaded: scsi_wait_scan]
>> 2011-01-04_10:28:18.89240 [3129874.408913] Pid: 0, comm: swapper Not tainted
>> 2.6.32-5-amd64 #1
>> 2011-01-04_10:28:18.89240 [3129874.415063] Call Trace:
>> 2011-01-04_10:28:18.89242 [3129874.417740] <IRQ> [<ffffffff81261c12>] ?
>> dev_watchdog+0xe2/0x194
>> 2011-01-04_10:28:18.89243 [3129874.424219] [<ffffffff81261c12>] ?
>> dev_watchdog+0xe2/0x194
>> 2011-01-04_10:28:18.89244 [3129874.430018] [<ffffffff8104dd6c>] ?
>> warn_slowpath_common+0x77/0xa3
>> 2011-01-04_10:28:18.89245 [3129874.436423] [<ffffffff81261b30>] ?
>> dev_watchdog+0x0/0x194
>> 2011-01-04_10:28:18.89246 [3129874.442131] [<ffffffff8104ddf4>] ?
>> warn_slowpath_fmt+0x51/0x59
>> 2011-01-04_10:28:18.89247 [3129874.448276] [<ffffffff81041b41>] ?
>> enqueue_task_fair+0x3e/0x82
>> 2011-01-04_10:28:18.89248 [3129874.454420] [<ffffffff8103fbfa>] ?
>> task_rq_lock+0x46/0x79
>> 2011-01-04_10:28:18.89249 [3129874.460132] [<ffffffff8104a252>] ?
>> try_to_wake_up+0x2a7/0x2b9
>> 2011-01-04_10:28:18.89250 [3129874.466191] [<ffffffff81261b04>] ?
>> netif_tx_lock+0x3d/0x69
>> 2011-01-04_10:28:18.89250 [3129874.471989] [<ffffffff8124c97c>] ?
>> netdev_drivername+0x3b/0x40
>> 2011-01-04_10:28:18.89251 [3129874.478132] [<ffffffff81261c12>] ?
>> dev_watchdog+0xe2/0x194
>> 2011-01-04_10:28:18.89252 [3129874.483930] [<ffffffff8103a9cd>] ?
>> __wake_up_common+0x44/0x72
>> 2011-01-04_10:28:18.89253 [3129874.489992] [<ffffffff81057560>] ?
>> cascade+0x5f/0x77
>> 2011-01-04_10:28:18.89253 [3129874.495278] [<ffffffff8105a337>] ?
>> run_timer_softirq+0x1c9/0x268
>> 2011-01-04_10:28:18.89254 [3129874.501594] [<ffffffff81053aaf>] ?
>> __do_softirq+0xdd/0x1a2
>> 2011-01-04_10:28:18.89256 [3129874.507398] [<ffffffff8102419a>] ?
>> lapic_next_event+0x18/0x1d
>> 2011-01-04_10:28:18.89256 [3129874.513458] [<ffffffff81011cac>] ?
>> call_softirq+0x1c/0x30
>> 2011-01-04_10:28:18.89257 [3129874.519166] [<ffffffff8101322b>] ?
>> do_softirq+0x3f/0x7c
>> 2011-01-04_10:28:18.89261 [3129874.524774] [<ffffffff8105391e>] ?
>> irq_exit+0x36/0x76
>> 2011-01-04_10:28:19.85162 [3129874.530164] [<ffffffff81024c68>] ?
>> smp_apic_timer_interrupt+0x87/0x95
>> 2011-01-04_10:28:19.85163 [3129874.536911] [<ffffffff81011673>] ?
>> apic_timer_interrupt+0x13/0x20
>> 2011-01-04_10:29:45.93714 x9d/0xb8 [processor]
>> 2011-01-04_10:29:45.93717 [3129874.551277] [<ffffffffa01c024c>] ?
>> acpi_idle_enter_c1+0x78/0xb8 [processor]
>> 2011-01-04_10:29:45.93718 [3129874.558550] [<ffffffff81238f62>] ?
>> cpuidle_idle_call+0x94/0xee
>> 2011-01-04_10:29:45.93719 [3129874.564695] [<ffffffff8100feb1>] ?
>> cpu_idle+0xa2/0xda

hth,

--dkg