Just a small update, but without any good news. I never got a single reply or inquiry from the kernel.org developers, neither on the mailing list nor from the tg3 driver developers. It seems I've hit a dead end, as I'm out of options (except for asking Dell for Intel NICs for all my servers until this gets fixed, but I doubt that will go through ...).
As an extra, we had another server crash last week, with exactly the same behaviour as the other ones, so this is not a single isolated case: other server/VM/iSCSI combinations are affected as well. If any of you has a suggestion for how this can be pursued further, any help would be appreciated.

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1331513

Title:
  14e4:165f tg3 eth1: transmit timed out, resetting on BCM5720

Status in The Dell PowerEdge project:
  Triaged
Status in “linux” package in Ubuntu:
  Triaged
Status in “linux-lts-saucy” package in Ubuntu:
  Confirmed

Bug description:
  We have a problem with Dell PowerEdge machines that have the Broadcom 5720 chip. We have this problem on generation 12 systems, across different models (R420, R620), with several combinations of BIOS firmwares, Lifecycle firmwares, etc. We see this on several versions of the Linux kernel, ranging from 3.2.x up to 3.11, with several versions of the tg3 driver, including a manually compiled latest version (3.133d) loaded into a 3.11 kernel. The latest machine on which we can reproduce the problem has Ubuntu Precise installed, but we also see this behaviour on Debian machines.

  We run Xen on it, with HVM guests. Storage is handled over iSCSI. The iSCSI interface is the one on which we can trigger this bug reproducibly; we have the impression it also happens on other interfaces, but there we don't have a solid case with a reproducible setup.

  All this points in the direction of the tg3 driver and/or the hardware below it not handling certain data streams or data patterns correctly, and finally crashing the system. It seems unrelated to the kernel version, the Xen version, the number of VMs running, the firmware versions and revisions, etc.

  We have been trying to pinpoint this for over a year now, without being able to create a scenario in which we could reproduce it. As of this week, we finally found a specific setup in which we can trigger the error within a reasonable time.

  The error is triggered by running a certain VM on the Xen stack and, inside that VM, importing a mysqldump into the MySQL server running on that VM. The VM's storage traffic goes to an iSCSI volume, so this effectively generates a data stream over the eth1 interface of the machine. Within a short amount of time, the system crashes in two steps. First, we see a timeout in the tg3 driver on the eth1 interface (dmesg output section attached). This sometimes repeats two or three times. Then, in step two, the machine freezes and reboots.

  While debugging, we noticed that the bug goes away when we disable sg offloading with ethtool.

  If you need any additional info, feel free to ask.

  ProblemType: Bug
  DistroRelease: Ubuntu 12.04
  Package: linux-image-3.11.0-19-generic 3.11.0-19.33~precise1
  ProcVersionSignature: Ubuntu 3.11.0-19.33~precise1-generic 3.11.10.5
  Uname: Linux 3.11.0-19-generic x86_64
  AlsaDevices:
   total 0
   crw-rw---T 1 root audio 116, 1 Jun 18 16:36 seq
   crw-rw---T 1 root audio 116, 33 Jun 18 16:36 timer
  AplayDevices: Error: [Errno 2] No such file or directory
  ApportVersion: 2.0.1-0ubuntu17.6
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
  Date: Wed Jun 18 16:47:27 2014
  HibernationDevice: RESUME=UUID=f3577e02-64e3-4cab-b6e7-f30efa111565
  InstallationMedia: Ubuntu-Server 12.04.4 LTS "Precise Pangolin" - Release amd64 (20140204)
  MachineType: Dell Inc. PowerEdge R420
  MarkForUpload: True
  PciMultimedia:
  ProcFB:
  ProcKernelCmdLine: placeholder root=UUID=bbc71780-90bf-4647-b579-e48d5d8c2bce ro vga=0x317
  RelatedPackageVersions:
   linux-restricted-modules-3.11.0-19-generic N/A
   linux-backports-modules-3.11.0-19-generic N/A
   linux-firmware 1.79.12
  RfKill: Error: [Errno 2] No such file or directory
  SourcePackage: linux-lts-saucy
  UpgradeStatus: No upgrade log present (probably fresh install)
  dmi.bios.date: 01/20/2014
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.1.2
  dmi.board.name: 0JD6X3
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A00
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.1.2:bd01/20/2014:svnDellInc.:pnPowerEdgeR420:pvr:rvnDellInc.:rn0JD6X3:rvrA00:cvnDellInc.:ct23:cvr:
  dmi.product.name: PowerEdge R420
  dmi.sys.vendor: Dell Inc.
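
For anyone who wants to try the workaround mentioned in the bug description (disabling sg offloading with ethtool), here is a rough sketch of how it could be scripted. It assumes that "sg" refers to the scatter-gather feature that `ethtool -K <iface> sg off` toggles, and that `ethtool -k <iface>` reports it on a line starting with `scatter-gather:`. The helper names (`sg_off_command`, `parse_sg_state`, `disable_sg`) are made up for this example and are not part of the report or of any tool.

```python
import subprocess
from typing import List, Optional


def sg_off_command(iface: str) -> List[str]:
    """Build the ethtool call that turns scatter-gather offload off."""
    return ["ethtool", "-K", iface, "sg", "off"]


def parse_sg_state(features_output: str) -> Optional[bool]:
    """Return True/False for the top-level 'scatter-gather' line in
    `ethtool -k` output, or None if no such line is present.
    The output layout is an assumption and may differ between versions."""
    for line in features_output.splitlines():
        name, sep, value = line.partition(":")
        # Only the top-level line counts, not 'tx-scatter-gather' etc.
        if sep and name.strip() == "scatter-gather":
            fields = value.split()
            return bool(fields) and fields[0] == "on"
    return None


def disable_sg(iface: str = "eth1") -> Optional[bool]:
    """Disable scatter-gather on iface (needs root), then read the state back."""
    subprocess.run(sg_off_command(iface), check=True)
    result = subprocess.run(
        ["ethtool", "-k", iface],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_sg_state(result.stdout)
```

Calling `disable_sg("eth1")` as root on the affected host should return `False` if the setting took effect. The change does not survive a reboot or a driver reload, so it would have to be reapplied from whatever brings the interface up. This only works around the symptom and says nothing about the underlying cause in the driver or hardware.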