On Tue, 10 Jan 2006, Erik Mouw wrote:
I have lots of transmit timeouts with an Intel E1000 card during large TCP transmissions (remotely viewing a 3000x2000 jpeg image using XV is an excellent way to trigger it). This is what I get in linux-2.6.8.1:
Sorry to hear you're having a problem, and cool, thanks for the test, we'll have to try it here. We've classically had problems reproducing the Athlon-based hangs.
Jan 10 15:24:41 zurix kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 10 15:24:41 zurix kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Jan 10 15:24:46 zurix kernel: nfs: server abra2 not responding, still trying
Jan 10 15:24:46 zurix kernel: nfs: server abra2 OK

And this is with linux-2.6.15:

Jan 10 06:53:27 zurix kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jan 10 06:53:27 zurix kernel: TDH <b0>
Jan 10 06:53:27 zurix kernel: TDT <b0>
Jan 10 06:53:27 zurix kernel: next_to_use <b0>
Jan 10 06:53:27 zurix kernel: next_to_clean <c3>
Jan 10 06:53:27 zurix kernel: buffer_info[next_to_clean]
Jan 10 06:53:27 zurix kernel: dma <e938a5e>
Jan 10 06:53:27 zurix kernel: time_stamp <872de93>
Jan 10 06:53:27 zurix kernel: next_to_watch <c3>
Jan 10 06:53:27 zurix kernel: jiffies <872e086>
Jan 10 06:53:27 zurix kernel: next_to_watch.status <0>
Ugh, I don't get it: there is no way in the code that I know of that we would not update TDT when we enqueued a transmit.
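For anyone comparing several of these "Detected Tx Unit Hang" dumps, here is a rough Python sketch that pulls the hex fields out of the log text and computes two numbers: how many jiffies passed between time_stamp and jiffies, and how far next_to_clean is behind next_to_use. The function names and the ring size of 256 are assumptions for illustration only (use whatever TxDescriptors value the card is actually configured with), and it ignores jiffies wraparound.

```python
import re

# Fields printed as "<name> <hex>" in the hang dump. Lines such as
# "next_to_watch.status <0>" and "buffer_info[next_to_clean]" do not match.
FIELD_RE = re.compile(
    r"\b(TDH|TDT|next_to_use|next_to_clean|time_stamp|next_to_watch|jiffies)"
    r"\s+<([0-9a-fA-F]+)>"
)

SAMPLE_DUMP = """\
kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
kernel: TDH <b0>
kernel: TDT <b0>
kernel: next_to_use <b0>
kernel: next_to_clean <c3>
kernel: buffer_info[next_to_clean]
kernel: dma <e938a5e>
kernel: time_stamp <872de93>
kernel: next_to_watch <c3>
kernel: jiffies <872e086>
kernel: next_to_watch.status <0>
"""


def parse_hang_dump(text):
    """Return a dict mapping field name to its integer value (parsed as hex)."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(text)}


def summarize(fields, ring_size=256):
    """Derive simple figures from the parsed fields.

    ring_size is an ASSUMPTION, not something the dump reports.
    """
    return {
        # Jiffies elapsed since the buffer at next_to_clean was stamped.
        "age_jiffies": fields["jiffies"] - fields["time_stamp"],
        # Descriptors between next_to_clean and next_to_use, modulo the ring.
        "uncleaned_descriptors":
            (fields["next_to_use"] - fields["next_to_clean"]) % ring_size,
        # Whether the hardware head and tail registers agree.
        "head_equals_tail": fields["TDH"] == fields["TDT"],
    }


if __name__ == "__main__":
    print(summarize(parse_hang_dump(SAMPLE_DUMP)))
```

For the dump quoted above, time_stamp and jiffies are 0x1f3 (499) jiffies apart; how long that is in seconds depends on the kernel's HZ setting, which the log does not show.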
These problems (for us) seem to be related to TSO. Can you disable it and try your test again, using:
ethtool -K eth0 tso off
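To confirm the setting actually took before re-running the test, `ethtool -k eth0` (lowercase k) lists the offload state. If you want to check it from a script, something like the following Python sketch might do. The helper names are made up for illustration, and the exact label ethtool prints for TSO has varied between versions, so the sketch accepts both "tcp segmentation offload" and "tcp-segmentation-offload"; treat that matching as an assumption and compare against what your ethtool prints.

```python
import subprocess


def tso_state(ethtool_k_output):
    """Return 'on', 'off', or None if no TSO line is found in the text."""
    for line in ethtool_k_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        # Normalize so hyphenated and space-separated labels compare equal.
        label = key.strip().lower().replace("-", " ")
        if label == "tcp segmentation offload":
            words = value.split()
            # Take the first word only, ignoring suffixes such as "[fixed]".
            return words[0] if words else None
    return None


def query_tso(iface="eth0"):
    """Run `ethtool -k <iface>` and report the TSO state (needs ethtool installed)."""
    result = subprocess.run(
        ["ethtool", "-k", iface],
        capture_output=True, text=True, check=True,
    )
    return tso_state(result.stdout)
```

Calling `query_tso("eth0")` after the `ethtool -K` command above should then return "off" if the change was accepted.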
The system is an AMD Athlon XP 2000+ running at 1.666 GHz with a VIA KT400 chipset (Asrock K7VT4APro).
Ah yes, this is the famous one that seems to get lots of problem reports. You are running the latest BIOS, right? Seems lame, but that has actually fixed problems here.
Here's the relevant output from lspci:
<snip>
So far I have replaced the NIC, the motherboard, the power supply, RAM, network cable, and gigE switch, but to no avail. I've tried three different kernels (2.6.8.1, 2.6.11-ac7, and 2.6.15) but the problem remains. I've been stress testing the system by continuously compiling kernels (over NFS), but after 288 runs there hasn't been a single error so I guess the CPU and RAM are OK. The amount of transmit timeouts is less with linux-2.6.8.1, so for the moment I keep running that version.
Wow, that's a lot of work. I'm almost at the point of a personal crusade against these timeout issues. The biggest block we have to solving them is the lack of a local reproduction.
We have about 15 other machines using the Intel E1000, but I haven't seen these kind of problems on any of the other machines. I have run out of ideas, so I hope somebody knows how to solve this. If you need more information, just let me know.
Like I said, try disabling TSO and see if that helps. Please also try driver 6.3.9 from prdownloads.sf.net/e1000 and see if that changes anything.
Thanks,
Jesse