There are a couple of problems here:
  1) the transmitter is getting hung.
  2) the recovery logic doesn't work.
If I can reproduce the hang, then maybe the recovery code can be fixed.

Let's address the transmitter hang first. The transmitter has multiple
stages, so it could be any of:

a) hardware flow control problems
   Look at the 'ethtool -S eth0' statistics: are there flow control
   packets showing up?

b) GMAC or RAM buffer issues
   Looking at 'ethtool -d eth0' output can help, but finding these
   setup errors is a needle in a haystack. The sky2 driver copies most
   of its setup from the vendor version of sk98lin, so if sk98lin works
   and sky2 doesn't, then comparing register settings can give hints.

c) DMA problems
   For some problems, I have had luck adding a /proc interface and
   dumping the transmit ring after a hang. Looking at the last control
   block that hung can help. This found the case where IPv6 TSO was
   leaking through.

d) checksum problems
   Turning off tx scatter/gather forces non-fragmented skb's. This
   hurts performance, but can tell if the problem is with the fragment
   code. Turning off tx checksum turns off scatter/gather, checksumming
   and TSO.

e) possible alignment and flow control interaction
   The receive DMA engine has hardware bugs and requires alignment, or
   it doesn't work with flow control. I still wonder if there are
   alignment bugs on Tx with flow control.

f) other driver bug

To save time, I'll go get a new Mac Mini and try to clone this setup.
Could you send me a full kernel config (and other setup information
like filesystem type, distro, etc.)?

> -- I assume this is just the same problem exhibiting on a
> kernel with soft lockups detection enabled?
>
> Hopefully I should be able to actually log into one of
> these machines over an alternate connection next time the
> problem recurs, at which point I should be able to get
> ethtool -d output. Anything else I should do at that
> point?
>
> Any suggestions for what to do next to chase this problem
> down? I haven't yet tried the sk98lin driver on this
> hardware; is that still worth doing? Are there any useful
> tests we should try?
> Unfortunately, though these crashes
> happen pretty frequently (several times per day
> typically), I don't have a test case to reproduce one;
> however, if it'd be useful, I can probably get a pcap
> trace of the period immediately before the interface falls
> over using port mirroring on the switch to which the
> machines are connected. Is that likely to be informative?
>

The vendor driver does some slightly different setup, but it also does a
hardware reset when inactive (every 10ms).

--
Stephen Hemminger <[EMAIL PROTECTED]>
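
For the register comparison suggested in (b), something along these lines
might help. It is only a sketch: it assumes both dumps were saved in
ethtool's generic hex layout (for example via 'ethtool -d eth0 hex on',
and assuming both drivers can produce a dump at all), with rows of the
form "0x0010:  de ad be ef". The names parse_hex_dump and diff_dumps are
made up for illustration and are not part of ethtool or either driver.

```python
import re

# One row of a generic hex dump, e.g. "0x0010:  de ad be ef".
# The exact layout is an assumption; header and separator lines are skipped.
ROW = re.compile(r"^\s*0x([0-9a-fA-F]+):\s+((?:[0-9a-fA-F]{2}\s*)+)$")


def parse_hex_dump(text):
    """Return {byte_offset: value} for every byte found in a hex dump."""
    regs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # not a data row (header, separator, blank line)
        base = int(m.group(1), 16)
        for i, byte in enumerate(m.group(2).split()):
            regs[base + i] = int(byte, 16)
    return regs


def diff_dumps(good, bad):
    """List (offset, good_value, bad_value) for every byte that differs.

    A value of None means that offset is missing from that dump.
    """
    a, b = parse_hex_dump(good), parse_hex_dump(bad)
    return [
        (off, a.get(off), b.get(off))
        for off in sorted(set(a) | set(b))
        if a.get(off) != b.get(off)
    ]


def format_diff(diffs):
    """Render the differences one per line, with '--' for a missing byte."""
    def fmt(v):
        return "--" if v is None else f"{v:02x}"
    return "\n".join(f"0x{off:04x}: {fmt(x)} -> {fmt(y)}" for off, x, y in diffs)
```

Feeding it the text of a dump taken under the working driver and one
taken after the hang, then printing format_diff(diff_dumps(good, bad)),
gives a short list of offsets to look up in the register definitions.
Many differences will be counters or status bits rather than setup
errors, so this narrows the search rather than pointing at the bug.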