Hi,

I hope this is the right place to ask. This is my first time doing real kernel development for something useful, and I've spent a lot of time on this, so apologies for the length.

I am trying to make the stmmac driver work on a HiSilicon HI3535 SoC (an ARM Cortex-A9 based SoC targeted at network video recorder applications). I found a kernel on GitHub that boots and whose stmmac driver works just fine, but it is a 3.4 kernel (link below). I've ported forward what I could, but that kernel's stmmac driver includes support for TCP offload and so contains quite a bit of extra code, so for the stmmac driver I've instead been adding support for the SoC to the current driver. I did manage to find the datasheet for this chip (in Chinese), and nothing in it sticks out as different. With it I added the clocks and device tree entries, and the driver mostly loads.

The hardware appears to be dwmac1000/dwmac-3.610 (User ID: 0x10, Synopsys ID: 0x36), and according to the other kernel it also includes a "CreVinn TOE-NK-2G TCP Offload Engine". I've ported it for the most part, which has mostly meant setting up the clocks for it (which I think/hope I did right). Also of note: this device has two GMACs on one controller (and they don't auto-detect correctly).
The kernel that I know works: https://github.com/uyhoangtran/linux-kernel-3.4-hi3535

Now for my actual problem. I am testing by netbooting with NFS over TCP. Right now the interface comes up, sends out DHCP and gets configured, and then only partly works: it sends out some packets, but not all of the ones it should be sending. It mounts my server, and on my NFS server I see many TCP packets being exchanged with it, and then it abruptly stops while my server keeps retransmitting, trying to get it back. Eventually I get the following error:

[ 244.050983] ------------[ cut here ]------------
[ 244.063088] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x234/0x238
[ 244.084632] NETDEV WATCHDOG: eth0 (stmmaceth): transmit queue 0 timed out
[ 244.102332] CPU: 0 PID: 0 Comm: swapper Tainted: G W 4.19.0_hi3535-00055-g6218d4e6de03-dirty #455
[ 244.128833] Hardware name: Generic DT based system
<snip the backtrace>

My debugging has shown that adding a pr_warn() anywhere within stmmac_xmit() mostly solves the problem, and it doesn't matter where in that function (first line or last line gives the same result). I thought this indicated some sort of race, but I've tried placing memory barriers all over that function and it changes nothing.

I've also found that this seems to happen when netdev_tx_sent_queue() is called and decides that the tx queue should be stopped. After that, the tx queue doesn't seem to be restarted, and I don't know why. It also appears that the next time stmmac_tx_clean() gets called, it doesn't find all the bytes that the previous stmmac_xmit() sent (it is usually one to three packets short).

I am basically out of ideas, other than switching to the latest 5.0 git branch, but I don't see anything there that looks like it would fix this (no major changes in the stmmac driver at least; I went through every commit between 4.19 and 5.0 and didn't see anything important). I suppose I'll try it next.
So, my two leading theories:

#1 Some sort of race with the DMA transfers. But DMA memory barriers already exist before all the important operations, and the driver already works on other systems, so I assume that part is OK. Also, the old working driver didn't make major changes with respect to these barriers (and I tried the changes it did make).

#2 Some sort of issue with how the netdev_* functions work. My investigation showed the queue is stopped because the BQL queue runs negative, and there is a CONFIG_BQL option around all that code. But if that were the cause, I'd expect other drivers to have the same problem, and I can find nothing on that issue. I can't seem to find where CONFIG_BQL is enabled, so I assume it's required.

Does anyone have any idea how I can debug this? I feel like there is something obvious I'm missing. I can absolutely share everything I have if someone wants to look through the changes I made; I just haven't gotten around to hosting it somewhere yet. Is there something different about SoCs that I need to do?
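
To make the byte-accounting theory a bit more concrete, here is a toy userspace model of how I understand the bookkeeping between the xmit path and the clean path. It is only a sketch and not the kernel's code: every name in it (toy_bql, toy_sent, toy_completed, toy_run) is made up, the limit is fixed rather than adaptive, and the model keeps "sending" even while stopped to stay short. It only shows that if the clean path reports fewer bytes than the xmit path queued, the available budget can stay negative and the queue is never restarted.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; NOT the kernel's BQL code. All names are hypothetical. */
struct toy_bql {
    int limit;          /* fixed byte limit (an assumption; the real one adapts) */
    int num_queued;     /* total bytes reported by the xmit path */
    int num_completed;  /* total bytes reported by the clean path */
    bool stopped;       /* models the tx queue being stopped */
};

/* Bytes that may still be queued; negative means over the limit. */
static int toy_avail(const struct toy_bql *q)
{
    return q->limit + q->num_completed - q->num_queued;
}

/* What the xmit path would report (compare netdev_tx_sent_queue()). */
static void toy_sent(struct toy_bql *q, int bytes)
{
    q->num_queued += bytes;
    if (toy_avail(q) < 0)
        q->stopped = true;
}

/* What the clean path would report (compare netdev_tx_completed_queue()). */
static void toy_completed(struct toy_bql *q, int bytes)
{
    /* Cannot complete more bytes than were queued. */
    assert(bytes <= q->num_queued - q->num_completed);
    q->num_completed += bytes;
    if (toy_avail(q) >= 0)
        q->stopped = false;
}

/*
 * Queue `sent` packets of `len` bytes, complete all but `lost` of them,
 * and return whether the queue is still stopped afterwards.
 */
bool toy_run(int limit, int len, int sent, int lost)
{
    struct toy_bql q = { .limit = limit };
    int i;

    for (i = 0; i < sent; i++)
        toy_sent(&q, len);
    for (i = 0; i + lost < sent; i++)
        toy_completed(&q, len);
    return q.stopped;
}
```

In this model, toy_run(3000, 1500, 4, 0) returns false (the queue is restarted once everything completes), while toy_run(3000, 1500, 4, 3) returns true: with three of four packets never reported as completed, the budget stays negative and nothing ever restarts the queue.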