On 04/07/14 09:02, Schweiss, Chip wrote:
> Were you able to resolve the cause of this?
>
> I found this thread today when one of my servers started having the same
> problem.  Hard lockups with nearly the same message on the console.
>
> My affected server is a Supermicro 6037R-TXRF with the X9DRX+-F
> motherboard.  It has 2 Intel nics.  The console message reads:
>
> received unsolicited ack for DL_UNITDATA_REQ on igp0arl_dlpi_pending
>
> It repeats 4 times and the system locks.
>
> This system is completely unloaded and the problem occurs after only a few
> minutes of uptime.
>
> Thanks for any additional info you can provide.

You replied to my message, and I don't think you were expecting an
answer from me, but just in case: no, I haven't seen any resolution to
this posted anywhere.

I assume you typed the message above by hand.  It says "igp" in your
message, but I know of no such driver on any Solaris derivative.  The
closest match is "igb" (not "igp"), which is supposed to bind to the
Intel i350 hardware.  I believe that's what your motherboard has, but I
don't have that motherboard, nor do I have the output of something
definitive like "scanpci," so I can't tell for sure what's there.

Looking at the code, it seems that this message could possibly be
generated for exactly two cases:

  - driver sent DL_ERROR_ACK with dl_error_primitive == 7

  - driver sent DL_OK_ACK with dl_correct_primitive == 7

(In all other cases, the primitive value we print is hard-wired to
DL_NOTIFY_REQ, DL_INFO_REQ, or DL_BIND_REQ, so the message you see
wouldn't have said "DL_UNITDATA_REQ.")

In either event, we sure weren't expecting that to happen.  My guess
(and it's just a guess based on roughly 9 years working on that code in
the past) is that the driver is erroneously sending DL_ERROR_ACK due to
some internal race condition.  It shouldn't be doing that, as all of
these interfaces are supposed to be in DLPI "connectionless service"
mode.

Either way, it looks like the code inside ARP (buried in the IP module
for somewhat historical reasons) does something entirely reasonable: it
discards the message and drives on.  I don't see how that code could
cause the system to lock up.  It might cause an affected Ethernet
interface to become non-responsive if the IP module and the DLPI driver
were out of sync about the current state (e.g., the driver thought it
replied to a request that required a response, but the response was
malformed so the IP module dropped it), but it shouldn't cause the
system itself to lock up.

In any event, it sounds like a driver bug of some sort.
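
To make the two cases listed earlier concrete, here is a minimal sketch
of the check they describe.  It is not the ARP code from the IP module.
The function names are hypothetical, the primitive numbers are copied by
hand rather than taken from <sys/dlpi.h> (so verify them against the
real header), and the message is modeled as a plain array of 32-bit
words, on the assumption that both DL_OK_ACK and DL_ERROR_ACK carry the
acknowledged request primitive in the second word.

```c
#include <stddef.h>
#include <stdint.h>

/* DLPI primitive numbers, written out by hand for this sketch. */
#define DL_INFO_REQ      0x00
#define DL_BIND_REQ      0x01
#define DL_ERROR_ACK     0x05
#define DL_OK_ACK        0x06
#define DL_UNITDATA_REQ  0x07   /* the "7" in the two cases above */

/*
 * Hypothetical helper: return the request primitive that an ack refers
 * to, or -1 if the message is too short or is not an ack at all.
 *
 *   word 0: dl_primitive (DL_OK_ACK or DL_ERROR_ACK)
 *   word 1: dl_correct_primitive (OK) or dl_error_primitive (ERROR)
 */
long acked_request(const uint32_t *msg, size_t nwords)
{
    if (msg == NULL || nwords < 2)
        return -1;
    if (msg[0] == DL_OK_ACK || msg[0] == DL_ERROR_ACK)
        return (long)msg[1];
    return -1;
}

/*
 * Hypothetical helper: nonzero when the message is an ack for
 * DL_UNITDATA_REQ, a request that should never be acked in
 * connectionless mode.
 */
int is_unitdata_ack(const uint32_t *msg, size_t nwords)
{
    return acked_request(msg, nwords) == DL_UNITDATA_REQ;
}
```

A driver instrumented along these lines could log every outbound
message for which is_unitdata_ack() returns nonzero, which would show
whether the driver really is the source of the stray ack.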

If careful dtrace work and source code inspection don't reveal the
problem, then someone who has a system experiencing this problem is
going to have to instrument the driver and find out what's going on.

--
James Carlson         42.703N 71.076W         <[email protected]>

_______________________________________________
OpenIndiana-discuss mailing list
[email protected]
http://openindiana.org/mailman/listinfo/openindiana-discuss
