Hello David,

>> When you say server crashes - I assume kernel crashed and is not functioning (as opposed to
>> kernel is up and running but network is not doing what I want)? Does server respond to any
>> keyboard or mouse movement on console?
>
> It's hard to know for sure exactly what's happening, as the server is headless. It does have a
> KVM-over-IP hooked up to it, though. But when I try to use the KVM when the server gets in this
> state, it doesn't show anything on the screen, or respond to any keyboard or mouse input. (At
> least not that I can see.) Eventually I wind up having to just power off the machine.
>
Does your server support IPMI and thus SOL (serial over LAN)? If so, you could configure your
server to also provide a login on the serial device, which it will also use to log kernel messages.

- Configure SOL in EFI
- Add "console=tty console=ttyS0,115200n8" (if SOL uses ttyS0/COM1, else use ttyS1) to the
  kernel command line (how to do this depends on your boot mechanism)
- Configure BMC credentials and privileges for SOL
- Use "ipmitool -I lanplus -U $USER -P $PASSWORD -H $SERVERIP sol activate" from another host
  in your network to connect. You can exit with "<enter> ~ ."

Alternatively use "ipmiconsole -u $USER -p $PASSWORD -h $HOST" from the freeipmi package and exit
with "ctrl+esc & ." (if I remember correctly).

In order to catch those hangs, it would then be best to have SOL connected and logged around the
clock, e.g. by using a Raspberry Pi and Conserver (https://aur.archlinux.org/packages/conserver/).

Regards,
Uwe

>
>> What do kernel logs show leading up to the crash? If the kernel oopsed it is often logged.
>> Prior log messages may be helpful in tracking down that problem.
>
> When the crash happened this time I went back and looked at the logs and system journal. Last log
> message was at 2:59:01am. Timing is a little suspicious, as I do have some cron jobs that kick off
> at 3am. However, the first time the crash happened it was during the day, so I don't think 3am cron
> jobs could be solely responsible.
>
>
>> Is server running desktop or no graphics?
>
> It's mostly a server. But I do have a desktop manager running (using the CPU's onboard graphics)
> and connected to the KVM in case I need to log in. (Which I rarely do. I mostly SSH in.)
>
>
>> Your network topology - Is the Netgear routing to internet or is it your new server or a
>> separate firewall? Is your internet ipv4 or ipv6 or both? Same question - is internet firewall
>> routing (ipv6) or NAT'ing (ipv4)?
>
> Netgear router is home LAN's gateway to the internet.
> It uses a combination of ipv4 and v6.
>
>
>> Is there a periodicity to the problem? Like say dhcp lease expiration length or something?
>
> Not that I can tell. Again, the issue only happened twice - once at around 10am, and the second
> time at (it looks like) 3am. And there was probably a good week or so between the 2 events.
>
>
>> Can we assume that Netgear has been working fine with same configs before new server deployed?
>
> Yep. Router had been pretty solid up until this point. Plus the fact that unplugging the server's
> network cable makes the issue go away leads me to believe the problem isn't the router.
>
>
>> Assuming the netgear is your firewall / router to internet - When things are "broken" can internal
>> clients see each other (say ping from one client to another) or is all internal traffic hung up as
>> well as internet traffic?
>
> Anything using an ethernet port on the router pretty much goes dead. That includes the new server
> itself, a network printer, the upstream modem that the router is connected to, a POE wifi extender
> in the other room, etc. IIRC, it might have still been possible to connect to the router using wifi
> during the outage. (I.e., it would respond and assign an IP address with DHCP.) But with the
> upstream modem unreachable, having a wifi connection wasn't of much use.
>
>
>> And in that vein, sorry for obvious, but I'll ask anyway - can I assume only 1 server (or kea
>> with hot-standby) is providing dhcp service?
>
> Yep, only 1 machine on the network handing out IP addresses - the router itself.
>
>
>> Also I notice that latest stable dd-wrt on website is r44715 and your build seems to be beta
>> from last July - I note there are newer builds of the beta - I have no view on the firmware just
>> making an observation.
>
> Yes, I am definitely a bit behind on dd-wrt updates. But the version I'm running has been quite
> stable up till now, so I didn't see any urgency to update.
>
>
> One thing I did notice after doing a bit more digging on this issue: although the r8169 network
> module does seem to work with the mobo's onboard network chip (RTL8125B), that's not technically
> the right driver for it. There's an r8125 module that's not part of the kernel, which is available
> on AUR. (https://aur.archlinux.org/packages/r8125-dkms/) I've switched over to start using that
> (and blacklisted r8169). I've also upgraded to the most recent kernel (5.16.2). So I'm watching to
> see if either/both of those changes eliminate the issue.
>
>
> I was mostly posting to the list really to ask if anyone had heard of such a thing as a crashed
> server somehow either sending screwed up network packets or flooding the network in such a way that
> it could render a router/switch inoperable. From the limited amount I know of networking I think
> that might be possible. But I don't know exactly how one would remedy something like that. I guess
> fixing the underlying issue that is crashing the server would be the way to do that, but I haven't
> been able to pin down the cause yet.
>
> Anyway, thanks again for the response, and I appreciate the debugging tips. If any other ideas come
> to mind, please LMK
>
>
> Thanks,
>
> DR
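
P.S. In case it helps with the kernel command line step from my reply above, here is a minimal
sketch of a shell helper that appends the serial console arguments to an existing command line
string unless a console on that port is already present. The function name add_serial_console is
made up for illustration, and the default of ttyS0 is only an assumption (use ttyS1 if that is
where your BMC routes SOL). It only prints the resulting string; putting it into your boot loader
configuration still depends on your boot mechanism.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): print a kernel command line
# with the serial console arguments appended, unless a console on the given
# port is already configured.
#   $1 = existing kernel command line
#   $2 = serial port, defaults to ttyS0 (assumption; use ttyS1 if SOL is on COM2)
add_serial_console() {
    cmdline=$1
    port=${2:-ttyS0}
    case " $cmdline " in
        *" console=$port"*)
            # A console on this port is already there; leave the line unchanged.
            printf '%s\n' "$cmdline"
            ;;
        *)
            # Keep the local VT as a console and add the serial port as well.
            printf '%s\n' "$cmdline console=tty console=$port,115200n8"
            ;;
    esac
}
```

For example, add_serial_console "quiet" ttyS1 prints the line with the ttyS1 variant appended,
which you can then copy into whatever your boot loader uses for kernel options.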