Hello David,

>>      When you say server crashes - I assume kernel crashed and is not 
>> functioning (as opposed to
>> kernel is up and running but network is not doing what I want)? Does server 
>> respond to any
>> keyboard or mouse movement on console?
> 
> It's hard to know for sure exactly what's happening, as the server is 
> headless.  It does have a
> KVM-over-IP hooked up to it, though.  But when I try use the KVM when the 
> server gets in this state,
> it doesn't show anything on the screen, or respond to any keyboard or mouse 
> input.  (At least not
> that I can see.)  Eventually I wind up having to just power off the machine.
> 

Does your server support IPMI and thus SOL (Serial over LAN)? If so, you
could configure the server to provide a login on the serial device, which
the kernel will then also use to log its messages.

- Configure SOL in EFI
- Add "console=tty console=ttyS0,115200n8" (if SOL uses ttyS0/COM1, else
  use ttyS1) to the kernel command line (how to do this depends on your
  boot mechanism)
- Configure BMC credentials and privileges for SOL
- Use "ipmitool -I lanplus -U $USER -P $PASSWORD -H $SERVERIP sol activate"
  from another host in your network to connect. You can exit with
  "<enter> ~ ."
  Alternatively use "ipmiconsole -u $USER -p $PASSWORD -h $HOST" from the
  freeipmi package and exit with "&." (if I remember correctly).
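
To double-check the kernel command line step after a reboot, something
like the following could be used. It is only a sketch: the function name
has_serial_console is made up for illustration, and it merely looks for a
console=ttyS<N> argument; it does not verify that SOL itself works.

```shell
#!/bin/sh
# has_serial_console CMDLINE
# Hypothetical helper: succeeds if CMDLINE contains a console=ttyS<N> argument.
has_serial_console() {
    case " $1 " in
        *" console=ttyS"[0-9]*) return 0 ;;
        *) return 1 ;;
    esac
}

# The command line of the running kernel is exposed in /proc/cmdline.
if [ -r /proc/cmdline ]; then
    if has_serial_console "$(cat /proc/cmdline)"; then
        echo "serial console present on kernel command line"
    else
        echo "no console=ttyS* argument on kernel command line"
    fi
fi
```

If the second message is printed, the boot loader configuration has not
picked up the new argument yet.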

In order to catch those hangs it would then be best to keep SOL connected
and logged around the clock, e.g. by using a Raspberry Pi and Conserver
(https://aur.archlinux.org/packages/conserver/)
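
If the console output is captured by hand rather than through Conserver, it
helps to timestamp every line so the last kernel message can be matched
against the time of the hang. A minimal sketch, assuming the SOL output
arrives on standard input; stamp_lines is a hypothetical name, and how the
output gets piped in (ipmitool may expect a terminal) is left open here:

```shell
#!/bin/sh
# stamp_lines: prefix each line read from stdin with a UTC timestamp.
# Hypothetical helper for hand-rolled console logging.
stamp_lines() {
    # The "|| [ -n "$line" ]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
    done
}
```

The captured output would then be appended to a file, along the lines of
"<capture command> | stamp_lines >> sol.log".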

Regards,

        Uwe



> 
>>       What do kernel logs show leading up to the crash? If the kernel oopsed 
>> it is often logged.
>> Prior log messages may be helpful in tracking down that problem.
> 
> When the crash happened this time I went back and looked at the logs and 
> system journal.  Last log
> message was at 2:59:01am.  Timing is a little suspicious, as I do have some 
> cron jobs that kick off
> at 3am.  However, the first time the crash happened it was during the day, so 
> I don't think 3am cron
> jobs could be solely responsible.
> 
> 
>> Is server running desktop or no graphics?
> 
> It's mostly a server.  But I do have a desktop manager running (using the 
> CPU's onboard graphics)
> and connected to the KVM in case I need to log in.  (Which I rarely do.  I 
> mostly SSH in.)
> 
> 
>>      Your network topology - Is the Netgear routing to internet or is it 
>> your new server or a
>> separate firewall?  Is your internet ipv4 or ipv6 or both? Same question - 
>> is internet firewall
>> routing (ipv6) or NAT'ing (ipv4)?
> 
> Netgear router is home LAN's gateway to the internet.  It uses a combination 
> of ipv4 and v6.
> 
> 
>>      Is there a periodicity to the problem? Like say dhcp lease expiration 
>> length or something?
> 
> Not that I can tell.  Again, the issue only happened twice - once at around 
> 10am, and the second
> time at (it looks like) 3am.  And there was probably a good week or so 
> between the 2 events.
> 
> 
>>      Can we assume that Netgear has been working fine with same configs 
>> before new server deployed?
> 
> Yep.  Router had been pretty solid up until this point.  Plus the fact that 
> unplugging the server's
> network cable makes the issue go away leads me to believe the problem isn't 
> the router.
> 
> 
>> Assuming the netgear is your firewall / router to internet - When things are 
>> "broken" can internal
>> clients see each other (say ping from one client to another) or is all 
>> internal traffic hung up as
>> well as internet traffic?
> 
> Anything using an ethernet port on the router pretty much goes dead. That 
> includes the new server
> itself, a network printer, the upstream modem that the router is connected 
> to, a POE wifi extender
> in the other room, etc.  IIRC, it might have still been possible to connect 
> to the router using wifi
> during the outage.  (I.e., it would respond and assign an IP address with 
> DHCP.)  But with the
> upstream modem unreachable, having a wifi connection wasn't of much use.
> 
> 
>>      And in that vein, sorry for obvious, but I'll ask anyway - can I assume 
>> only 1 server (or kea
>> with hot-standby) is providing dhcp service?
> 
> Yep, only 1 machine on the network handing out IP addresses - the router 
> itself.
> 
> 
>>       Also I notice that latest stable dd-wrt on website is r44715 and your 
>> build seems to be beta
>> from last July - I note there are newer builds of the beta - I have no view 
>> on the firmware just
>> making an observation.
> 
> Yes, I am definitely a bit behind on dd-wrt updates.  But the version I'm 
> running has been quite
> stable up till now, so I didn't see any urgency to update.
> 
> 
> One thing I did notice after doing a bit more digging on this issue: although 
> the r8169 network
> module does seem to work with the mobo's onboard network chip (RTL8125B), 
> that's not technically the
> right driver for it.  There's an r8125 module that's not part of the kernel, 
> which is available on
> AUR.  (https://aur.archlinux.org/packages/r8125-dkms/) I've switched over to 
> start using that (and
> blacklisted r8169).  I've also upgraded to the most recent kernel (5.16.2).  
> So I'm watching to see
> if either/both of those changes eliminate the issue.
> 
> 
> I was mostly posting to the list really to ask if anyone had heard of such a 
> thing as a crashed
> server somehow either sending screwed up network packets or flooding the 
> network in such a way that
> it could render a router/switch inoperable.  From the limited amount I know 
> of networking I think
> that might be possible.  But I don't know exactly how one would remedy 
> something like that.  I guess
> fixing the underlying issue that is crashing the server would be the way to 
> do that, but I haven't
> been able to pin down the cause yet.
> 
> Anyway, thanks again for the response, and I appreciate the debugging tips.  
> If any other ideas come
> to mind, please LMK
> 
> 
> Thanks,
> 
> DR
