Adam Brenner wrote:
> The issue I am facing is that at random, unrepeatable times, the server
> locks up and requires a reboot. However, none of the system generated
> logs in /var/logs/ report any kernel panic or memory dump. I have ran a
> number of grep commands and even manually spent time tracing the logs at
> the time of day and nothing shows up. I ended up performing a chassis
> swap (replaced motherboard, CPU, memory, PSU, etc). Yet, this still occurs.

What type of hardware is your "server"? I often use inexpensive
consumer grade systems as servers. I am using a Raspberry Pi as a
server for one application. (Hard to be greener than 2.5 watts for that
particular need.) But the hardware in those is not as robust as
hardware designed for enterprise environments, with ECC throughout and
other high quality components. I still use them for that function, but
the word "server" can mean many different things to different people.
What type of hardware do you have?

> This leads me to believe that the Rsyslog is not accurately logging
> messages.

I have been in the same unfortunate situation many times. Machines
crash. Nothing in the logs. In the fortunate cases where I had a
physical console, about half the time there was a kernel panic message
on the console. The other half of the time nothing useful was printed
there either.

That illustrates the problem with relying upon syslog. Syslog is good
for reporting userland events. But syslog is not very good for
reporting kernel panics. When the kernel has fallen down, userland
stops running, and syslog is just another userland program. Syslog
will stop running too.

> Is it the "delayed" logging "dashes" a cause of the no logs?

Not in my experience. Nothing you change there will have any effect.
Instead I would monitor the hardware console if that is possible or
practical. Does your server have remote console capability such as LOM
or iLO? (https://en.wikipedia.org/wiki/Lights_out_management)

> Anyone have ideas about this?

There are many ways that things can fail. Without knowing how your
system has failed it is impossible to say anything intelligent about
it. It could be anything. And unfortunately I have been in the same
place many times myself over the years and I don't have any great
advice for debugging it either.

But problems like this are one of the reasons that enterprise customers
feel justified in paying so much money for high quality hardware with a
support contract.
That way they have someone to call and complain to, and someone to swap
hardware until the problem stops.

Since you have swapped the hardware and still have the problem, I would
assume it is a software issue and not a hardware issue. I would try an
older kernel. Instead of the newest 3.14.5-1 I would try the 3.2.0
kernel from Wheezy. I would also try the still supported 2.6.32 kernel
from Squeeze. Recent upstream Linux development has brought large
changes in what hardware is well supported. The older kernels might
work better. Unfortunately, if they do, then you are still faced with
the problem of how to upgrade once support for those older kernels
expires. But that may be better than the alternative.

You said you migrated from RHEL/CentOS. Was that on this same hardware
or different hardware? If the same hardware, then I would suspect that
the newer kernels are the problem.

As another thing to think about, I know some people run a very slim
dom0 system on the bare hardware and then devote the rest of the
machine to the domU guest system. Basically they use virtualization to
create an insulating layer around everything. I am just throwing that
out there as a brainstorm idea. I would consider it if the problem were
a software one on otherwise known good hardware. However, it would be a
large paradigm shift and not a trivially easy thing to switch on
underneath your existing system.

Bob