On Mon, Jul 15, 2019 at 08:55:35AM +0200, Raimo Niskanen wrote:
> On Tue, Jul 02, 2019 at 05:13:43PM +0000, Stuart Henderson wrote:
> > On 2019-07-02, Raimo Niskanen <[email protected]> wrote:
> > > Hi misc@!
> > >
> > > If anyone has got some tips about how to debug two hanging machines we
> > > have in our test lab I am eager to learn.
> > >
> > > The machines run 6.5, amd64 and are patched up to 005_libssl using
> > > M:Tier's openup. Other than that they are rather different, one small
> > > Zotac ZBox-AD02 with AMD E-350 at 1.6 GHz, and one rack mounted Dell
> > > PowerEdge R230 with Intel Xeon E3-1220.
> > >
> > > The overall symptoms are that it is possible to switch screens using
> > > Alt+Ctrl+F1..Fn, but when logging in as root the greeting prints but no
> > > prompt. Alt+Ctrl+Del does not work. The power button does not work. I
> > > have to long press the power button to force power off.
> > >
> > > This happens during our nightly tests, which are quite resource
> > > intensive.
> > >
> > > In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
> > > possibly before the machines become unresponsive, but these entries appear
> > > many more times before that point. And after this "table is full" message
> > > there are many syslog entries; on one machine smartd constantly complains
> > > about an unreadable (pending) sector and atascsi_passthru_done timeout,
> > > and on the other the kernel complains about a probed monitor but
> > > no|invalid EDID.
> > >
> > > So it seems the machine is out of some resource and fails to spawn a login
> > > shell. Any clues to how I can find more details and a remedy? I suspect
> > > a full process table, but wonder how to detect and|or avoid that.
> > >
> > > I have considered having systat running on a console screen but do not
> > > know which systat display might tell me anything.
> > >
> > > Best regards
> >
> > "/bsd: proc: table is full" means that the process table is full, but it
> > doesn't tell you what caused this.
> >
> > The process table size is controlled by kern.maxproc, it is possible
> > that the default is insufficient for your needs, but it's also possible
> > that there was a build-up of processes that didn't exit due to another
> > problem on the system.
> >
> > I would leave top(1) running on the system, and also save "ps ax" output
> > regularly, then look at that output in the run-up to a failure, to see
> > if that gives clues.
> >
>
> It seems that the full process table is a secondary symptom, and that there
> is something else that happens on the machines a few hours before the
> process table fills...
>
> On one machine I had left "systat pigs" running, and the last thing it
> showed was about 90% for softnet and the rest <idle>, IIRC.
>
> I have now corrected a presumably unrelated error in our nightly tests that
> occurred just before the freeze. The test started a child process that was
> abandoned, and when it noticed its controlling socket close it started to
> write an error log. Previously that froze sometimes and a few hours later
> the process table got full. Now the child process is not abandoned, and
> I have not seen the freeze since...
>
> Still chasing ghosts, this can simply not be over yet.

A new hang, which I tried to investigate:

On July 19 the last log entry from my 'ps' log was from 14:55, which is
also the time on the 'systat vmstat' screen when it froze.

Then the machine hums along, but just after midnight at 00:42:01 the first
"/bsd: proc: table is full" entry appears. That message repeated until I
rebooted the machine today at July 29 10:48.

I had a terminal with top running. It was still updating. It showed about
98% sys and 2% spin on one of 4 CPUs, the others 100% idle. Then (after the
process table had gotten full) it had 1282 idle processes and 1 on
processor, which was 'top' itself.
Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.

I had 8 shells under tmux ready for debugging. 'ls' worked. 'systat' on one
hung. 'top' on another failed with "cannot fork". 'exec ps ajxww' printed
two lines with /sbin/init and /sbin/slaac and then hung. 'exec reboot' did
not succeed. Neither did a short press of the power button, which at least
caused a printout "stopping daemon nginx(failed)", but got no further.
I had to do a hard power off.

My theory now is that our daily tests right before 14:55 started a process
(this process is the topmost process in the 'top' display, with 10:14
execution time) that triggers a lock or other contention problem in the
kernel, which causes one CPU to spin in the system and blocks processes
from dying. About 10 hours later the process table gets full.

Any, ANY ideas of how to proceed would be appreciated!

Best Regards
-- 
Raimo Niskanen, Erlang/OTP, Ericsson AB

