On Tue, 18 Feb 2014, Zhang Weiwu wrote:
> I have excluded another possibility.
> I am thinking:
> 1) perhaps the message in /var/log/messages is not produced by init,
> but by reboot/halt/shutdown, and
> 2) perhaps init is not invoked at all.
> So I ran 'init 6' as root. This time, there was no new message in
> /var/log/messages, proving 1), and 'init 6' did absolutely nothing,
> disproving 2).
I was wrong. 'init 6' behaves differently from reboot/halt/shutdown. It did
shut down a lot of services; my last post was sent a few seconds too early.
Among the services 'init 6' shut down (which reboot/halt/shutdown failed to
stop) are:
- (in /etc/rc6.d) apache2
- (in /etc/rc6.d) mysql
- (in /etc/rc6.d) exim4
The services 'init 6' did NOT shut down are:
- portmap (running "/etc/rc6.d/K06portmap stop" manually worked)
- networking (because I can still establish new ssh connections to this server)
- rsyslogd
I have:
$ ls /etc/rc6.d/
K01apache2 K02mysql K06portmap K10lvm2
K01atd K03sendsigs K07hwclock.sh K11umountroot
K01exim4 K04rsyslog K07networking K12reboot
K01urandom K05umountnfs.sh K08ifupdown README
K01xe-linux-distribution K06nfs-common K09umountfs
My suspicion is that K03sendsigs hung, because the K02* service was
terminated, while the K04* service and K06portmap were not. K03sendsigs is in
between. ps(1) shows sendsigs still running:
$ ps ax | grep init
1 ? Ss 0:39 init [6]
19299 ? Ss 0:00 /bin/sh /etc/init.d/rc 6
19401 ? S 0:00 /bin/sh /etc/init.d/sendsigs stop
23319 pts/9 S+ 0:00 grep init
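The reasoning above (everything ordered before the running script is done,
everything after it is still pending) can be sketched as a small shell
function. This is only an illustration: the name rc_progress is made up here,
and it assumes the K* scripts run one at a time in lexical order, which may
not hold on systems that run shutdown scripts in parallel.

```shell
# Hypothetical helper: classify the K* scripts of a runlevel directory as
# "done", "running" or "pending", given the script that ps shows as running.
# Assumption: scripts are executed sequentially in lexical (glob) order.
rc_progress() {
    dir=$1          # e.g. /etc/rc6.d
    current=$2      # e.g. sendsigs
    state=done
    for f in "$dir"/K*; do
        # Skip the unexpanded pattern when the directory has no K* entries.
        [ -e "$f" ] || [ -L "$f" ] || continue
        name=${f##*/}
        case $name in
            K[0-9][0-9]"$current")
                echo "running $name"
                state=pending       # everything after this has not run yet
                ;;
            *)
                echo "$state $name"
                ;;
        esac
    done
}
```

For the listing above one would call it as: rc_progress /etc/rc6.d sendsigs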
So the task is to figure out what sendsigs does and why it hangs.
There is no manual page, so I went the hard way and read its source: it does
the "Asking all remaining processes to terminate" step.
So I suppose some daemon refuses to die, and sendsigs is either waiting for
it, or failed to kill it forcibly and is thus confused. I looked at /var/run:
$ ls -F /var/run/
apache2/ ldapi@ portmap.pid screen/ sshd.pid
crond.pid motd portmap.state slapd/ utmp
crond.reboot mysqld/ rpc.statd.pid sm-notify.pid xe-daemon.pid
exim4/ portmap_mapping rsyslogd.pid sshd/
portmap was manually stopped, yet portmap.pid is still there. So daemons
don't always remove their pid file before they exit, and the files remaining
in /var/run do not indicate which daemons refuse to die.
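Since a leftover pid file proves nothing by itself, one way to tell stale
files from live daemons is to test each recorded PID. A minimal sketch,
assuming the pid files are named *.pid and hold the PID on their first line;
the function name check_pidfiles is made up for this example:

```shell
# Hypothetical helper: for every *.pid file in a directory, report whether
# the PID it records still belongs to a running process.
# Assumption: the first line of each pid file contains just the PID.
check_pidfiles() {
    dir=${1:-/var/run}
    for f in "$dir"/*.pid; do
        [ -f "$f" ] || continue                 # glob matched nothing
        pid=$(head -n 1 "$f" | tr -cd '0-9')    # keep digits only
        # kill -0 sends no signal, it only checks that the process exists;
        # the /proc test covers processes we are not allowed to signal.
        if [ -n "$pid" ] && { kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; }; then
            echo "alive $pid $f"
        else
            echo "stale ${pid:-?} $f"
        fi
    done
}
```

Run as root with: check_pidfiles /var/run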
Did sendsigs print any error message? There were none in /var/log/syslog or
/var/log/messages. Another user reported seeing an error from sendsigs on
screen but not being able to find it in either log file, so it is not logged
there. I am operating a remote server, so there is no screen for me to see.
His problem may be the same as mine. After he solved it, he posted:
http://forums.debian.net/viewtopic.php?f=5&t=63896
"A check forced of filesystem solved the problem."
I meditated for a while on this "check forced of filesystem": the grammar
isn't correct and the whole sentence makes no sense. Does he mean "reboot -f"
to force a reboot? I have tried that and it made no difference compared to
"reboot" without "-f". Does he mean manually umounting all non-root
filesystems? My /var/local is the only non-root physical filesystem, and it
is in use. 'sudo lsof /var/local' hung for 1 hour, so it remains a mystery
which process is using it, but accessing its files is fine and error-free.
Besides, there are various *umount* scripts in /etc/rc6.d/ and they are all
ordered after sendsigs, so they are not supposed to cause problems until
sendsigs finishes.
So a dead end again. Now, as I browsed through the process list, I found one
process that was started 2 weeks ago and should be long dead:
$ ps ax | grep youtu
18380 ? D 0:03 python /usr/local/bin/youtube-dl
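The "D" in the STAT column means uninterruptible sleep, usually a process
blocked in I/O. Rather than spotting such a process by chance, one can list
all of them. A minimal sketch; the function names are made up, and it assumes
a ps that accepts the -e and -o options (as procps does):

```shell
# Hypothetical helpers: list processes in uninterruptible sleep (state D).
# filter_dstate expects lines of the form "PID STAT ELAPSED COMMAND...".
filter_dstate() {
    awk '$2 ~ /^D/ { print }'
}

list_dstate() {
    # One -o per column with an empty header suppresses the header line.
    ps -e -o pid= -o stat= -o etime= -o args= | filter_dstate
}
```

A process that stays in this list for days, like 18380 here, is a likely
candidate for whatever is blocking the shutdown.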
I vaguely remember it had been run on an NFS mount which was jammed, and
later, because a normal umount was not possible (NFS server gone), I had done
a lazy umount:
# umount -l /mnt/nfs
So I believe this one is the culprit. "kill -9" cannot kill it, confirming my
guess. https://wiki.debian.org/Kill says that if you can't kill a process
with "kill -9", you should reboot, which brings me back to this problem:
chicken or egg first?
With no way to kill 18380 but to reboot, and no way to reboot but to kill
18380, I instead killed sendsigs with -TERM. The result was trouble: I was
immediately kicked out of my ssh session, the server stopped responding to
ping, and half an hour later I capitulated and called the datacenter for a
cold reboot. After the server was online again, I immediately did a reboot
and it succeeded.
So it is very likely the stalled process 18380 that blocked reboot/shutdown.
My conclusion so far:
1. If you have an NFS mount and the NFS server is gone, you cannot umount it
unless you reboot, and the reboot won't succeed either, so you need a cold
reboot.
2. You can get an NFS mount out of sight with a lazy umount (umount -l), but
it is still there, holding any process that uses it. I waited 2 weeks. It
could be there forever.
3. If sendsigs cannot kill every process, killing sendsigs itself doesn't
help.
--
Archive: http://lists.debian.org/alpine.DEB.2.10.1402181152490.4922@lyonesse