On Tue, 18 Feb 2014, Zhang Weiwu wrote:

I have excluded another possibility.

I am thinking:

        1) perhaps the message in /var/log/messages is not produced by init,
        but by reboot/halt/shutdown, and

        2) perhaps init is not invoked at all.

So I ran 'init 6' as root. This time there was no new message in /var/log/messages, proving 1), and 'init 6' did absolutely nothing, disproving 2).

I was wrong. 'init 6' behaves differently from reboot/halt/shutdown. It did shut down a lot of services; my last post was sent a few seconds too early.

Among the services that 'init 6' shut down (and which reboot/halt/shutdown failed to stop) are:

- (in /etc/rc6.d) apache2
- (in /etc/rc6.d) mysql
- (in /etc/rc6.d) exim4

The services 'init 6' did NOT shut down are:

- portmap (stopping it manually with "/etc/rc6.d/K06portmap stop" worked)
- networking (because I can still establish new ssh connections to this server)
- rsyslogd

I have:

$ ls /etc/rc6.d/
K01apache2                K02mysql         K06portmap     K10lvm2
K01atd                    K03sendsigs      K07hwclock.sh  K11umountroot
K01exim4                  K04rsyslog       K07networking  K12reboot
K01urandom                K05umountnfs.sh  K08ifupdown    README
K01xe-linux-distribution  K06nfs-common    K09umountfs

My suspicion is that K03sendsigs failed, because the K02* services were terminated while K04* and K06portmap weren't, and K03sendsigs sits in between. ps(1) shows sendsigs still running:

$ ps ax | grep init
    1 ?        Ss     0:39 init [6]
19299 ?        Ss     0:00 /bin/sh /etc/init.d/rc 6
19401 ?        S      0:00 /bin/sh /etc/init.d/sendsigs stop
23319 pts/9    S+     0:00 grep init

So the task is to figure out what sendsigs does and why it hangs.
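The ordering argument above can be checked mechanically. What follows is only a rough sketch: it assumes the K* scripts run in the lexical order of their names, and scripts_around_sendsigs is a made-up helper name, not something shipped with sysvinit.

```shell
# List the K* scripts of a runlevel directory in lexical order and mark
# which ones sort before and after sendsigs.
# scripts_around_sendsigs is a hypothetical helper; the directory
# defaults to /etc/rc6.d.
scripts_around_sendsigs() {
    rc_dir="${1:-/etc/rc6.d}"
    phase="before"
    for path in "$rc_dir"/K*; do
        # Skip the unexpanded pattern if the directory has no K* entries.
        [ -e "$path" ] || continue
        name=$(basename "$path")
        case "$name" in
            K[0-9][0-9]sendsigs)
                phase="after"
                echo "sendsigs $name"
                continue
                ;;
        esac
        echo "$phase $name"
    done
}
```

Running "scripts_around_sendsigs /etc/rc6.d" prints one line per script. Services marked "before" that are gone, together with services marked "after" that are still running, point at sendsigs as the place where the shutdown sequence stalled.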

There is no manual page, so I went the hard way and read its source: it does the "Asking all remaining processes to terminate" thing.

So I suppose some daemon refuses to die, and sendsigs is either waiting for it or failed to kill it forcibly and is thus confused. I looked at /var/run:

$ ls -F /var/run/
apache2/      ldapi@           portmap.pid    screen/        sshd.pid
crond.pid     motd             portmap.state  slapd/         utmp
crond.reboot  mysqld/          rpc.statd.pid  sm-notify.pid  xe-daemon.pid
exim4/        portmap_mapping  rsyslogd.pid   sshd/

portmap was stopped manually, yet portmap.pid is still there. So daemons don't always remove their pid files before they exit, and the files remaining in /var/run do not indicate which daemons refuse to die.
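Instead of reading /var/run by eye, one could test each pid file against the process table. This is a minimal sketch: check_pidfiles is a hypothetical helper name, and it assumes each pid file holds the PID on its first line.

```shell
# For each *.pid file in a directory, report whether the recorded PID
# still exists. check_pidfiles is a hypothetical helper; the directory
# defaults to /var/run.
check_pidfiles() {
    run_dir="${1:-/var/run}"
    for f in "$run_dir"/*.pid; do
        [ -f "$f" ] || continue
        # Keep only the digits of the first line.
        pid=$(head -n 1 "$f" | tr -cd '0-9')
        # kill -0 sends no signal; it only checks that the PID exists.
        # /proc/PID covers processes we are not allowed to signal.
        if [ -n "$pid" ] && { kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; }; then
            echo "alive $(basename "$f") $pid"
        else
            echo "stale $(basename "$f") ${pid:-?}"
        fi
    done
}
```

A "stale" line means the daemon left its pid file behind, as portmap did here. An "alive" line only says that some process has that PID now, not that it is the original daemon.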

Did sendsigs print any error message? There was none in /var/log/syslog or /var/log/messages. Another user reported seeing an error from sendsigs on screen while being unable to find it in either log file, so it is apparently not logged there. I am operating a remote server, so there is no screen for me to see.

His problem may be the same as mine. When he solved it, he posted:
http://forums.debian.net/viewtopic.php?f=5&t=63896

"A check forced of filesystem solved the problem."

I pondered this "check forced of filesystem" for a while; the grammar isn't correct and the sentence makes no sense to me. Does he mean "reboot -f" to force a reboot? I have tried that, and it made no difference compared to "reboot" without "-f". Does he mean manually unmounting all non-root filesystems? My /var/local is the only non-root physical filesystem, and it is in use. 'sudo lsof /var/local' has been hanging for 1 hour, so it remains a mystery which process is using it, but accessing its files is fine and error-free. Besides, there are various *umount* scripts in /etc/rc6.d/ and they are all ordered after sendsigs, so they are not supposed to cause problems until sendsigs finishes.

So, a dead end again. Now, as I browse through the process tree, I find one process that was started 2 weeks ago and should be long dead:

$ ps ax | grep youtu
18380 ?        D      0:03 python /usr/local/bin/youtube-dl

I distantly remember that it had been run on an NFS mount which was jammed and that later, because a normal umount was not possible (the NFS server was gone), I had done a lazy umount:
# umount -l /mnt/nfs

So I believe this one is the culprit. "kill -9" cannot kill it, which confirms my guess. https://wiki.debian.org/Kill says that if you can't kill a process with "kill -9", you should reboot, which brings me back to this problem: chicken or egg first?
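The "D" in the ps output above is the state code for uninterruptible sleep, which is why no signal reaches the process. A small sketch for spotting such processes before attempting a reboot follows; filter_dstate is a hypothetical helper name.

```shell
# Keep only lines whose second field (the STAT column) starts with D,
# i.e. processes in uninterruptible sleep. Expects "PID STAT COMMAND..."
# lines on stdin. filter_dstate is a hypothetical helper.
filter_dstate() {
    awk '$2 ~ /^D/ { print }'
}

# Usage on a live system (header suppressed by the empty column titles):
#   ps -e -o pid= -o stat= -o args= | filter_dstate
```

A process may pass through state D briefly during ordinary disk I/O, so only entries that stay there across repeated runs are suspicious.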

With no way to kill 18380 but to reboot, and no way to reboot but to kill 18380, I instead killed sendsigs with -TERM. The result was trouble: I was immediately kicked out of my ssh session, the server stopped responding to ping, and half an hour later I capitulated and called the datacenter for a cold reboot.

After the server was online again, I immediately did a reboot and it succeeded. So it is very likely that the stalled process 18380 was what blocked reboot/shutdown.

My conclusions so far:

1. If you have an NFS mount and the NFS server is gone, you cannot umount it unless you reboot, and that reboot won't succeed, so you need a cold reboot.

2. You can get an NFS mount out of sight with a lazy umount (umount -l), but it is still there, holding on to any process that uses it. I waited 2 weeks. It could stay there forever.

3. If sendsigs cannot kill every process, killing sendsigs itself doesn't help.



Archive: http://lists.debian.org/alpine.DEB.2.10.1402181152490.4922@lyonesse