Package: ocfs2-tools
Version: 1.4.1-1
Severity: important

When unmounting an OCFS2 filesystem, the unmount completes, but ocfs2_hb_ctl segfaults before stopping the heartbeat. As a result, the OCFS2 cluster cannot be stopped before rebooting.
Steps to reproduce this situation:

1. Boot the system.

2. Test the /etc/init.d/ocfs2 and /etc/init.d/o2cb scripts by starting and stopping them. They stop successfully. This is what happens when no OCFS2 filesystem has been mounted since the system was booted:

   server3:~# /etc/init.d/ocfs2 stop
   Stopping Oracle Cluster File System (OCFS2) OK
   server3:~# /etc/init.d/o2cb stop
   Stopping O2CB cluster ocfs2-www: OK
   Unmounting ocfs2_dlmfs filesystem: OK
   Unloading module "ocfs2_dlmfs": OK
   Unloading module "ocfs2_stack_o2cb": OK
   Unmounting configfs filesystem: OK
   Unloading module "configfs": OK

3. Start both ocfs2 and o2cb again and mount an OCFS2 filesystem.

   server3:~# mount /usr/ocfs2
   Feb 24 11:36:50 server3 kernel: [ 529.764322] o2net: accepted connection from node server4 (num 3) at 192.168.1.4:7777
   Feb 24 11:36:53 server3 kernel: [ 534.669832] OCFS2 1.5.0
   Feb 24 11:36:53 server3 kernel: [ 534.693645] ocfs2_dlm: Nodes in domain ("7F6283D7CA09400B9ACB687AC1B70088"): 2
   Feb 24 11:36:53 server3 kernel: [ 534.711619] kjournald starting. Commit interval 5 seconds
   Feb 24 11:36:53 server3 kernel: [ 534.714368] ocfs2: Mounting device (254,8) on (node 2, slot 0) with ordered data mode.

4. Unmount the filesystem.

   server3:~# umount /usr/ocfs2
   Feb 24 11:37:10 server3 kernel: [ 556.882125] ocfs2_hb_ctl[4552]: segfault at 0 ip 7fd38e033a90 sp 7fff96729328 error 4 in libc-2.7.so[7fd38dfb9000+14a000]
   Feb 24 11:37:10 server3 kernel: [ 556.884886] ocfs2: Unmounting device (254,8) on (node 2)

The filesystem is unmounted correctly, but the heartbeat is not stopped, so a subsequent attempt to stop the cluster fails:

   server3:~# /etc/init.d/ocfs2 stop
   Stopping Oracle Cluster File System (OCFS2) OK
   server3:~# /etc/init.d/o2cb stop
   Stopping O2CB cluster ocfs2-www: Failed
   Unable to stop cluster as heartbeat region still active

Because the heartbeat could not be stopped, the other nodes print error messages when the server is rebooted, and keep doing so until it is back up.
An example is below:

   Feb 24 11:26:49 server4 kernel: [119051.392388] o2net: connection to node server3 (num 2) at 192.168.1.3:7777 has been idle for 10.0 seconds, shutting it down.
   Feb 24 11:26:49 server4 kernel: [119051.392448] (0,4):o2net_idle_timer:1468 here are some times that might help debug the situation: (tmr 1235467599.500913 now 1235467609.500099 dr 1235467599.500901 adv 1235467599.500918:1235467599.500918 func (dbc4763b:502) 1235467367.658695:1235467367.658701)
   Feb 24 11:26:49 server4 kernel: [119051.392573] o2net: no longer connected to node server3 (num 2) at 192.168.1.3:7777
   Feb 24 11:26:59 server4 kernel: [119067.797445] (3802,4):o2net_connect_expired:1629 ERROR: no connection established with node 2 after 10.0 seconds, giving up and returning errors.

I suspect the problem is in libc-2.7.so, because I see a similar segfault when I stop the multipath daemon (the kernel message there also names libc-2.7.so). In any case, OCFS2 has problems on lenny.

-- System Information:
Debian Release: 5.0
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.26-1-amd64 (SMP w/8 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages ocfs2-tools depends on:
ii  debconf [debconf-2.0]  1.5.24          Debian configuration management sy
ii  libc6                  2.7-18          GNU C Library: Shared libraries
ii  libcomerr2             1.41.3-1        common error description library
ii  libglib2.0-0           2.16.6-1        The GLib library of C routines
ii  libncurses5            5.7+20081213-1  shared libraries for terminal hand
ii  libreadline5           5.2-3.1         GNU readline and history libraries
ii  libuuid1               1.41.3-1        universally unique id library

ocfs2-tools recommends no packages.
Versions of packages ocfs2-tools suggests:
pn  ocfs2console           <none>          (no description available)

-- debconf information:
  ocfs2-tools/heartbeat_threshold: 31
  ocfs2-tools/reconnect_delay: 2000
  ocfs2-tools/init: false
  ocfs2-tools/keepalive_delay: 2000
  ocfs2-tools/clustername: ocfs2
  ocfs2-tools/idle_timeout: 30000
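
The ocfs2_hb_ctl segfault line from the kernel log in step 4 can be decoded a little further, which may help narrow down where in libc the crash happens. The following is only a sketch in Python and is not part of ocfs2-tools. The function name decode_segfault and the regular expression are my own, and they assume the x86-64 log format exactly as it appears in this report. It extracts the faulting address, decodes the low three bits of the page-fault error code, and computes the offset of the instruction pointer inside the mapped object.

```python
import re

# Assumed format of the x86-64 kernel segfault message, taken from the
# log line in this report. The trailing "in OBJECT[base+size]" part is
# treated as optional.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # the kernel prints the error code in hex
    info = {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        # x86 page-fault error code bits:
        "page_present": bool(err & 1),           # bit 0: protection fault on a present page
        "access": "write" if err & 2 else "read",  # bit 1: write access
        "mode": "user" if err & 4 else "kernel",   # bit 2: fault raised in user mode
        "object": m.group("obj"),
        "offset": None,
    }
    if m.group("base") is not None:
        # Offset of the faulting instruction relative to the mapping start.
        info["offset"] = int(m.group("ip"), 16) - int(m.group("base"), 16)
    return info


if __name__ == "__main__":
    line = (
        "Feb 24 11:37:10 server3 kernel: [ 556.882125] ocfs2_hb_ctl[4552]: "
        "segfault at 0 ip 7fd38e033a90 sp 7fff96729328 error 4 "
        "in libc-2.7.so[7fd38dfb9000+14a000]"
    )
    info = decode_segfault(line)
    print(info["comm"], info["mode"], info["access"],
          hex(info["fault_addr"]), info["object"], hex(info["offset"]))
```

For the line in this report, this gives a user-mode read of address 0 (a NULL pointer dereference) at offset 0x7aa90 into libc-2.7.so. With debug symbols for libc available, that offset could be resolved to a function name with a tool such as addr2line, which would show which libc routine ocfs2_hb_ctl was calling with a NULL argument.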