exit status issue
Hello, I've got a script that is periodically performing a disk / raid array integrity check by touching a file. The touch is failing occasionally, with an exit status of 1. If I do 'strace touch' instead of just 'touch' the strace indicates an exit status of 0. So I'm not sure why this is getting propagated back as a 1 to the script. Would appreciate any insight you could offer. Here is my script and the strace output for when touch returns 1. fsaccesstest() { local RETRY_MAX=3 local RETRY_INTERVAL=3 local CHK_RESULT=1 TMP=`grep $1 /proc/mounts|awk '{print $1}'` if [ "$TMP" != "" ]; then for i in `seq 1 ${RETRY_MAX}` do touch $1/.accesstest CHK_RESULT=$? if [ ${CHK_RESULT} -eq 0 ] ; then break fi echo "*** fsaccesstest test $1 failed: status=${CHK_RESULT}. retrying... (${i}) ***" sleep ${RETRY_INTERVAL} done if [ $CHK_RESULT -ne 0 ]; then echo "*** fsaccess test $1 failed. ***" >/dev/console ## try to remount it. It may be busy, so retry if necessary. for j in `seq 1 ${RETRY_MAX}` do umount -f -l $1 # Force a lazy unmount CHK_RESULT=$? if [ ${CHK_RESULT} -eq 0 ]; then break fi sleep ${RETRY_INTERVAL} done if [ ${CHK_RESULT} -eq 0 ]; then mount $TMP $1 CHK_RESULT=$? if [ ${CHK_RESULT} -ne 0 ]; then echo "*** fsaccesstest failed to mount $1. ***" >/dev/console return ${CHK_RESULT} fi else echo "*** fsaccesstest failed to unmount $1. ***" >/dev/console return ${CHK_RESULT} fi fi return 0 fi return 1 } execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux", "SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"..., "HUSHLOGIN=FALSE", "USER=root", "PATH=/bin:/sbin:/usr/bin:/usr/sb"..., "MAIL=/var/spool/mail/root", "C_INCLUDE_PATH=/usr/include:/usr"..., "PWD=/root", "HOME=/root", "SHLVL=2", "LOGNAME=root", "_=/usr/local/bin/strace"]) = 0 brk(0) = 0x1702000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f459915 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f459914f000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_dev=makedev(8, 162), st_ino=90088, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=40, st_size=19548, st_atime=2011/11/15-08:39:02, st_mtime=2011/11/15-08:39:00, st_ctime=2011/11/15-08:39:01}) = 0 mmap(NULL, 19548, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f459914a000 close(3)= 0 open("/lib64/libcrypt.so.1", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\n\0"..., 832) = 832 fstat(3, {st_dev=makedev(8, 162), st_ino=285064, st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=48, st_size=23096, st_atime=2011/11/15-08:10:31, st_mtime=2010/06/29-12:51:27, st_ctime=2011/11/15-08:07:41}) = 0 mmap(NULL, 1257888, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4598f03000 mprotect(0x7f4598f08000, 1044480, PROT_NONE) = 0 mmap(0x7f4599007000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4000) = 0x7f4599007000 mmap(0x7f4599009000, 184736, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4599009000 close(3)= 0 open("/lib64/libm.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360?\0\0"..., 832) = 832 fstat(3, {st_dev=makedev(8, 162), st_ino=285045, st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=1064, st_size=538200, st_atime=2011/11/15-08:10:31, st_mtime=2010/06/29-12:51:27, st_ctime=2011/11/15-08:07:41}) = 0 mmap(NULL, 1569000, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4598d83000 mprotect(0x7f4598e01000, 1048576, PROT_NONE) = 0 mmap(0x7f4598f01000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7e000) = 0x7f4598f01000 close(3)= 0 open("/lib64/libc.so.6", O_RDONLY)
Re: exit status issue
On Fri, Nov 18, 2011 at 6:16 AM, Geir Hauge wrote: > Add ''set -x'' at the start of the function and examine the output to see if > it actually runs touch from PATH. The strace output is showing that the correct 'touch' is being executed. [pid 6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"], [/* 14 vars */]) = 0 Process 6748 attached Process 13686 suspended [pid 6748] execve("/usr/bin/logger", ["logger", "-s", "-t", "diskmon", "-p", "local0.info", "*** fsaccesstest test /mnt/array"...], [/* 14 vars */]) = 0 Process 13686 resumed Process 6747 detached [pid 13686] --- SIGCHLD (Child exited) @ 0 (0) --- Process 13686 suspended diskmon: *** fsaccesstest test /mnt/array1 failed: status=1. retrying... (1) *** Process 13686 resumed Process 6748 detached --- SIGCHLD (Child exited) @ 0 (0) --- Notice that the child process pid=6747 exits with status 0, but $?=1 after returning from executing touch. This triggers the failure. I just upgrade my bash from version 3.21 to 4.2 and I'm still seeing this problem. I am seeing this failure whether I execute the 'touch' command or the 'stat' command. Both of these are incidentally busybox implementations. ls -al /bin/touch lrwxrwxrwx1 root root7 Nov 15 08:07 /bin/touch -> busybox ls -al /bin/stat lrwxrwxrwx1 root root7 Nov 15 08:07 /bin/stat -> busybox Are there any bash debugging tools available that I can use to halt on a data breakpoint and examine state? Thanks, Dallas
Re: exit status issue
On Fri, Nov 18, 2011 at 2:07 PM, Greg Wooledge wrote: >> Notice that the child process pid=6747 exits with status 0 > > Where does it say that? How are you even invoking strace here? I don't > know how to read this output, nor do I know how to reproduce it. [pid 6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"], [/* 14 vars */]) = 0 The = 0 at the end is the exit status. I know it's complicated but this script is run as a daemon. Was using strace with -f option to follow through forks, i.e. strace -f -e execve ./diskmon_exec.sh start > /mnt/array1/share/strace.txt 2>&1 >> I just upgrade my bash from version 3.21 to 4.2 and I'm still seeing >> this problem. > > Simplify the script, and reproduce the problem with something we can > actually read and understand. For example, > > #!/bin/bash > f() { > touch /root/nopermission > echo "\$? is $?" > } > f > If you can't reproduce the problem in the simpler script, then you know > the problem is being triggered by something in the more complex script. > That will help you narrow down the cause. > Yes, I agree this is the right way to approach it. However, I have not been able to reproduce the problem with a simple function as you have shown. It is almost definitely interaction with the rest of what is going on in the complicated script. In fact, as I comment out portions of it, the problem goes away. But I can't put my finger on any one thing that is the root cause. It feels like something is getting clobbered in the execution context. This failure does not happen every time this function is executed. It is very hit and miss. Currently I am running bash -x to see if I can get it to happen that way.
Re: exit status issue
On Fri, Nov 18, 2011 at 2:35 PM, Greg Wooledge wrote: > On Fri, Nov 18, 2011 at 02:23:54PM -0600, Dallas Clement wrote: >> [pid 6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"], >> [/* 14 vars */]) = 0 >> >> The = 0 at the end is the exit status. > > That is the return value of the execve() call. All it means is that > the /bin/touch program was successfully launched. > > $ strace -f bash -c false > ... > execve("/bin/bash", ["bash", "-c", "false"], [/* 32 vars */]) = 0 > ... > > arc3:~$ bash -c false; echo $? > 1 > Yes, you are sure right on all counts. The (0) in SIGCHLD (Child exited) @ 0 (0) is clearly not the exit status of the child either. Proven by a simple test script. #!/tmp/bash3 touch /bogus/file echo $? - root@TS-2RVED8:/tmp# strace -f -e execve ./test.sh execve("./test.sh", ["./test.sh"], [/* 15 vars */]) = 0 Process 7269 attached Process 7268 suspended [pid 7269] execve("/bin/touch", ["touch", "/bogus/file"], [/* 14 vars */]) = 0 touch: /bogus/file: No such file or directory Process 7268 resumed Process 7269 detached --- SIGCHLD (Child exited) @ 0 (0) --- 1 Process 7268 detached Thanks for setting me straight guys. I was misled. It looks like the 'touch' command really could be failing in my script.
Re: exit status issue
It turns out that my 'touch' command is actually getting executed, but not when I expect it. Here is a recap of my script function which is occasionally misbehaving: # $1 mntpoint fsaccesstest() { local RETRY_MAX=3 local RETRY_INTERVAL=3 local CHK_RESULT=1 TMP=`grep $1 /proc/mounts|awk '{print $1}'` if [ "$TMP" != "" ]; then for i in `seq 1 ${RETRY_MAX}` do touch $1/.accesstest CHK_RESULT=$? if [ ${CHK_RESULT} -eq 0 ] ; then break fi logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** fsaccesstest test $1 failed. retrying... (${i}) ***" sleep ${RETRY_INTERVAL} done if [ $CHK_RESULT -ne 0 ]; then logger -t ${LOGTAG} -p ${LOGFACILITY} "*** fsaccesstest test $1 failed. ***" echo "*** fsaccess test $1 failed. ***" >/dev/console ## try remount it umount $1 mount $TMP $1 return $? fi return 0 fi return 1 } Most of the time this function executes the 'touch' command without errors as shown in the following strace output. The touch command make just three system calls to open(), close(), and utimes() which you can clearly see. Success Scenario [pid 23930] execve("/tmp/test-touch", ["/tmp/test-touch", "/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux", "SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"..., "HUSHLOGIN=FALSE", "USER=root", "PATH=/bin:/sbin:/usr/bin: [pid 23930] brk(0) = 0x18d2000 [pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9676000 [pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9675000 [pid 23930] access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) [pid 23930] open("/etc/ld.so.cache", O_RDONLY) = 3 [pid 23930] fstat(3, {st_dev=makedev(8, 162), st_ino=90088, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=40, st_size=19548, st_atime=2011/11/19-08:39:02, st_mtime=2011/11/15-08:39:00, st_ctime=2011/11/ [pid 23930] mmap(NULL, 19548, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4ab967 [pid 23930] close(3)= 0 [pid 23930] open("/lib64/libc.so.6", O_RDONLY) = 3 [pid 23930] read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\333\1\0"..., 832) = 832 [pid 23930] fstat(3, {st_dev=makedev(8, 162), st_ino=285057, st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=2616, st_size=1331216, st_atime=2011/11/19-08:10:31, st_mtime=2010/06/29-12:51:27, st_ctime=201 [pid 23930] mmap(NULL, 2343048, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4ab932 [pid 23930] mprotect(0x7f4ab9454000, 1044480, PROT_NONE) = 0 [pid 23930] mmap(0x7f4ab9553000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x133000) = 0x7f4ab9553000 [pid 23930] mmap(0x7f4ab9558000, 16520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9558000 [pid 23930] close(3)= 0 [pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab966f000 [pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab966e000 [pid 23930] arch_prctl(ARCH_SET_FS, 0x7f4ab966e6f0) = 0 [pid 23930] mprotect(0x7f4ab9553000, 12288, PROT_READ) = 0 [pid 23930] munmap(0x7f4ab967, 19548) = 0 [pid 23930] open("/mnt/array1/.accesstest", O_RDWR|O_CREAT, 0666) = 3 [pid 23930] close(3)= 0 [pid 23930] utimes("/mnt/array1/.accesstest", {4196394, 0}) = 0 [pid 23930] exit_group(0) = ? Process 23756 resumed Process 23930 detached <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23930 --- SIGCHLD (Child exited) @ 0 (0) --- In the failure scenario, I had wrongly assumed that the touch command was not being executed, when in fact it is. It's just that the execution is deferred somehow and occurs after the call to 'logger', which would not have been executed had the call to 'touch' somehow failed or returned too soon. So it looks I am being afflicted with out of order execution problems. I am expecting that if my script makes a call to 'touch' it should not return until the command has been completely executed, i.e. the child process reaped, and exit status determined. This doesn't seem to be happening. Please forgive for the volume of data. Failure Scenario [pid 26415] execve("/tmp/test-touch", ["/tmp/test-touch", "/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux", "SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"..., "HUSHLOGIN=FALSE", "USER=r
Re: exit status issue
> That makes no sense. Fix your quoting (it's atrocious) and then run > the function with "set -x". Don't throw strace at it, at this point. > It's really not the right tool at this level of debugging. Okay, I simplified the script slightly, fixed the quoting and re-ran with set -x. Here is the script and the output for the failure scenario. # $1 mntpoint fsaccesstest() { set -x local RETRY_MAX=3 local RETRY_INTERVAL=3 local CHK_RESULT=1 local RETVAL=0 TMP=`grep "$1" /proc/mounts|awk '{print $1}'` if [ "$TMP" != "" ]; then for i in `seq 1 ${RETRY_MAX}` do touch "$1"/.accesstest CHK_RESULT=$? if [ ${CHK_RESULT} -eq 0 ] ; then break fi logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** fsaccesstest test $1 failed CHK_RESULT=$CHK_RESULT. retrying... (${i}) ***" sleep ${RETRY_INTERVAL} done if [ $CHK_RESULT -ne 0 ]; then RETVAL=$CHK_RESULT fi fi set +x return $RETVAL } + local RETRY_MAX=3 + local RETRY_INTERVAL=3 + local CHK_RESULT=1 + local RETVAL=0 ++ grep /mnt/array1 /proc/mounts ++ awk '{print $1}' + TMP=/dev/md10 + '[' /dev/md10 '!=' '' ']' ++ seq 1 3 + for i in '`seq 1 ${RETRY_MAX}`' + touch /mnt/array1/.accesstest + CHK_RESULT=1 + '[' 1 -eq 0 ']' + logger -s -t diskmon -p local0.info '*** fsaccesstest test /mnt/array1 failed CHK_RESULT=1. retrying... (1) ***' diskmon: *** fsaccesstest test /mnt/array1 failed CHK_RESULT=1. retrying... (1) *** + sleep 3 + for i in '`seq 1 ${RETRY_MAX}`' + touch /mnt/array1/.accesstest + CHK_RESULT=0 + '[' 0 -eq 0 ']' + break + '[' 0 -ne 0 ']' + set +x > I don't even know what your actual *symptom* is. I can't deduce it from > a spew of strace output. You haven't described the reason for the touch > command, so I can only presume this is some sort of autofs environment, > hence my attempt to solve the issue, above. The purpose of this function is simply to try and create or modify a test file on a RAID share. It's just a periodic integrity check. The set -x output shows that the 'touch' command failed, but it doesn't show why. I certainly acknowledge the strace output is a lot of stuff to untangle, but it does show what is happening at a process level and the return status for every system call. What I'm seeing is that there are lots of concurrent processes running at the time this particular shell script function is executed. The other processes seem to have an adverse effect on the process that is executing 'touch'. The three system calls that touch makes are all successful. So my question is why is $? getting set to 1? It looks to me like the bash interpreter is returning early from this call to 'touch'. I can't prove this yet though. Here's another example of an unsuccessful touch execution. Notice how the execution keeps getting interrupted and then resumed. [pid 31400] execve("/tmp/test-touch", ["/tmp/test-touch", "/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux", "SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"..., "HUSHLOGIN=FALSE", "USER=root", "PATH=/bin:/sbin:/usr/bin:/usr/sb"..., "MAIL=/var/spool/mail/root", "C_INCLUDE_PATH=/usr/include:/usr"..., "PWD=/usr/local/sbin", "HOME=/root", "SHLVL=2", "LOGNAME=root", "_=/tmp/test-touch"]Process 31401 attached (waiting for parent) [pid 31400] <... execve resumed> ) = 0 [pid 31400] brk(0 [pid 31400] <... brk resumed> ) = 0x2369000 [pid 31400] open("/mnt/array1/.accesstest", O_RDWR|O_CREAT, 0666 [pid 31400] <... open resumed> )= 3 [pid 31400] close(3 [pid 31400] <... close resumed> ) = 0 [pid 31400] utimes("/mnt/array1/.accesstest", {4196394, 0} [pid 31400] <... utimes resumed> ) = 0 [pid 31400] exit_group(0) = ? Process 23756 resumed Process 31400 detached [pid 23756] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 31400 [pid 23756] --- SIGCHLD (Child exited) @ 0 (0) --- [pid 23756] wait4(-1, Process 23756 suspended
Re: exit status issue
>> + touch /mnt/array1/.accesstest >> + CHK_RESULT=1 > > It looks to me that touch is failing and reporting the failure and > that bash is handling it correctly. > >> + touch /mnt/array1/.accesstest >> + CHK_RESULT=0 > > And then on a subsequent pass touch is reporting success. > > This doesn't (yet) look like a problem with bash. Admittedly bash seems to do the right thing if you go by the set -x execution trace. If you go by that, it would indeed seem that the call to touch is failing. But the strace output tells a different story. According to it, the system calls that touch makes are succeeding. In fact, I created my own 'touch' and 'stat' programs, both of which log any errors. I never see any errors logged, so I'm pretty sure they are getting executed which jives with what I'm seeing in the strace output. Is this a lower level scheduling problem? It certainly could be. I'm testing on Linux 2.6.39.4. I'll rollback to an older kernel, libc, and libpthread and see if that makes any difference. > >> The purpose of this function is simply to try and create or modify a >> test file on a RAID share. It's just a periodic integrity check. > > If I ever had even one single failure on a raid filesystem I would > start replacing hardware. I wouldn't be trying to retry. I wouldn't > be trying to umount and mount it again. ANY failures would mean > something so severely broken that I would consider the system unusable > until repaired. Yep, I agree - these sort of failures should not happen unless there is something terribly wrong. > > On the other hand, do you really mean an NFS mounted filesystem? > Perhaps using the nfs automounter? If so then say so. Don't say > "share" which implies a MS Windows SMB / CIFS network mount. The NFS > client subsystem is notoriously bad. The nfs automounter is worse. I > am guessing you probably really mean that you are using an nfs mounted > filesystem using the automounter. Pretty much everyone in that > situation has tried to recover from problems with it at one time or > another. Not doing anything with NFS mounting thankfully. It's a RAID array on a NAS device which is typically used for SMB / CIFS network mount.
Re: exit status issue
Bob, Yes I agree with everything you've said. I am grasping at straws, which I really don't like. It's very unlikely that the problems are with any of these things I've mentioned bash, Linux kernel, compiler, libc etc. This script is fairly large and it does lots of checking of block devices, raid arrays, and volumes. All of it is executed synchronously which should make it simple to debug. The script itself is executed in the background as a daemon process. This script is executed on locally mounted RAID devices. No network involved in the test. I'll continue my search for root cause with a more disciplined approach. Thanks for the help.
Re: exit status issue
Hi All, Just wanted to let you know that I have found a solution to my problem, though I can't entirely explain it yet. As mentioned earlier, this script was designed to run as a daemon, i.e. run in the background. This is what I was doing to daemonize the script: Daemon() { while : do checkStatus sleep 1 done } trap CleanupExit INT trap CleanupExit TERM trap CleanupExit ILL trap CleanupExit QUIT trap CleanupExit ABRT trap CleanupExit KILL if [ x"$1" = x"start" ]; then Daemon 0<&- 1>/dev/null 2>&1 & echo $! > $PIDFILE elif [ x"$1" = x"stop" ]; then TMP=`cat $PIDFILE` kill $TMP CleanupExit else : fi When running like this, I was seeing 'touch' command failures inside function fsaccesstest() shown previously. I did an experiment last night and ran this script in the foreground (removed the &) from the call to Daemon. To my surprise, I did not see any errors with the 'touch' command. I then went back and changed how I was turning this script into a daemon, by using the start-stop-daemon utility instead like this: start-stop-daemon --start --background --quiet --exec $DAEMON Once I did this, my shell script is now running in the background without any errors. Life is good. Perhaps when I was calling Daemon & as shown above, the parent process was not exiting properly and was somehow interfering in the child process execution. I'm not really sure. Anyhow, I'm past this problem now. Thank you all for the good ideas and generous help.