exit status issue

2011-11-17 Thread Dallas Clement
Hello,

I've got a script that is periodically performing a disk / raid array
integrity check by touching a file.  The touch is failing
occasionally, with an exit status of 1.  If I do 'strace touch'
instead of just 'touch' the strace indicates an exit status of 0.  So
I'm not sure why this is getting propagated back as a 1 to the script.
  Would appreciate any insight you could offer.  Here is my script and
the strace output for when touch returns 1.

fsaccesstest()
{
local RETRY_MAX=3
local RETRY_INTERVAL=3
local CHK_RESULT=1

TMP=`grep $1 /proc/mounts|awk '{print $1}'`
if [ "$TMP" != "" ]; then
for i in `seq 1 ${RETRY_MAX}`
do
touch $1/.accesstest
CHK_RESULT=$?
if [ ${CHK_RESULT} -eq 0 ] ; then
break
fi
echo "*** fsaccesstest test $1 failed: 
status=${CHK_RESULT}.
retrying... (${i}) ***"
sleep ${RETRY_INTERVAL}
done
if [ $CHK_RESULT -ne 0 ]; then
echo "*** fsaccess test $1 failed. ***" >/dev/console
## try to remount it.  It may be busy, so retry if 
necessary.
for j in `seq 1 ${RETRY_MAX}`
do
umount -f -l $1 # Force a lazy unmount
CHK_RESULT=$?
if [ ${CHK_RESULT} -eq 0 ]; then
break
fi
sleep ${RETRY_INTERVAL}
done
if [ ${CHK_RESULT} -eq 0 ]; then
mount $TMP $1
CHK_RESULT=$?
if [ ${CHK_RESULT} -ne 0 ]; then
echo "*** fsaccesstest failed to mount 
$1. ***" >/dev/console
return ${CHK_RESULT}
fi
else
echo "*** fsaccesstest failed to unmount $1. 
***" >/dev/console
return ${CHK_RESULT}
fi
fi
return 0
fi
return 1
}

execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"],
["HOSTNAME=TS-2RVED8", "TERM=linux", "SHELL=/bin/sh",
"LIBRARY_PATH=/usr/lib:/usr/local"..., "HUSHLOGIN=FALSE", "USER=root",
"PATH=/bin:/sbin:/usr/bin:/usr/sb"..., "MAIL=/var/spool/mail/root",
"C_INCLUDE_PATH=/usr/include:/usr"..., "PWD=/root", "HOME=/root",
"SHLVL=2", "LOGNAME=root", "_=/usr/local/bin/strace"]) = 0
brk(0)  = 0x1702000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f459915
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f459914f000
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)  = 3
fstat(3, {st_dev=makedev(8, 162), st_ino=90088, st_mode=S_IFREG|0644,
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=40,
st_size=19548, st_atime=2011/11/15-08:39:02,
st_mtime=2011/11/15-08:39:00, st_ctime=2011/11/15-08:39:01}) = 0
mmap(NULL, 19548, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f459914a000
close(3)= 0
open("/lib64/libcrypt.so.1", O_RDONLY)  = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\n\0"..., 832) = 832
fstat(3, {st_dev=makedev(8, 162), st_ino=285064, st_mode=S_IFREG|0755,
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=48,
st_size=23096, st_atime=2011/11/15-08:10:31,
st_mtime=2010/06/29-12:51:27, st_ctime=2011/11/15-08:07:41}) = 0
mmap(NULL, 1257888, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f4598f03000
mprotect(0x7f4598f08000, 1044480, PROT_NONE) = 0
mmap(0x7f4599007000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4000) = 0x7f4599007000
mmap(0x7f4599009000, 184736, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4599009000
close(3)= 0
open("/lib64/libm.so.6", O_RDONLY)  = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360?\0\0"..., 832) = 832
fstat(3, {st_dev=makedev(8, 162), st_ino=285045, st_mode=S_IFREG|0755,
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=1064,
st_size=538200, st_atime=2011/11/15-08:10:31,
st_mtime=2010/06/29-12:51:27, st_ctime=2011/11/15-08:07:41}) = 0
mmap(NULL, 1569000, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f4598d83000
mprotect(0x7f4598e01000, 1048576, PROT_NONE) = 0
mmap(0x7f4598f01000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7e000) = 0x7f4598f01000
close(3)= 0
open("/lib64/libc.so.6", O_RDONLY) 

Re: exit status issue

2011-11-18 Thread Dallas Clement
On Fri, Nov 18, 2011 at 6:16 AM, Geir Hauge  wrote:
> Add ''set -x'' at the start of the function and examine the output to see if
> it actually runs touch from PATH.

The strace output is showing that the correct 'touch' is being executed.

[pid  6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"],
[/* 14 vars */]) = 0
Process 6748 attached
Process 13686 suspended
[pid  6748] execve("/usr/bin/logger", ["logger", "-s", "-t",
"diskmon", "-p", "local0.info", "*** fsaccesstest test
/mnt/array"...], [/* 14 vars */]) = 0
Process 13686 resumed
Process 6747 detached
[pid 13686] --- SIGCHLD (Child exited) @ 0 (0) ---
Process 13686 suspended
diskmon: *** fsaccesstest test /mnt/array1 failed: status=1. retrying... (1) ***
Process 13686 resumed
Process 6748 detached
--- SIGCHLD (Child exited) @ 0 (0) ---

Notice that the child process pid=6747 exits with status 0, but $?=1
after returning from executing touch.  This triggers the failure.

I just upgrade my bash from version 3.21 to 4.2 and I'm still seeing
this problem.

I am seeing this failure whether I execute the 'touch' command or the
'stat' command.  Both of these are incidentally busybox
implementations.

 ls -al /bin/touch
lrwxrwxrwx1 root root7 Nov 15 08:07 /bin/touch -> busybox

ls -al /bin/stat
lrwxrwxrwx1 root root7 Nov 15 08:07 /bin/stat -> busybox

Are there any bash debugging tools available that I can use to halt on
a data breakpoint and examine state?

Thanks,
Dallas



Re: exit status issue

2011-11-18 Thread Dallas Clement
On Fri, Nov 18, 2011 at 2:07 PM, Greg Wooledge  wrote:
>> Notice that the child process pid=6747 exits with status 0
>
> Where does it say that?  How are you even invoking strace here?  I don't
> know how to read this output, nor do I know how to reproduce it.

[pid  6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"],
[/* 14 vars */]) = 0

The = 0 at the end is the exit status.

I know it's complicated but this script is run as a daemon.  Was using
strace with -f option to follow through forks, i.e.

strace -f -e execve ./diskmon_exec.sh  start > /mnt/array1/share/strace.txt 2>&1

>> I just upgrade my bash from version 3.21 to 4.2 and I'm still seeing
>> this problem.
>
> Simplify the script, and reproduce the problem with something we can
> actually read and understand.  For example,
>
> #!/bin/bash
> f() {
>  touch /root/nopermission
>  echo "\$? is $?"
> }
> f

> If you can't reproduce the problem in the simpler script, then you know
> the problem is being triggered by something in the more complex script.
> That will help you narrow down the cause.
>

Yes, I agree this is the right way to approach it.  However, I have
not been able to reproduce the problem with a simple function as you
have shown.  It is almost definitely interaction with the rest of what
is going on in the complicated script.  In fact, as I comment out
portions of it, the problem goes away.  But I can't put my finger on
any one thing that is the root cause.  It feels like something is
getting clobbered in the execution context.  This failure does not
happen every time this function is executed.  It is very hit and miss.

Currently I am running bash -x to see if I can get it to happen that way.



Re: exit status issue

2011-11-18 Thread Dallas Clement
On Fri, Nov 18, 2011 at 2:35 PM, Greg Wooledge  wrote:
> On Fri, Nov 18, 2011 at 02:23:54PM -0600, Dallas Clement wrote:
>> [pid  6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"],
>> [/* 14 vars */]) = 0
>>
>> The = 0 at the end is the exit status.
>
> That is the return value of the execve() call.  All it means is that
> the /bin/touch program was successfully launched.
>
> $ strace -f bash -c false
> ...
> execve("/bin/bash", ["bash", "-c", "false"], [/* 32 vars */]) = 0
> ...
>
> arc3:~$ bash -c false; echo $?
> 1
>

Yes, you are sure right on all counts.  The (0) in SIGCHLD (Child
exited) @ 0 (0) is clearly not the exit status of the child either.
Proven by a simple test script.

#!/tmp/bash3

touch /bogus/file
echo $?

-

root@TS-2RVED8:/tmp# strace -f -e execve ./test.sh
execve("./test.sh", ["./test.sh"], [/* 15 vars */]) = 0
Process 7269 attached
Process 7268 suspended
[pid  7269] execve("/bin/touch", ["touch", "/bogus/file"], [/* 14 vars */]) = 0
touch: /bogus/file: No such file or directory
Process 7268 resumed
Process 7269 detached
--- SIGCHLD (Child exited) @ 0 (0) ---
1
Process 7268 detached

Thanks for setting me straight guys.  I was misled.  It looks like the
'touch' command really could be failing in my script.



Re: exit status issue

2011-11-22 Thread Dallas Clement
It turns out that my 'touch' command is actually getting executed, but
not when I expect it.  Here is a recap of my script function which is
occasionally misbehaving:

# $1 mntpoint
fsaccesstest()
{
local RETRY_MAX=3
local RETRY_INTERVAL=3
local CHK_RESULT=1

TMP=`grep $1 /proc/mounts|awk '{print $1}'`
if [ "$TMP" != "" ]; then
for i in `seq 1 ${RETRY_MAX}`
do
touch $1/.accesstest
CHK_RESULT=$?
if [ ${CHK_RESULT} -eq 0 ] ; then
break
fi
logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** 
fsaccesstest test $1
failed. retrying... (${i}) ***"
sleep ${RETRY_INTERVAL}
done
if [ $CHK_RESULT -ne 0 ]; then
logger -t ${LOGTAG} -p ${LOGFACILITY} "*** fsaccesstest 
test $1 failed. ***"
echo "*** fsaccess test $1 failed. ***" >/dev/console
## try remount it
umount $1
mount $TMP $1
return $?
fi
return 0
fi
return 1
}

Most of the time this function executes the 'touch' command without
errors as shown in the following strace output.  The touch command
make just three system calls to open(), close(), and utimes() which
you can clearly see.

Success Scenario

[pid 23930] execve("/tmp/test-touch", ["/tmp/test-touch",
"/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux",
"SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"...,
"HUSHLOGIN=FALSE", "USER=root", "PATH=/bin:/sbin:/usr/bin:
[pid 23930] brk(0)  = 0x18d2000
[pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9676000
[pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9675000
[pid 23930] access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such
file or directory)
[pid 23930] open("/etc/ld.so.cache", O_RDONLY) = 3
[pid 23930] fstat(3, {st_dev=makedev(8, 162), st_ino=90088,
st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096,
st_blocks=40, st_size=19548, st_atime=2011/11/19-08:39:02,
st_mtime=2011/11/15-08:39:00, st_ctime=2011/11/
[pid 23930] mmap(NULL, 19548, PROT_READ, MAP_PRIVATE, 3, 0) =
0x7f4ab967
[pid 23930] close(3)= 0
[pid 23930] open("/lib64/libc.so.6", O_RDONLY) = 3
[pid 23930] read(3,
"\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\333\1\0"..., 832) =
832
[pid 23930] fstat(3, {st_dev=makedev(8, 162), st_ino=285057,
st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096,
st_blocks=2616, st_size=1331216, st_atime=2011/11/19-08:10:31,
st_mtime=2010/06/29-12:51:27, st_ctime=201
[pid 23930] mmap(NULL, 2343048, PROT_READ|PROT_EXEC,
MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4ab932
[pid 23930] mprotect(0x7f4ab9454000, 1044480, PROT_NONE) = 0
[pid 23930] mmap(0x7f4ab9553000, 20480, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x133000) = 0x7f4ab9553000
[pid 23930] mmap(0x7f4ab9558000, 16520, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9558000
[pid 23930] close(3)= 0
[pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab966f000
[pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab966e000
[pid 23930] arch_prctl(ARCH_SET_FS, 0x7f4ab966e6f0) = 0
[pid 23930] mprotect(0x7f4ab9553000, 12288, PROT_READ) = 0
[pid 23930] munmap(0x7f4ab967, 19548) = 0
[pid 23930] open("/mnt/array1/.accesstest", O_RDWR|O_CREAT, 0666) = 3
[pid 23930] close(3)= 0
[pid 23930] utimes("/mnt/array1/.accesstest", {4196394, 0}) = 0
[pid 23930] exit_group(0)   = ?
Process 23756 resumed
Process 23930 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23930
--- SIGCHLD (Child exited) @ 0 (0) ---

In the failure scenario, I had wrongly assumed that the touch command
was not being executed, when in fact it is.  It's just that the
execution is deferred somehow and occurs after the call to 'logger',
which would not have been executed had the call to 'touch' somehow
failed or returned too soon.  So it looks I am being afflicted with
out of order execution problems.  I am expecting that if my script
makes a call to 'touch' it should not return until the command has
been completely executed, i.e. the child process reaped, and exit
status determined.  This doesn't seem to be happening.

Please forgive for the volume of data.

Failure Scenario

[pid 26415] execve("/tmp/test-touch", ["/tmp/test-touch",
"/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux",
"SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"...,
"HUSHLOGIN=FALSE", "USER=r

Re: exit status issue

2011-11-22 Thread Dallas Clement
> That makes no sense.  Fix your quoting (it's atrocious) and then run
> the function with "set -x".  Don't throw strace at it, at this point.
> It's really not the right tool at this level of debugging.

Okay, I simplified the script slightly, fixed the quoting and re-ran
with set -x.  Here is the script and the output for the failure
scenario.

# $1 mntpoint
fsaccesstest()
{
set -x

local RETRY_MAX=3
local RETRY_INTERVAL=3
local CHK_RESULT=1
local RETVAL=0

TMP=`grep "$1" /proc/mounts|awk '{print $1}'`
if [ "$TMP" != "" ]; then
for i in `seq 1 ${RETRY_MAX}`
do
touch "$1"/.accesstest
CHK_RESULT=$?
if [ ${CHK_RESULT} -eq 0 ] ; then
break
fi
logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** 
fsaccesstest test $1
failed CHK_RESULT=$CHK_RESULT. retrying... (${i}) ***"
sleep ${RETRY_INTERVAL}
done
if [ $CHK_RESULT -ne 0 ]; then
RETVAL=$CHK_RESULT
fi
fi

set +x

return $RETVAL
}

+ local RETRY_MAX=3
+ local RETRY_INTERVAL=3
+ local CHK_RESULT=1
+ local RETVAL=0
++ grep /mnt/array1 /proc/mounts
++ awk '{print $1}'
+ TMP=/dev/md10
+ '[' /dev/md10 '!=' '' ']'
++ seq 1 3
+ for i in '`seq 1 ${RETRY_MAX}`'
+ touch /mnt/array1/.accesstest
+ CHK_RESULT=1
+ '[' 1 -eq 0 ']'
+ logger -s -t diskmon -p local0.info '*** fsaccesstest test
/mnt/array1 failed CHK_RESULT=1. retrying... (1) ***'
diskmon: *** fsaccesstest test /mnt/array1 failed CHK_RESULT=1.
retrying... (1) ***
+ sleep 3
+ for i in '`seq 1 ${RETRY_MAX}`'
+ touch /mnt/array1/.accesstest
+ CHK_RESULT=0
+ '[' 0 -eq 0 ']'
+ break
+ '[' 0 -ne 0 ']'
+ set +x

> I don't even know what your actual *symptom* is.  I can't deduce it from
> a spew of strace output.  You haven't described the reason for the touch
> command, so I can only presume this is some sort of autofs environment,
> hence my attempt to solve the issue, above.

The purpose of this function is simply to try and create or modify a
test file on a RAID share.  It's just a periodic integrity check.

The set -x output shows that the 'touch' command failed, but it
doesn't show why.  I certainly acknowledge the strace output is a lot
of stuff to untangle, but it does show what is happening at a process
level and the return status for every system call.  What I'm seeing is
that there are lots of concurrent processes running at the time this
particular shell script function is executed.  The other processes
seem to have an adverse effect on the process that is executing
'touch'.  The three system calls that touch makes are all successful.
So my question is why is $? getting set to 1?  It looks to me like the
bash interpreter is returning early from this call to 'touch'.  I
can't prove this yet though.

Here's another example of an unsuccessful touch execution.  Notice how
the execution keeps getting interrupted and then resumed.

[pid 31400] execve("/tmp/test-touch", ["/tmp/test-touch",
"/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux",
"SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"...,
"HUSHLOGIN=FALSE", "USER=root", "PATH=/bin:/sbin:/usr/bin:/usr/sb"...,
"MAIL=/var/spool/mail/root", "C_INCLUDE_PATH=/usr/include:/usr"...,
"PWD=/usr/local/sbin", "HOME=/root", "SHLVL=2", "LOGNAME=root",
"_=/tmp/test-touch"]Process 31401 attached (waiting for parent)

[pid 31400] <... execve resumed> )  = 0
[pid 31400] brk(0 
[pid 31400] <... brk resumed> ) = 0x2369000

[pid 31400] open("/mnt/array1/.accesstest", O_RDWR|O_CREAT, 0666

[pid 31400] <... open resumed> )= 3
[pid 31400] close(3 
[pid 31400] <... close resumed> )   = 0
[pid 31400] utimes("/mnt/array1/.accesstest", {4196394, 0} 
[pid 31400] <... utimes resumed> )  = 0
[pid 31400] exit_group(0)   = ?
Process 23756 resumed
Process 31400 detached
[pid 23756] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) ==
0}], 0, NULL) = 31400
[pid 23756] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 23756] wait4(-1, Process 23756 suspended



Re: exit status issue

2011-11-22 Thread Dallas Clement
>> + touch /mnt/array1/.accesstest
>> + CHK_RESULT=1
>
> It looks to me that touch is failing and reporting the failure and
> that bash is handling it correctly.
>
>> + touch /mnt/array1/.accesstest
>> + CHK_RESULT=0
>
> And then on a subsequent pass touch is reporting success.
>
> This doesn't (yet) look like a problem with bash.

Admittedly bash seems to do the right thing if you go by the set -x
execution trace.  If you go by that, it would indeed seem that the
call to touch is failing.  But the strace output tells a different
story.  According to it, the system calls that touch makes are
succeeding.  In fact, I created my own 'touch' and 'stat' programs,
both of which log any errors.  I never see any errors logged, so I'm
pretty sure they are getting executed which jives with what I'm seeing
in the strace output.

Is this a lower level scheduling problem?  It certainly could be.  I'm
testing on Linux 2.6.39.4.   I'll rollback to an older kernel, libc,
and libpthread and see if that makes any difference.

>
>> The purpose of this function is simply to try and create or modify a
>> test file on a RAID share.  It's just a periodic integrity check.
>
> If I ever had even one single failure on a raid filesystem I would
> start replacing hardware.  I wouldn't be trying to retry.  I wouldn't
> be trying to umount and mount it again.  ANY failures would mean
> something so severely broken that I would consider the system unusable
> until repaired.

Yep, I agree - these sort of failures should not happen unless there
is something terribly wrong.

>
> On the other hand, do you really mean an NFS mounted filesystem?
> Perhaps using the nfs automounter?  If so then say so.  Don't say
> "share" which implies a MS Windows SMB / CIFS network mount.  The NFS
> client subsystem is notoriously bad.  The nfs automounter is worse.  I
> am guessing you probably really mean that you are using an nfs mounted
> filesystem using the automounter.  Pretty much everyone in that
> situation has tried to recover from problems with it at one time or
> another.

Not doing anything with NFS mounting thankfully.  It's a RAID array on
a NAS device which is typically used for SMB / CIFS network mount.



Re: exit status issue

2011-11-22 Thread Dallas Clement
Bob, Yes I agree with everything you've said.  I am grasping at
straws, which I really don't like.  It's very unlikely that the
problems are with any of these things I've mentioned bash, Linux
kernel, compiler, libc etc.   This script is fairly large and it does
lots of checking of block devices, raid arrays, and volumes.  All of
it is executed synchronously which should make it simple to debug.
The script itself is executed in the background as a daemon process.
This script is executed on locally mounted RAID devices.  No network
involved in the test.

I'll continue my search for root cause with a more disciplined
approach.   Thanks for the help.



Re: exit status issue

2011-11-23 Thread Dallas Clement
Hi All,

Just wanted to let you know that I have found a solution to my
problem, though I can't entirely explain it yet.  As mentioned
earlier, this script was designed to run as a daemon, i.e. run in the
background.  This is what I was doing to daemonize the script:


Daemon()
{
while :
do
checkStatus
sleep 1
done
}

trap CleanupExit INT
trap CleanupExit TERM
trap CleanupExit ILL
trap CleanupExit QUIT
trap CleanupExit ABRT
trap CleanupExit KILL

if [ x"$1" = x"start" ]; then
Daemon 0<&- 1>/dev/null 2>&1 &
echo $! > $PIDFILE
elif [ x"$1" = x"stop" ]; then
TMP=`cat $PIDFILE`
kill $TMP
CleanupExit
else
:
fi

When running like this, I was seeing 'touch' command failures inside
function fsaccesstest() shown previously.

I did an experiment last night and ran this script in the foreground
(removed the &) from the call to Daemon.  To my surprise, I did not
see any errors with the 'touch' command.

I then went back and changed how I was turning this script into a
daemon, by using the start-stop-daemon utility instead like this:

start-stop-daemon --start --background --quiet --exec $DAEMON

Once I did this, my shell script is now running in the background
without any errors.  Life is good.

Perhaps when I was calling Daemon & as shown above, the parent process
was not exiting properly and was somehow interfering in the child
process execution.  I'm not really sure.

Anyhow, I'm past this problem now.  Thank you all for the good ideas
and generous help.