On 04/19/2017 05:52 PM, Bernd Schubert wrote:
On 04/19/2017 07:58 PM, Prentice Bisbal wrote:
Here's the sequence of events:
1. First job(s) run fine on the node and complete without error.
2. Eventually a job fails with a 'permission denied' error when it tries
to access /l/hostname.
So you don't get ESTALE, but you get EACCES? You *might* be able to fix
this by setting the 'no_subtree_check' option in your /etc/exports. I don't
remember the details exactly anymore, but nfsd/exportfs check more
intensively whether a dentry is valid if this option is not given.
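
To make the /etc/exports suggestion concrete, here is a minimal sketch that
scans exports-style text and reports, per export entry, whether
'subtree_check' or 'no_subtree_check' is set explicitly or neither is given.
The function name, the sample paths and the client network are hypothetical,
and the parser is deliberately simplified (one export per line, no line
continuations, no quoted paths, no '-' default-option fields).

```python
import re


def subtree_check_state(exports_text):
    """Map (path, client) -> 'subtree_check', 'no_subtree_check' or 'unset'.

    Simplified parser: one export per line, no backslash continuations,
    no quoted paths, no '-' default-option fields.
    """
    state = {}
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and blank lines
        if not line:
            continue
        path, *clients = line.split()
        for client in clients:
            # A client field looks like host(opt1,opt2,...) or just host.
            m = re.match(r"^([^()]*)\(([^)]*)\)$", client)
            host, opts = (m.group(1), m.group(2).split(",")) if m else (client, [])
            if "no_subtree_check" in opts:
                state[(path, host)] = "no_subtree_check"
            elif "subtree_check" in opts:
                state[(path, host)] = "subtree_check"
            else:
                state[(path, host)] = "unset"
    return state


# Hypothetical sample; on a real server you would read /etc/exports instead.
sample = """\
# hypothetical exports entries
/l/node01     10.0.0.0/24(rw,sync)
/export/root  10.0.0.0/24(ro,sync,no_subtree_check)
/export/home  10.0.0.0/24(rw,subtree_check)
"""

for key, value in subtree_check_state(sample).items():
    print(key, value)
```

If I read exports(5) correctly, nfs-utils 1.1.0 and later default to
no_subtree_check when neither option is given, so an 'unset' entry may
already behave that way. Running 'exportfs -v' on the server should show the
options actually in effect.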
I don't remember seeing either ESTALE or EACCES, just that there was a
message about stale file handles. I didn't save the messages I captured with
tcpdump, and I had to delete my /var/log/messages files because when I
turned on all the logging I could with rpcdebug, it filled up /var in less
than a day, and I needed to free up space in /var. I should have copied
them somewhere else instead of just deleting them, in hindsight.
I rebooted the systems yesterday, and the problem has gone away since
the reboot, so I can't reproduce the problem and send you the relevant
messages. I'm not a smart man.
I don't think that networking can be a cause for this, but if a
dentry/inode is evicted from the server side cache, the NFS file handle
has to be used to re-create the inode and dentry on the server side from
the underlying file system. I think EACCES is then used if something goes
wrong connecting the dentry to the parent dentry (I need to look up the
exact details again, it's been a while since I had to deal with this).
Are these meanings of EACCES and ESTALE defined in the NFS RFCs? If so,
I may need to read them.
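
On the client side the two conditions show up as ordinary errno values, so a
job wrapper can tell them apart without any packet capture. The following is
only a sketch: the helper names describe, classify and probe are made up for
illustration, and the exact message text returned by the C library varies
between versions. As far as I know, the protocol-level counterparts in
RFC 1813 (NFSv3) are NFS3ERR_ACCES and NFS3ERR_STALE.

```python
import errno
import os


def describe(code):
    """Return 'NAME (number): message' for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


def classify(exc):
    """Classify an OSError raised while touching a path on an NFS mount."""
    if exc.errno == errno.ESTALE:
        return "stale file handle"
    if exc.errno in (errno.EACCES, errno.EPERM):
        return "permission denied"
    return "other"


def probe(path):
    """Try to list `path` and report how it failed, if it did."""
    try:
        os.listdir(path)
        return "ok"
    except OSError as exc:
        return classify(exc)


# Show how the local C library names and words the two errors.
print(describe(errno.EACCES))
print(describe(errno.ESTALE))
```

Calling probe() on the affected directory (for example the /l/hostname path
from the failing job) right before the job starts would at least record
which of the two errors the client actually sees.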
You could try setting /proc/sys/vm/vfs_cache_pressure to a very low value
(don't set it to 0, though). Depending on your file system and kernel
version, this might help to keep dentries/inodes in the cache and avoid
running into this (there was a bug until 3.10 that prevented this from
working properly; I'm not sure whether the related patch series has been
backported into vendor kernels).
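
A minimal sketch of reading and lowering that tunable, assuming root on the
NFS server. The helper names are hypothetical, the value 10 in the comment is
just an example of "very low" and not a tested recommendation, and the guard
against 0 mirrors the warning above. A write to /proc does not persist across
reboots.

```python
from pathlib import Path

# Kernel tunable discussed above; the kernel default is 100.
PRESSURE = Path("/proc/sys/vm/vfs_cache_pressure")


def read_pressure(path=PRESSURE):
    """Return the current vfs_cache_pressure as an integer."""
    return int(path.read_text().strip())


def set_pressure(value, path=PRESSURE):
    """Write a new vfs_cache_pressure value (needs root on a real system)."""
    if value <= 0:
        # Setting 0 is explicitly advised against in the thread.
        raise ValueError("refusing to set vfs_cache_pressure to 0 or below")
    path.write_text(f"{value}\n")


if __name__ == "__main__":
    # Only read by default; changing the value is left to the admin, e.g.
    # set_pressure(10) as root.
    if PRESSURE.exists():
        print("current vfs_cache_pressure:", read_pressure())
```

The same change can be made with 'sysctl -w vm.vfs_cache_pressure=<value>';
to keep it after a reboot it would also have to go into /etc/sysctl.conf.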
Thanks for the tip. I'll keep it in mind.
Btw, which kernel version and file system is your nfs server running on?
Both servers and clients are running the same exact version of
everything, since they are using the same NFS root filesystem:
$ cat /etc/redhat-release
CentOS release 6.8 (Final)
$ cat /proc/version
Linux version 2.6.32-642.11.1.el6.x86_64
(mockbu...@c1bm.rdu2.centos.org) (gcc version 4.4.7 20120313 (Red Hat
4.4.7-17) (GCC) ) #1 SMP Fri Nov 18 19:25:05 UTC 2016
$ rpm -qa | grep -i nfs
nfs-utils-lib-1.1.5-11.el6.x86_64
nfs-utils-1.2.3-70.el6_8.2.x86_64
nfs4-acl-tools-0.3.3-8.el6.x86_64
Bernd
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf