On Wed, Apr 19, 2017 at 8:34 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:
> My setup isn't nearly that complicated. Every node in this cluster has a
> /local directory that is shared out to the other nodes in the cluster. The
> other nodes automount this by remote directory as /l/hostname, where
> "hostname" is the name of owner of the filesystem. For example, hostB will
> mount hostA:/local as /l/lhostA.

Some more questions to get a better picture:

- At the time the error message appears, are several clients (like hostB)
  mounting the same export from hostA? If so, do they all experience the
  error condition?
- Is the one application the only way to trigger the error message? Or are
  you able (as root or as the user running the application) to reproduce
  the problem with simple tools like ls and cat? If you are, what is the
  output from those tools when the problem appears?
- Do you use Kerberos or some similar mechanism where access is limited in
  time (for Kerberos, by the lifetime of the ticket)?
- Have you tried to fix it on the client side instead of restarting nfsd on
  the server side, e.g. by restarting autofs, forcing a manual unmount and
  then a mount, etc.?
- Do you have logs of the autofs activity, and can you check which remote
  filesystems are mounted (or not...) when the error condition appears?
- Is the one application run by a single user? If so, the error message can
  only mean access to system files. Does the error occur in the same place
  in the application? Do you have the source code of the application, so
  that you could add code to better describe the failure conditions?

Cheers,
Bogdan
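
P.S. To make the "which remote filesystems are mounted" and "ls" checks above easy to repeat on each client, here is a minimal sketch. It assumes a Linux client where /proc/mounts is available and where the automounted directories live under /l/ as you described; the function names are made up for illustration, and it is only a starting point, not a complete diagnostic.

```python
import errno
import os


def nfs_mounts_under(mounts_text, prefix="/l/"):
    """Return (source, mountpoint, fstype) for NFS mounts below prefix.

    mounts_text is the content of /proc/mounts (whitespace-separated fields).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        source, mountpoint, fstype = fields[0], fields[1], fields[2]
        if fstype in ("nfs", "nfs4") and mountpoint.startswith(prefix):
            found.append((source, mountpoint, fstype))
    return found


def probe(path):
    """Do the equivalent of 'ls path'; return 'ok' or the errno name."""
    try:
        os.listdir(path)
        return "ok"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))


# Linux only: report every NFS mount under /l/ and whether it can be listed.
if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as fh:
        for source, mountpoint, fstype in nfs_mounts_under(fh.read()):
            print(source, mountpoint, fstype, probe(mountpoint))
```

Running it both as root and as the user who runs the application, at the moment the error appears, would show whether the mount is present and which errno a plain directory listing returns.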