Beowulfers,

I've been trying to troubleshoot a problem for the past two weeks with no luck. We have a cluster here that runs only one application (although the details of that application change significantly from run to run). Each node in the cluster has an NFS export, /local, that can be automounted by every other node in the cluster as /l/hostname.

Starting about two weeks ago, when jobs tried to access /l/hostname, they would get "permission denied" messages. I tried analyzing this problem by turning on all NFS/RPC logging with rpcdebug and also by running tcpdump while manually mounting one of the remote systems. Both approaches indicated that stale file handles were preventing the share from being mounted.
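
As a minimal sketch of one way to narrow this down, the script below stats a list of mount paths and separates ESTALE (stale file handle) from EACCES/EPERM (permission denied), since the jobs report one error and the traces point to the other. The function name classify_mounts and the example hostnames are hypothetical and not part of the setup described above; only the /l/hostname layout is taken from this post.

```python
import errno
import os


def classify_mounts(paths, stat=os.stat):
    """Stat each path and return {path: status}.

    status is 'ok', 'stale' (ESTALE), 'denied' (EACCES/EPERM),
    or 'error:<ERRNO NAME>' for anything else.
    The stat argument is injectable so the logic can be tested
    without real NFS mounts.
    """
    results = {}
    for path in paths:
        try:
            stat(path)  # on an automounted path this also triggers the mount
            results[path] = "ok"
        except OSError as exc:
            if exc.errno == errno.ESTALE:
                results[path] = "stale"
            elif exc.errno in (errno.EACCES, errno.EPERM):
                results[path] = "denied"
            else:
                name = errno.errorcode.get(exc.errno, str(exc.errno))
                results[path] = "error:" + name
    return results
```

Calling classify_mounts(["/l/node01", "/l/node02"]) (hypothetical hostnames) from each node, on a schedule, would show which client/server pairs fail and with which errno, which might help correlate the failures with job activity or server-side events.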

Since it has been 6-8 weeks since there were any seemingly relevant system config changes, I suspect it's an application problem (naturally). On the other hand, the application developers/users insist that they haven't made any changes to their code, either. To be honest, there's no significant evidence indicating that either is at fault. Any suggestions on how to debug this and definitively find the root cause of these stale file handles?

--
Prentice
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
