I've had far fewer unexplained NFS issues (admittedly, I never searched very hard for the guilty party) since I started setting fsid= in my NFS exports. If you aren't setting it, it might be worth a try. NFS seems to recover from problems much better when an fsid is assigned to the root of the exports.
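A minimal sketch of what that looks like in /etc/exports, using the /local export from the quoted thread. The client pattern, the other options, and the value 101 are placeholders, not anyone's real config. Per exports(5), without fsid= the server identifies the filesystem by its UUID or by the device number of the underlying device; pinning it to a fixed number is meant to keep that identity stable.

```
# /etc/exports on each node (illustrative only)
# "*.cluster.example" and "rw,sync" are assumed placeholders;
# fsid=101 is an arbitrary small integer, unique per export on this server
/local  *.cluster.example(rw,sync,fsid=101)
```

After editing, `exportfs -ra` re-exports everything. Note that fsid=0 (or fsid=root) is special-cased as the NFSv4 root, so for NFSv3 exports any other small integer should do.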
jbh

On Wed, Apr 19, 2017 at 8:58 PM Prentice Bisbal <pbis...@pppl.gov> wrote:

> Here's the sequence of events:
>
> 1. First job(s) run fine on the node and complete without error.
>
> 2. Eventually a job fails with a 'permission denied' error when it tries
> to access /l/hostname.
>
> Since no jobs fail with a file I/O error, it's hard to confirm that the
> jobs themselves are causing the problem. However, if these particular
> jobs are the only thing running on the cluster and should be the only
> jobs accessing these NFS shares, what else could be causing them?
>
> All these systems are getting their user information from LDAP. Since
> some jobs run before these errors appear, lack of, or inaccurate, user
> info doesn't seem to be a likely source of this problem, but I'm not
> ruling anything out at this point.
>
> Important detail: This is NFSv3.
>
> Prentice Bisbal
> Lead Software Engineer
> Princeton Plasma Physics Laboratory
> http://www.pppl.gov
>
> On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
> > Are you saying they can’t mount the filesystem, or they can’t write to a
> > mounted filesystem? Where does this system get its user information from,
> > if the latter?
> >
> > --
> >  ____
> > || \\UTGERS,     |---------------------------*O*---------------------------
> > ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
> > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> > ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
> >      `'
> >
> >> On Apr 19, 2017, at 12:09, Prentice Bisbal <pbis...@pppl.gov> wrote:
> >>
> >> Beowulfers,
> >>
> >> I've been trying to troubleshoot a problem for the past two weeks with
> >> no luck. We have a cluster here that runs only one application (although
> >> the details of that application change significantly from run to run).
> >> Each node in the cluster has an NFS export, /local, that can be automounted
> >> by every other node in the cluster as /l/hostname.
> >>
> >> Starting about two weeks ago, when jobs would try to access
> >> /l/hostname, they would get permission denied messages. I tried analyzing
> >> this problem by turning on all NFS/RPC logging with rpcdebug and also using
> >> tcpdump while trying to manually mount one of the remote systems. Both
> >> approaches indicated stale file handles were preventing the share from
> >> being mounted.
> >>
> >> Since it has been 6-8 weeks since there were any seemingly relevant
> >> system config changes, I suspect it's an application problem (naturally).
> >> On the other hand, the application developers/users insist that they
> >> haven't made any changes to their code, either. To be honest, there's no
> >> significant evidence indicating either is at fault. Any suggestions on how
> >> to debug this and definitively find the root cause of these stale file
> >> handles?
> >>
> >> --
> >> Prentice

--
‘[A] talent for following the ways of yesterday, is not sufficient to
improve the world of today.’
 - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf