Re: [Beowulf] Troubleshooting NFS stale file handles

Neil McFadyen Sun, 23 Apr 2017 06:24:42 -0700

I had a similar problem and it turned out to be a disk problem. SMART
attributes showed high 188 Command_Timeout values for 1 of the disks in
the RAID array on the storage server.

The server would become inaccessible, i.e., I couldn't even ping it, with
no errors in the server's logs. I had to reboot the server; then it would
work for a while before the problem happened again. Changing the disk
fixed it.
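
For anyone wanting to check for the same symptom, here is a minimal Python sketch that pulls attribute 188 out of `smartctl -A` output. It assumes smartmontools is installed; the function names and the sample text are made up for illustration, and disks behind a hardware RAID controller may need an extra `-d` device-type option that is not shown here. How the raw value is encoded depends on the drive vendor, so it is probably best read as a trend rather than an exact count.

```python
import subprocess


def parse_smart_attributes(text):
    """Parse the attribute table from `smartctl -A` into {id: (name, raw_value)}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() test.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return attrs


def read_smart_attributes(device):
    """Run smartctl on a device (hypothetical helper; needs root on most systems)."""
    # smartctl uses its exit status as a bitmask, so a non-zero status is not
    # treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)


# Illustrative sample only, not output from the server discussed above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       42
"""
```

Calling `parse_smart_attributes(SAMPLE).get(188)` returns the name and raw value for Command_Timeout, and `read_smart_attributes("/dev/sda")` would do the same against a real disk (the device path is only an example).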


Neil McFadyen
Carleton University


On 2017-04-19 1:58 PM, Prentice Bisbal wrote:

Here's the sequence of events:

1. First job(s) run fine on the node and complete without error.
2. Eventually a job fails with a 'permission denied' error when it
tries to access /l/hostname.

Since no jobs fail with a file I/O error, it's hard to confirm that
the jobs themselves are causing the problem. However, if these
particular jobs are the only thing running on the cluster and should
be the only jobs accessing these NFS shares, what else could be
causing them?

All these systems are getting their user information from LDAP. Since
some jobs run before these errors appear, lack of, or inaccurate user
info doesn't seem to be a likely source of this problem, but I'm not
ruling anything out at this point.

Important detail: This is NFSv3.
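
One way to rule the user-information theory in or out: with NFSv3 under the default AUTH_SYS security, the server checks the numeric uid/gid sent by the client, so a user who resolves to different numbers on different nodes could see 'permission denied'. The following is a rough Python sketch, not something that was run on this cluster. The function names are hypothetical, and how the per-node results get collected (ssh, the batch system, etc.) is left out.

```python
import pwd
from collections import Counter


def local_identity(username):
    """Return (uid, gid) as this node resolves the user (via NSS, e.g. LDAP)."""
    entry = pwd.getpwnam(username)  # raises KeyError if the user is unknown
    return (entry.pw_uid, entry.pw_gid)


def find_mismatches(identities):
    """Given {hostname: (uid, gid) or None}, return the hosts whose answer
    differs from the most common one across the cluster."""
    if not identities:
        return {}
    expected, _count = Counter(identities.values()).most_common(1)[0]
    return {host: ident for host, ident in identities.items() if ident != expected}
```

Running `local_identity()` on every node for the job's user and feeding the results to `find_mismatches()` would show any node that disagrees with the rest; an empty result suggests the user info is consistent.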

Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov

On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
Are you saying they can’t mount the filesystem, or they can’t write
to a mounted filesystem? Where does this system get its user
information from, if the latter?
--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State     |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
      `'
On Apr 19, 2017, at 12:09, Prentice Bisbal <pbis...@pppl.gov> wrote:

Beowulfers,
I've been trying to troubleshoot a problem for the past two weeks
with no luck. We have a cluster here that runs only one application
(although the details of that application change significantly from
run to run). Each node in the cluster has an NFS export, /local,
that can be automounted by every other node in the cluster as
/l/hostname.

Starting about two weeks ago, when jobs would try to access
/l/hostname, they would get permission denied messages. I tried
analyzing this problem by turning on all NFS/RPC logging with
rpcdebug and also using tcpdump while trying to manually mount one
of the remote systems. Both approaches indicated stale file handles
were preventing the share from being mounted.

Since it has been 6-8 weeks since there were any seemingly relevant
system config changes, I suspect it's an application problem
(naturally). On the other hand, the application developers/users
insist that they haven't made any changes to their code, either. To
be honest, there's no significant evidence indicating either is at
fault. Any suggestions on how to debug this and definitively find
the root cause of these stale file handles?
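
As a starting point for narrowing this down, a small probe can record which /l/hostname paths fail and whether the kernel reports ESTALE or EACCES at the time. This is only a sketch under assumptions: the function names and the hostname list are placeholders, and a call against a hung NFS mount may block, so it would probably need a timeout wrapper in practice.

```python
import errno
import os


def classify_access(path):
    """Try to list a directory and report how the attempt ended."""
    try:
        os.listdir(path)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            return "stale file handle"
        if exc.errno in (errno.EACCES, errno.EPERM):
            return "permission denied"
        if exc.errno == errno.ENOENT:
            return "missing"
        # Fall back to the symbolic errno name (e.g. "EIO") for anything else.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


def probe(hostnames, root="/l"):
    """Check the automounted path for each host; returns {hostname: status}."""
    return {host: classify_access(os.path.join(root, host)) for host in hostnames}
```

Run from cron or a job prolog/epilog with timestamps, something like `probe(["node01", "node02"])` (hostnames invented here) could show when a path first goes bad and whether that lines up with a particular job.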
--
Prentice
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf

--
Neil McFadyen, M.Eng., P.Eng.
Supervisor of Computer Operations
Mechanical & Aerospace Engineering
Carleton University
Ottawa, Ontario
K1S 5B6
tel: 613-520-2600 ext 5636
fax: 613-520-5715
email: nmcfa...@mae.carleton.ca
    or neilb.mcfad...@carleton.ca

