Re: [Beowulf] Troubleshooting NFS stale file handles

Prentice Bisbal Thu, 20 Apr 2017 14:22:43 -0700


On 04/20/2017 05:04 AM, Tim Cutts wrote:

I've seen, in the past, problems with fragmented packets being misinterpreted, 
resulting in stale NFS symptoms. In that case it was an Intel STL motherboard 
(we're talking 20 years ago here), which shared a NIC for management as well as 
the main interface.  The fragmented packets got inappropriately intercepted by 
the management processor and never reached Linux.  That took ages to nail down.
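
One way to look for lost fragments on a current Linux node is to watch the kernel's IP reassembly counters. The sketch below is not from the thread. It assumes the usual two-line "Ip:" layout of /proc/net/snmp (a line of names followed by a line of values), and the helper names are made up for illustration. A ReasmFails count that climbs while NFS traffic is running would be a hint, not proof, that fragments are going missing.

```python
def parse_ip_counters(snmp_text):
    """Return the 'Ip:' counters from /proc/net/snmp-style text as a dict."""
    header = None
    for line in snmp_text.splitlines():
        if not line.startswith("Ip:"):
            continue
        fields = line.split()[1:]
        if header is None:
            # The first "Ip:" line holds the counter names.
            header = fields
        else:
            # The second "Ip:" line holds the matching values.
            return dict(zip(header, (int(v) for v in fields)))
    return {}


def reassembly_delta(before, after):
    """Change in the fragment reassembly counters between two snapshots."""
    keys = ("ReasmReqds", "ReasmOKs", "ReasmFails")
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


def read_ip_counters(path="/proc/net/snmp"):
    """Take a snapshot on a live Linux host (path is the usual default)."""
    with open(path) as f:
        return parse_ip_counters(f.read())
```

Take one snapshot with read_ip_counters(), let a job run for a while, take a second snapshot, and pass both to reassembly_delta().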


One question I was going to ask - which automounter are you using?  autofs or 
am-utils?

Whatever comes with CentOS 6.8. 'rpm -qi autofs' says I'm using autofs from http://wiki.autofs.net/


Tim

Sent from my iPhone

On 19 Apr 2017, at 7:11 pm, Prentice Bisbal <pbis...@pppl.gov> wrote:

Ellis,

Thanks for the suggestion(s). Just this morning I started considering the 
network as a possible source of error. My stale file handle errors are easily 
fixed by just restarting the nfs servers with 'service nfs restart', so they 
aren't as severe as you describe.

Prentice

On 04/19/2017 02:03 PM, Ellis H. Wilson III wrote:
Here are a couple of conditions that I've seen cause stale NFS file handles.  
These are rather high-level, just to get you started.  Sorry, short 
on time today:

1. Are you sure your NFS server isn't getting swamped by the jobs such that it 
drops packets back to the clients?  Completely overwhelming an NFS server for 
sufficient lengths of time might cause this, though it's rare.

2. Are you sure that your clients (and the NFS server itself) have a solid 
network connection?  Frequent network hiccups can trigger stale NFS file 
handles that occasionally require a hard reboot for me.  This is the more 
common case I see.

Both of these essentially relate to the same thing, which is the connection 
between the NFS server and clients becoming stalled for too long a time at some 
point.  In theory NFS should deal with this gracefully, but there are 
corner-cases (that ironically get hit more often than I feel like they should) 
where it gets stuck in a way that's rather sticky and tends to require reboot.
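
Both conditions tend to show up on the client as RPC retransmissions. As a rough sketch (not something posted in the thread), the following parses the "rpc" line of /proc/net/rpc/nfs, which on Linux NFS clients carries the call and retransmission counts that 'nfsstat -c' reports, and computes the retransmission ratio between two snapshots. The function names are hypothetical, and what counts as "too many" retransmissions is left as an assumption for the reader.

```python
def parse_rpc_client_stats(text):
    """Extract call and retransmission counts from /proc/net/rpc/nfs text.

    The relevant line has the form: rpc <calls> <retrans> <authrefrsh>
    """
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "rpc":
            return {"calls": int(parts[1]), "retrans": int(parts[2])}
    return None


def retrans_ratio(before, after):
    """Fraction of RPC calls retransmitted between two snapshots."""
    calls = after["calls"] - before["calls"]
    retrans = after["retrans"] - before["retrans"]
    # Avoid dividing by zero when the client was idle between snapshots.
    return retrans / calls if calls > 0 else 0.0


def read_rpc_client_stats(path="/proc/net/rpc/nfs"):
    """Take a snapshot on a live NFS client (path is the usual default)."""
    with open(path) as f:
        return parse_rpc_client_stats(f.read())
```

Comparing the ratio on a node that has just hit the error against a healthy node would help show whether the stalls line up with the stale handles.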

Best,

ellis

On 04/19/2017 01:58 PM, Prentice Bisbal wrote:
Here's the sequence of events:

1. First job(s) run fine on the node and complete without error.

2. Eventually a job fails with a 'permission denied' error when it tries
to access /l/hostname.

Since no jobs fail with a file I/O error, it's hard to confirm that the
jobs themselves are causing the problem. However, if these particular
jobs are the only thing running on the cluster and should be the only
jobs accessing these NFS shares, what else could be causing them?

All these systems are getting their user information from LDAP. Since
some jobs run before these errors appear, missing or inaccurate user
info doesn't seem to be a likely source of this problem, but I'm not
ruling anything out at this point.

Important detail: This is NFSv3.

Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov

On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
Are you saying they can’t mount the filesystem, or they can’t write to
a mounted filesystem? Where does this system get its user information
from, if the latter?

--
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'

On Apr 19, 2017, at 12:09, Prentice Bisbal <pbis...@pppl.gov> wrote:

Beowulfers,

I've been trying to troubleshoot a problem for the past two weeks
with no luck. We have a cluster here that runs only one application
(although the details of that application change significantly from
run to run). Each node in the cluster has an NFS export, /local,
that can be automounted by every other node in the cluster as
/l/hostname.

Starting about two weeks ago, when jobs would try to access
/l/hostname, they would get permission denied messages. I tried
analyzing this problem by turning on all NFS/RPC logging with
rpcdebug and also using tcpdump while trying to manually mount one of
the remote systems. Both approaches indicated stale file handles were
preventing the share from being mounted.

Since it has been 6-8 weeks since there were any seemingly relevant
system config changes, I suspect it's an application problem
(naturally). On the other hand, the application developers/users
insist that they haven't made any changes to their code, either. To
be honest, there's no significant evidence indicating either is at
fault. Any suggestions on how to debug this and definitively find the
root cause of these stale file handles?
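
One low-tech way to narrow this down is to probe each /l/hostname path and record which error the kernel actually returns, since ESTALE and EACCES are distinct errno values. This is only a sketch: the helper names are hypothetical, the host list would have to come from the cluster itself, and a result of "permission denied" shows only what the system call returned, not why the automount failed underneath.

```python
import errno
import os


def classify_oserror(err):
    """Map an OSError to a short label that separates ESTALE from EACCES."""
    if err.errno == errno.ESTALE:
        return "stale file handle"
    if err.errno in (errno.EACCES, errno.EPERM):
        return "permission denied"
    # Anything else (ENOENT, ETIMEDOUT, ...) is reported as-is.
    return "other: " + (os.strerror(err.errno) if err.errno else str(err))


def probe(path):
    """Touch a path (triggering any automount) and report the outcome."""
    try:
        os.listdir(path)
    except OSError as err:
        return classify_oserror(err)
    return "ok"


def probe_hosts(hosts, prefix="/l"):
    """Probe <prefix>/<host> for each host name; returns {host: outcome}."""
    return {host: probe(os.path.join(prefix, host)) for host in hosts}
```

Running probe_hosts() from cron on every node with timestamps logged would show which exporter goes bad first and when, which can then be lined up against job start times.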

--
Prentice
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf

