Re: [Beowulf] Troubleshooting NFS stale file handles

2017-04-20 Thread Prentice Bisbal
On 04/20/2017 05:04 AM, Tim Cutts wrote: I've seen, in the past, problems with fragmented packets being misinterpreted, resulting in stale NFS symptoms. In that case it was an Intel STL motherboard (we're talking 20 years ago here), which shared a NIC for management as well as the main interf

Re: [Beowulf] Troubleshooting NFS stale file handles

2017-04-20 Thread Prentice Bisbal
Thanks for the tip. I hadn't even thought of looking at SMART, although any errors should show up in the logwatch e-mails, which I do check every day, and haven't seen any on these systems. I also heard recently that the smartmontools that come with most Linux distros are horribly old, and the

Re: [Beowulf] Troubleshooting NFS stale file handles

2017-04-20 Thread Prentice Bisbal
On 04/19/2017 05:52 PM, Bernd Schubert wrote: On 04/19/2017 07:58 PM, Prentice Bisbal wrote: Here's the sequence of events: 1. First job(s) run fine on the node and complete without error. 2. Eventually a job fails with a 'permission denied' error when it tries to access /l/hostname. So you

Re: [Beowulf] Troubleshooting NFS stale file handles

2017-04-20 Thread Prentice Bisbal
On 04/19/2017 03:21 PM, Jörg Saßmannshausen wrote: Hi Prentice, three questions (not necessarily to you and it can be dealt with in a different thread too): - why automount and not a static mount? Well, I've been told that, in general, automounting reduces the load(s) on the servers, since th

Re: [Beowulf] Troubleshooting NFS stale file handles

2017-04-20 Thread Charlie Peck
+1 for looking at the MTUs. I just finished debugging what was manifesting as transient NFS problems of various types but turned-out to be MTU mis-matches. charlie > On Apr 20, 2017, at 09:51, Gavin W. Burris wrote: > > Remembering that I once had two switches that were not allowing jumbo fram

Re: [Beowulf] Troubleshooting NFS stale file handles

2017-04-20 Thread Gavin W. Burris
Remembering that I once had two switches that were not allowing jumbo frames over a crossover link. Similar if not the same symptoms. Cheers. On Thu 04/20/17 09:17AM EDT, Gavin W. Burris wrote: > Hi, Prentice. > > Have you checked MTU matches on all NICs and is honored by the router? > > Chee

Re: [Beowulf] Troubleshooting NFS stale file handles

2017-04-20 Thread Gavin W. Burris
Hi, Prentice. Have you checked MTU matches on all NICs and is honored by the router? Cheers. On Wed 04/19/17 02:34PM EDT, Prentice Bisbal wrote: > > On 04/19/2017 02:17 PM, Ellis H. Wilson III wrote: > >On 04/19/2017 02:11 PM, Prentice Bisbal wrote: > >>Thanks for the suggestion(s). Just this m
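A minimal sketch of the MTU comparison suggested above, assuming Linux hosts that expose interface MTUs under /sys/class/net. The function names and the host labels in the usage note are hypothetical and not taken from the thread; how the per-host values are collected (ssh, a parallel shell, etc.) is left open.

```python
from pathlib import Path


def read_local_mtus(sys_net="/sys/class/net"):
    """Return {interface: mtu} for this host, read from Linux sysfs."""
    mtus = {}
    for iface in sorted(Path(sys_net).iterdir()):
        mtu_file = iface / "mtu"
        # Skip entries that are not interfaces (e.g. bonding_masters).
        if mtu_file.is_file():
            mtus[iface.name] = int(mtu_file.read_text().strip())
    return mtus


def find_mtu_mismatches(mtu_by_host):
    """Group hosts by MTU.

    Returns {} when every host reports the same MTU, otherwise
    {mtu: [hosts...]} so the odd ones out are easy to spot.
    """
    groups = {}
    for host, mtu in mtu_by_host.items():
        groups.setdefault(mtu, []).append(host)
    return groups if len(groups) > 1 else {}
```

For example, find_mtu_mismatches({"node01": 9000, "node02": 1500}) returns both groups, flagging the disagreement. Matching interface settings do not prove that every switch on the path passes jumbo frames; on Linux, a ping with the don't-fragment bit set (for instance `ping -M do -s 8972 <host>` for a 9000-byte MTU) is one way to test the path itself.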

Re: [Beowulf] Troubleshooting NFS stale file handles

2017-04-20 Thread John Hearns
The value for tcp_slot_table_entries seemed very low to me on our system. However, reading up on it the value is autotuned https://researcher.watson.ibm.com/researcher/view_person_subpage.php?id=4427 sunrpc.tcp_max_slot_table_entries = 65536 sunrpc.tcp_slot_table_entries = 2 Prentice, it wouldn't
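To compare these sunrpc settings across nodes, a small parser for `key = value` sysctl output can help. This is only a sketch: the function name is hypothetical, and the sample text simply reuses the two values quoted in the message above.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines, as printed by sysctl, into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            settings[key.strip()] = value.strip()
    return settings


# Values as quoted in the message above.
sample = """sunrpc.tcp_max_slot_table_entries = 65536
sunrpc.tcp_slot_table_entries = 2"""

settings = parse_sysctl(sample)
```

Running the same parser over the output collected from each node makes it straightforward to diff the resulting dicts and see whether any node deviates.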

Re: [Beowulf] Troubleshooting NFS stale file handles

2017-04-20 Thread John Hearns
Tim That reminds me of the issue I found with shared IPMI interfaces - the reserved IPMI port clashing with the sunrpc.min_resvport (or more exactly the range of Sun RPC ports overlapping with IPMI) That was a long time ago, and the min_resvport has been increased in modern kernels as far as I

Re: [Beowulf] Troubleshooting NFS stale file handles

2017-04-20 Thread Bogdan Costescu
On Wed, Apr 19, 2017 at 8:34 PM, Prentice Bisbal wrote: > My setup isn't nearly that complicated. Every node in this cluster has a > /local directory that is shared out to the other nodes in the cluster. The > other nodes automount this by remote directory as /l/hostname, where > "hostname" is the

Re: [Beowulf] Troubleshooting NFS stale file handles

2017-04-20 Thread Tim Cutts
I've seen, in the past, problems with fragmented packets being misinterpreted, resulting in stale NFS symptoms. In that case it was an Intel STL motherboard (we're talking 20 years ago here), which shared a NIC for management as well as the main interface. The fragmented packets got inappropria