Hi Prentice, three questions (not necessarily for you, and they can be dealt with in a different thread too):
- Why automount and not a static mount?
- Do I get that right that the nodes themselves export shares to other nodes?
- Has anything changed? I am thinking of something like more nodes added, new programs being installed, more users added - generally a higher load on the cluster.

One problem I had in the past with my 112-node cluster, where I am exporting /home, /opt and one directory in /usr/local from the headnode to all the nodes, was that the NFS server on the headnode did not have enough spare servers assigned and thus was running out of capacity. That also led to strange behaviour, which I fixed by increasing the number of spare servers. The way I have done that was by setting this in /etc/default/nfs-kernel-server:

  # Number of servers to start up
  RPCNFSDCOUNT=32

That seems to provide the right number of servers and spare ones for me. (A quick way to check whether the existing threads are saturated is sketched after my signature below.)

Like in your case, the cluster was running stable until I added more nodes *and* users decided to use them, i.e. the load on the cluster went up. A more idle cluster did not show any problems; a cluster under 80 % load suddenly had problems.

I hope that helps a bit. I am not an NFS expert either and this is just my experience. I am also using Debian's nfs-kernel-server 1:1.2.6-4, if that helps.

All the best from a sunny London

Jörg

On Wednesday 19 April 2017 Prentice Bisbal wrote:
> On 04/19/2017 02:17 PM, Ellis H. Wilson III wrote:
> > On 04/19/2017 02:11 PM, Prentice Bisbal wrote:
> >> Thanks for the suggestion(s). Just this morning I started considering
> >> the network as a possible source of error. My stale file handle errors
> >> are easily fixed by just restarting the NFS servers with 'service nfs
> >> restart', so they aren't as severe as you describe.
> >
> > If a restart on solely the /server-side/ gets you back into a good
> > state, this is an interesting tidbit.
>
> That is correct, restarting NFS on the server side is all it takes to
> fix the problem.
>
> > Do you have some form of HA setup for NFS? Automatic failover
> > (sometimes set up with IP aliasing) in the face of network hiccups can
> > occasionally goof the clients if they aren't set up properly to keep up
> > with the change. A restart of the server will likely revert back to
> > using the primary, resulting in the clients thinking everything is
> > back up and healthy again. This situation varies so much between
> > vendors it's hard to say much more without more details on your setup.
>
> My setup isn't nearly that complicated. Every node in this cluster has a
> /local directory that is shared out to the other nodes in the cluster.
> The other nodes automount this remote directory as /l/hostname, where
> "hostname" is the name of the owner of the filesystem. For example, hostB
> will mount hostA:/local as /l/hostA.
>
> No fancy fail-over or anything like that.
>
> > Best,
> >
> > ellis
> >
> > P.S., apologies for the top-post last time around.
>
> No worries. I'm so used to people doing that in mailing lists that I've
> become numb to it.
>
> Prentice

--
*************************************************************
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
20 Gordon Street
London WC1H 0AJ

email: j.sassmannshau...@ucl.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
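P.S.: A minimal sketch of checking whether the nfsd threads are saturated and of raising the count. The file and service names below assume the stock Debian nfs-kernel-server packaging; on Red Hat style systems the restart would be 'service nfs restart', as in your case above.

  # Show the running nfsd threads: the first number on the "th" line is the
  # thread count (older kernels also report how often all threads were busy).
  grep ^th /proc/net/rpc/nfsd

  # Raise the thread count on the fly (needs root, does not survive a reboot).
  rpc.nfsd 32

  # Make it persistent: set RPCNFSDCOUNT=32 in /etc/default/nfs-kernel-server,
  # then restart the server.
  service nfs-kernel-server restart

If the thread count already matches RPCNFSDCOUNT and the problem persists under load, the bottleneck is probably elsewhere.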
_______________________________________________ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf