Hello List,

recently, I've looked into some dangling lock problems we've had after a partial power loss. Here's my analysis of what happens:

- A user application on a compute node requests a lock for a file on an NFS-mounted file system;
- the NFS server grants the lock;
- a partial power loss (just one phase affected for a few ms) causes the compute node to reboot, whereas the server keeps running;
- if the compute node is stateful, it will look through the entries in /var/lib/nfs/sm (the "monitor list") to discover from which server(s) it had mounted NFS shares, and send each of them an NSM notify message;
- notified servers drop the locks held by the affected compute node.
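
The steps above can be sketched as a small Python simulation. It is only a toy model under stated assumptions, not the real NSM/NLM implementation: the `LockServer` class, the `simulate` function, the host name `nfs1`, the node name `node17` and the lock path are all hypothetical, and a temporary directory stands in for /var/lib/nfs/sm (assuming, as statd appears to do, one entry per monitored peer named after that peer). It also models a node whose monitor list does not survive the reboot.

```python
import tempfile
from pathlib import Path


def hosts_to_notify(monitor_dir):
    """Return the peer names found in the monitor list directory."""
    d = Path(monitor_dir)
    if not d.is_dir():
        return []
    return sorted(p.name for p in d.iterdir() if p.is_file())


class LockServer:
    """Toy stand-in for an NFS server's lock table (hypothetical)."""

    def __init__(self):
        self.locks = {}  # file path -> name of the client holding the lock

    def grant(self, client, path):
        """Grant the lock unless another client already holds it."""
        if path in self.locks:
            return False
        self.locks[path] = client
        return True

    def on_notify(self, client):
        """A reboot notification from `client` releases all of its locks."""
        self.locks = {p: c for p, c in self.locks.items() if c != client}


def simulate(persistent_monitor_list):
    """Run lock -> reboot -> notify; return the locks left on the server."""
    server = LockServer()
    servers = {"nfs1": server}  # hypothetical server name
    with tempfile.TemporaryDirectory() as tmp:
        monitor_dir = Path(tmp)  # stands in for /var/lib/nfs/sm

        # The node takes a lock and records the server in its monitor list.
        server.grant("node17", "/data/run.lock")
        (monitor_dir / "nfs1").touch()

        # Power glitch: the node reboots. Without persistent storage,
        # the monitor list is empty after the reboot.
        if not persistent_monitor_list:
            for entry in monitor_dir.iterdir():
                entry.unlink()

        # After the reboot, the node notifies every server on its list.
        for host in hosts_to_notify(monitor_dir):
            servers[host].on_notify("node17")
    return dict(server.locks)


if __name__ == "__main__":
    print("stateful node:", simulate(True))
    print("node with empty monitor list:", simulate(False))
```

In this model the lock disappears only when the monitor list survives the reboot; with an empty list no notification is sent and the server keeps the lock.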

However, this does not work for diskless compute nodes: upon reboot, their monitor list will be empty, so no notify messages are sent and the locks are left dangling on the server.

One could work around the problem by triggering a round of notify messages from the server, so that every node that didn't reboot re-requests its pertinent locks and the server drops all others. However, a more automatic solution would be nice, especially when more than one or two NFS servers are involved.

How do you deal with this?

Thanks,

A.

--
Ansgar Esztermann
Sysadmin
Dep. Theoretical and Computational Biophysics
http://www.mpibpc.mpg.de/grubmueller/esztermann
_______________________________________________ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf