Hello List,

Recently, I've looked into some dangling lock problems we've had after a
partial power loss.
Here's my analysis of what happens:
-a user application on a compute node requests a lock for a file on an
 NFS-mounted file system;
-the NFS server grants the lock;
-a partial power loss (just one phase affected for a few ms) causes
 the compute node to reboot, whereas the server keeps running;
-if the compute node is stateful, it will look through the entries in
 /var/lib/nfs/sm (the "monitor list") to discover from which server(s)
 it had mounted NFS shares, and send each of them an NSM notify message;
-notified servers drop locks from the affected compute node.
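
As a rough illustration of the fourth step, this is how one could look at
the monitor list from Python. It is only a sketch: it assumes the usual
Linux layout with one file per monitored peer in /var/lib/nfs/sm, the
function name is made up, and the real notification at boot is done by the
NFS utilities, not by a script like this.

```python
import os


def monitored_peers(sm_dir="/var/lib/nfs/sm"):
    """Return the peers named in the monitor list, sorted by name.

    Assumption: one regular file per monitored peer, named after the peer.
    A missing directory is treated like an empty monitor list.
    """
    try:
        entries = os.listdir(sm_dir)
    except FileNotFoundError:
        return []
    # Ignore subdirectories; only plain files count as entries.
    return sorted(
        name for name in entries
        if os.path.isfile(os.path.join(sm_dir, name))
    )
```

Calling monitored_peers() on a stateful node (with sufficient permissions to
read the directory) should list the servers that would be notified after a
reboot.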

However, this does not work for diskless compute nodes: upon reboot,
their monitor list is empty, so no notify messages are sent and the
server keeps the dangling locks.
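
To make the difference concrete, here is a toy model in Python. It is purely
illustrative: the class names, host names and paths are invented, and it
says nothing about how lockd/statd are actually implemented. It only mimics
the bookkeeping described above.

```python
class Server:
    """Toy NFS server: remembers which client holds a lock on which file."""

    def __init__(self, name):
        self.name = name
        self.locks = {}  # file path -> client name

    def grant(self, client, path):
        self.locks[path] = client

    def notify_reboot(self, client):
        # Drop every lock held by the client that reports a reboot.
        self.locks = {p: c for p, c in self.locks.items() if c != client}


class Client:
    """Toy compute node with a monitor list of servers it holds locks on."""

    def __init__(self, name, diskless):
        self.name = name
        self.diskless = diskless
        self.monitor_list = set()

    def lock(self, server, path):
        server.grant(self.name, path)
        self.monitor_list.add(server)

    def reboot(self):
        if self.diskless:
            # The monitor list did not survive the reboot.
            self.monitor_list = set()
        for server in self.monitor_list:
            server.notify_reboot(self.name)
        self.monitor_list = set()


server = Server("nfs1")
stateful = Client("node01", diskless=False)
diskless = Client("node02", diskless=True)
stateful.lock(server, "/data/a.lock")
diskless.lock(server, "/data/b.lock")

stateful.reboot()  # notifies nfs1, its lock is dropped
diskless.reboot()  # notifies nobody, its lock stays behind

print(server.locks)  # {'/data/b.lock': 'node02'}
```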

One could work around the problem by triggering a round of notify
messages from the server, so that all nodes that didn't reboot
re-request their locks and the server drops all others.
However, a more automatic solution would be nice, especially when more
than one or two NFS servers are involved.
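
The intended effect of such a round, repeated for several servers, can be
sketched as follows. Again this is just a toy in Python with made-up names
and data; it does not talk to any real daemon and only shows which locks
would survive.

```python
def notify_round(server_locks, live_claims):
    """Return a server's lock table after a server-initiated notify round.

    server_locks: dict path -> client, the table before the round
    live_claims:  dict client -> set of paths that client re-requests;
                  clients that rebooted have no entry
    """
    return {
        path: client
        for path, client in server_locks.items()
        if path in live_claims.get(client, set())
    }


# Hypothetical state: node02 rebooted, node01 did not.
tables = {
    "nfs1": {"/data/a.lock": "node01", "/data/b.lock": "node02"},
    "nfs2": {"/home/c.lock": "node02"},
}
claims = {
    "nfs1": {"node01": {"/data/a.lock"}},
    "nfs2": {},
}

after = {name: notify_round(tables[name], claims[name]) for name in tables}
print(after)  # {'nfs1': {'/data/a.lock': 'node01'}, 'nfs2': {}}
```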

How do you deal with this?

Thanks,

A.
-- 
Ansgar Esztermann
Sysadmin Dep. Theoretical and Computational Biophysics
http://www.mpibpc.mpg.de/grubmueller/esztermann


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
