Right, this appears (unsurprisingly) to be a mutex contention issue, on the file-specific mutex.
See the attached trace: server-kirby-v3.dmsg Key parts are: [ 5845.725268] lockd: request from 192.168.1.210, port=860 [ 5845.725272] lockd: LOCK called [ 5845.725274] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, name=romita) [ 5845.725504] lockd: get host romita [ 5845.725506] lockd: found host in cache [ 5845.725507] lockd: nsm_monitor(romita) [ 5845.725509] lockd: nlm_file_lookup (01070001 00288001 00000000 926e57da d142d9c6 dabb48bd c2a30bcf 0028c75d) [ 5845.725789] lockd: found file f7aa2840 (count 0) [ 5845.725792] lockd: nlmsvc_lock(sda1/2672477, ty=0, pi=80357, 1073741826-1073742335, bl=0) [ 5845.725806] lockd: nlmsvc_lookup_block f=f7aa2840 pd=80357 1073741826-1073742335 ty=0 [ 5845.725809] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, name=romita) [ 5845.726186] lockd: host garbage collection [ 5845.726188] lockd: nlmsvc_mark_resources [ 5845.726189] lockd: nlm_traverse_files [ 5845.726255] lockd: mutex acquired, checking 128 file hash entries [ 5845.726257] lockd: got entry in list 29 [ 5845.726259] lockd: inspecting file f=f7afd480 [ 5845.726260] lockd: traverse blocks [ 5845.726262] lockd: locking file mutex [ 5845.726483] lockd: unlocking file mutex [ 5845.726485] lockd: traverse shares [ 5845.726486] lockd: traverse locks [ 5845.726488] lockd: inspection complete [ 5845.726625] lockd: check file for release ... (Same pattern repeated for several other files) ... [ 5845.728644] lockd: got entry in list 58 [ 5845.728645] lockd: inspecting file f=f7aa2840 [ 5845.728646] lockd: traverse blocks [ 5845.728648] lockd: locking file mutex ... The final debug is from nlmsvc_traverse_blocks() in /usr/src/linux/fs/lockd/svclock.c dprintk ("lockd: locking file mutex\n"); mutex_lock(&file->f_mutex); list_for_each_entry_safe(block, next, &file->f_blocks, b_flist) { dprintk ("lockd: trying block for host %p\n", host) ... } dprintk ("lockd: unlocking file mutex\n"); And it's clear now that we're calling the mutex_lock and never leaving. The important note is that all the previous file checks worked. Why is there a mutex already taken on only this file? Well, note that this is the file from the request that actually triggered the GC. Presumably there's a mutex taken out for this file, then we run the GC, and we attempt to re-take out the mutex. I'll trawl the code and confirm this. If so, the fix is probably to move the call to the GC so that it's outside the handling for the actual RPC call. In fact, the mutex isn't strictly required for the GC because in this case we're only counting host references. But it looks like we're doing our reference count by piggybacking on some other code which actually does sweeps of file locks, so we can't just remove the mutexes. ** Attachment added: "server-kirby-v3.dmsg" http://launchpadlibrarian.net/11456002/server-kirby-v3.dmsg -- NFS server: lockd: server not responding https://bugs.launchpad.net/bugs/181996 You received this bug notification because you are a member of Ubuntu Bugs, which is the bug contact for Ubuntu. -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs