Right, this appears (unsurprisingly) to be a mutex contention issue, on
the file-specific mutex.

See the attached trace: server-kirby-v3.dmsg

Key parts are:

[ 5845.725268] lockd: request from 192.168.1.210, port=860
[ 5845.725272] lockd: LOCK          called
[ 5845.725274] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, 
name=romita)
[ 5845.725504] lockd: get host romita
[ 5845.725506] lockd: found host in cache
[ 5845.725507] lockd: nsm_monitor(romita)
[ 5845.725509] lockd: nlm_file_lookup (01070001 00288001 00000000 926e57da 
d142d9c6 dabb48bd c2a30bcf 0028c75d)
[ 5845.725789] lockd: found file f7aa2840 (count 0)
[ 5845.725792] lockd: nlmsvc_lock(sda1/2672477, ty=0, pi=80357, 
1073741826-1073742335, bl=0)
[ 5845.725806] lockd: nlmsvc_lookup_block f=f7aa2840 pd=80357 
1073741826-1073742335 ty=0
[ 5845.725809] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, 
name=romita)
[ 5845.726186] lockd: host garbage collection
[ 5845.726188] lockd: nlmsvc_mark_resources
[ 5845.726189] lockd: nlm_traverse_files
[ 5845.726255] lockd: mutex acquired, checking 128 file hash entries
[ 5845.726257] lockd: got entry in list 29
[ 5845.726259] lockd: inspecting file f=f7afd480
[ 5845.726260] lockd: traverse blocks
[ 5845.726262] lockd: locking file mutex
[ 5845.726483] lockd: unlocking file mutex
[ 5845.726485] lockd: traverse shares
[ 5845.726486] lockd: traverse locks
[ 5845.726488] lockd: inspection complete
[ 5845.726625] lockd: check file for release
...
(Same pattern repeated for several other files)
...
[ 5845.728644] lockd: got entry in list 58
[ 5845.728645] lockd: inspecting file f=f7aa2840
[ 5845.728646] lockd: traverse blocks
[ 5845.728648] lockd: locking file mutex
...
The final debug is from nlmsvc_traverse_blocks() in 
/usr/src/linux/fs/lockd/svclock.c

        dprintk ("lockd: locking file mutex\n");
        mutex_lock(&file->f_mutex);
        list_for_each_entry_safe(block, next, &file->f_blocks, b_flist) {
                dprintk ("lockd: trying block for host %p\n", host)
                ...
        }
        dprintk ("lockd: unlocking file mutex\n");

And it's clear now that we're calling the mutex_lock and never leaving.

The important note is that all the previous file checks worked.  Why is there a 
mutex 
already taken on only this file?  Well, note that this is the file from the 
request that 
actually triggered the GC.  Presumably there's a mutex taken out for this file, 
then
we run the GC, and we attempt to re-take out the mutex.  I'll trawl the code 
and 
confirm this.

If so, the fix is probably to move the call to the GC so that it's outside the 
handling
for the actual RPC call.  In fact, the mutex isn't strictly required for the GC 
because 
in this case we're only counting host references.  But it looks like we're 
doing our 
reference count by piggybacking on some other code which actually does sweeps 
of 
file locks, so we can't just remove the mutexes.

** Attachment added: "server-kirby-v3.dmsg"
   http://launchpadlibrarian.net/11456002/server-kirby-v3.dmsg

-- 
NFS server: lockd: server not responding
https://bugs.launchpad.net/bugs/181996
You received this bug notification because you are a member of Ubuntu
Bugs, which is the bug contact for Ubuntu.

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to