Hello list,
(please CC me, I am not subscribed here)
I have a problem with one of our NFS exports that is hard to debug
and to reproduce.
Setup (simplified):
- Two powerful VMware ESX hosts, cross-connected with 4 x 1 GBit/s
links
- One of these links is reserved exclusively for NFS traffic
- An NFS server, which stores its data on a DRBD volume (Wheezy, NFSv4,
RPCNFSDCOUNT=50)
- The problematic client (one of about two dozen clients) runs Debian
Squeeze. In this case the NFS server and the client are on the same
dedicated host machine
- Client mounts the share with:
192.168.55.31:/var /srv/nfs/magento_var nfs _netdev,auto,soft,intr,rw,noatime,nodiratime
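To make the mount options easier to discuss, here is a minimal Python sketch that splits the fstab entry above into its fields and reports which timeout-related options are set explicitly. The helper name parse_fstab_line is hypothetical; the same parsing should also work on a line taken from /proc/mounts, which shows the options actually in effect.

```python
def parse_fstab_line(line):
    """Split one fstab-style line into device, mount point, type and options."""
    fields = line.split()
    device, mountpoint, fstype, options = fields[:4]
    opts = {}
    for item in options.split(","):
        # Options are either bare flags ("soft") or key=value pairs ("timeo=600").
        key, _, value = item.partition("=")
        opts[key] = value or True
    return {
        "device": device,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": opts,
    }


entry = parse_fstab_line(
    "192.168.55.31:/var /srv/nfs/magento_var nfs "
    "_netdev,auto,soft,intr,rw,noatime,nodiratime"
)
opts = entry["options"]
# "soft" is set, but timeo= and retrans= are not given in this entry,
# so the client defaults apply for both.
explicit_timeouts = [key for key in ("timeo", "retrans") if key in opts]
print(entry["mountpoint"], "soft" in opts, explicit_timeouts)
```

For the entry above this reports that soft is set and that neither timeo nor retrans is given explicitly.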
So what happens?
Sometimes, when the application flushes its cache, the process hangs for
hours. While it hangs, neither the NFS mount nor the process itself can
be accessed (for example, strace -p XXXX blocks and has to be killed
with kill -9).
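A hang like this would be consistent with the process sitting in uninterruptible sleep (state D), although I have not confirmed that here. As a minimal sketch, the state can be read from /proc/<pid>/status without attaching to the process. The helper names and the sample text (process name "php", PID 1234) are hypothetical and only serve as illustration.

```python
import re


def process_state(status_text):
    """Return the one-letter state code from the text of /proc/<pid>/status."""
    match = re.search(r"^State:\s+(\S)", status_text, re.MULTILINE)
    return match.group(1) if match else None


def read_state(pid):
    """Read the state of a live process; only works on Linux with /proc mounted."""
    with open("/proc/%d/status" % pid) as handle:
        return process_state(handle.read())


# Hypothetical excerpt of a status file, not taken from the affected machine.
sample = "Name:\tphp\nState:\tD (disk sleep)\nTgid:\t1234\n"
print(process_state(sample))
```

On the affected client one would call read_state() with the PID of the hanging process; a result of "D" means the process is blocked inside the kernel.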
Running the script under strace repeatedly until the hang occurs, I see
that it hangs on this operation:
stat("/srv/nfs/magento_var/cache/mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_", {st_mode=S_IFREG|0600, st_size=112, ...}) = 0
unlink("/srv/nfs/magento_var/cache/mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_"
On the client, with sunrpc.nfs_debug=1023 set, I see this:
[1545947.053884] NFS: nfs_update_inode(0:11/5243170 ct=1 info=0x27e7f)
[1545947.053886] NFS: nfs_lookup_revalidate(mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_) is valid
[1545947.053890] NFS: dentry_delete(mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_, 8)
[1545947.054006] NFS: permission(0:11/3276803), mask=0x1, res=0
[1545947.054009] NFS: nfs_lookup_revalidate(var/cache) is valid
[1545947.054011] NFS: permission(0:11/3278035), mask=0x1, res=0
[1545947.054012] NFS: nfs_lookup_revalidate(cache/mage--a) is valid
[1545947.054014] NFS: permission(0:11/5243102), mask=0x1, res=0
[1545947.054017] NFS: permission(0:11/5243102), mask=0x1, res=0
[1545947.054019] NFS: nfs_lookup_revalidate(mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_) is valid
[1545947.054021] NFS: permission(0:11/5243102), mask=0x3, res=0
[1545947.054023] NFS: unlink(0:11/5243102, mage---alphanumericZend_LocaleC_de_DE_currencynumber_)
[1545947.054025] NFS: safe_remove(mage--a/mage---alphanumericZend_LocaleC_de_DE_currencynumber_)
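To find the last client-side NFS call before a hang in a longer log, a small parser can help. This is only a sketch: it assumes each debug message sits on a single line (wrapped lines rejoined) and has the "[timestamp] NFS: function(...)" shape seen above. The helper name last_nfs_call is hypothetical.

```python
import re

# Matches "[1545947.054025] NFS: safe_remove(" and captures timestamp and function.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+NFS:\s+(\w+)\(")


def last_nfs_call(log_text):
    """Return (timestamp, function_name) of the last NFS debug line, or None."""
    last = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            last = (float(match.group(1)), match.group(2))
    return last


# The last three messages from the log above, one message per line.
sample = "\n".join([
    "[1545947.054021] NFS: permission(0:11/5243102), mask=0x3, res=0",
    "[1545947.054023] NFS: unlink(0:11/5243102, "
    "mage---alphanumericZend_LocaleC_de_DE_currencynumber_)",
    "[1545947.054025] NFS: safe_remove(mage--a/"
    "mage---alphanumericZend_LocaleC_de_DE_currencynumber_)",
])
timestamp, function_name = last_nfs_call(sample)
print(function_name)
```

For the excerpt above the last logged call is safe_remove, which matches the unlink() seen in strace.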
So the hang seems to occur occasionally while the script is removing
files.
Curiously, other clients (also Squeeze) have no problems accessing the
data on the same share at the same time. There is no noticeable load
(network, CPU, HDD, HDD latency, DRBD, NFS, etc.) on the machines. Both
packet filters ACCEPT all incoming and outgoing traffic on this
dedicated interface. There are no errors, drops or overruns on this NIC.
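For anyone who wants to repeat the NIC check, the counters can also be read from /proc/net/dev. The following is a minimal sketch with a hypothetical helper name; the sample text and the interface name "eth1" are made up for illustration and are not taken from the affected machines. The fifo columns are the ones that usually show up as overruns.

```python
def nic_error_counters(proc_net_dev_text, interface):
    """Extract error, drop and fifo counters for one interface from /proc/net/dev."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        # The two header lines contain no colon and are skipped here.
        if not sep or name.strip() != interface:
            continue
        values = [int(value) for value in rest.split()]
        # Columns 0-7 are receive counters, columns 8-15 are transmit counters.
        return {
            "rx_errs": values[2], "rx_drop": values[3], "rx_fifo": values[4],
            "tx_errs": values[10], "tx_drop": values[11], "tx_fifo": values[12],
        }
    return None


# Hypothetical sample in the /proc/net/dev layout.
sample = "\n".join([
    "Inter-|   Receive                                                |  Transmit",
    " face |bytes    packets errs drop fifo frame compressed multicast"
    "|bytes    packets errs drop fifo colls carrier compressed",
    "  eth1: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0",
])
counters = nic_error_counters(sample, "eth1")
print(counters)
```

On a real system the text would come from reading /proc/net/dev, and any non-zero value in the result would be worth a closer look.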
Do you have an idea how to debug this better, or what the underlying
problem could be?
--
/*
Mit freundlichem Gruß / With kind regards,
Patrick Matthäi
GNU/Linux Debian Developer
Blog: http://www.linux-dev.org/
E-Mail: pmatth...@debian.org
patr...@linux-dev.org
*/
--
Archive: http://lists.debian.org/52a1d6c7.7030...@debian.org