If scanlocks is clean, that means it is not a dlm issue. Have you tried mounting with data=writeback? With drbd, a 1G write becomes a 2G write. In ordered mode, a journal checkpoint, which is done when relinquishing a write lock, will wait on the data flush. That could be the cause of the slowdown. Does drbd have any way to show how active it is at that time? If so, monitor that.
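One way to watch drbd activity is /proc/drbd; the ns (KiB sent to the peer) and oos (KiB out of sync) counters are the interesting ones here. A minimal sketch, run against a captured sample so the field extraction is clear — the version string and counter values in the here-doc are made up; on a live node pipe `cat /proc/drbd` into the same awk instead:

```shell
# Pull the ns and oos counters out of /proc/drbd-style output.
# The here-doc below is a fabricated sample for illustration only.
awk '/ ns:/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^(ns|oos):/) print $i
}' <<'EOF'
version: 8.3.7 (api:88/proto:86-91)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:123456 nr:0 dw:123456 dr:6789 al:10 bm:2 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
EOF
```

Sampling that in a loop while the dd/sync reproduction runs would show whether the stalls line up with bursts of replication traffic.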
BTW, read-only does not mean no cache coherency. It only means that userspace cannot write. The fs is fully cache-coherent at all times, so there is no performance advantage.

On 06/03/2010 03:12 AM, Andrew Robert Nicols wrote:
We're using a storage solution involving two SunFire X4500 servers using DRBD to replicate a 15TB partition across the network, with ocfs2 on top. We're sharing the partition from one server over NFS; the other has it mounted read-only at present. The DRBD backing store is software RAID 60 on 40 disks.

We've been seeing periodic issues whereby our NFS clients (Debian Lenny) are very slow to perform simple operations such as catting a 4-character file or performing an ls. This affects all of the NFS clients at the same time and typically lasts from a few seconds to maybe 2 minutes. Operation then continues as normal and service resumes. We've also seen this affecting the read-only server which has the ocfs2 partition mounted.

We've been having trouble finding the cause of the issues, but can reliably reproduce the failures as follows. On each host, check for cats taking longer than 1 second:

while true; do time cat /srv/healthchecks/smallfile > /dev/null; done 2>&1 | awk '/m[1-9]/ {print strftime(), $_}'

To actually reproduce the failure, we then run a dd on the filestore:

dd if=/dev/zero of=/srv/test/dd-test-`date +%s` bs=1M count=1000 && echo "Syncing" && time sync

At the time that the sync finishes, all of the NFS clients and the read-only server show that it took some time to return the cat of an unrelated file - usually the same amount of time it took to run the sync.

What's the best place to start looking for the cause of these hangs? I've attached the dmesg output, which includes some call traces for hung threads. I have stat_sysdir output too, though I suspect it's not so relevant. A scanlocks output doesn't reveal any busy locks that I can see (unless I'm not hitting it at the right time or misreading the output).

For the DRBD replication there's a pair of bonded GBit NICs dedicated to the job. The other two bonded GBit NICs in the boxes are being used for NFS and o2net/ocfs2 communication. We don't believe that the network is at fault.
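(The awk pattern in the monitor above matches any `time` line whose value contains m1 through m9, i.e. a read taking roughly a second or more. A hypothetical equivalent with an explicit threshold — the function name and the 1-second cutoff are made up for illustration:)

```shell
# Hypothetical helper: time one read of a file and report it only
# when it takes a second or more (mirrors the awk /m[1-9]/ filter).
slowread() {
    t0=$(date +%s)
    cat "$1" > /dev/null
    t1=$(date +%s)
    elapsed=$((t1 - t0))
    if [ "$elapsed" -ge 1 ]; then
        echo "$(date): read of $1 took ${elapsed}s"
    fi
}

# e.g.: while true; do slowread /srv/healthchecks/smallfile; done
```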
We're using Debian Lenny with the stock AMD64 kernel from the Lenny repository - 2.6.26-2-amd64. We're using ocfs2 tools version 1.4.4, which we have packaged for Debian ourselves. The ocfs2 version reported in /sys/module/ocfs2/version is 1.5.0. Here's the current o2cb configuration:

r...@thumper5:/srv/test# /etc/init.d/o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster thumperpool: Online
Heartbeat dead threshold = 61
Network idle timeout: 60000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active

We also tried a heartbeat dead threshold of 31 with a network idle timeout of 30000, to the same effect.

Any assistance would be very much appreciated,

Andrew

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
