If scanlocks is clean, that means it is not a dlm issue. Have you tried mounting with data=writeback? With drbd, a 1G write becomes a 2G write. In ordered mode, a journal checkpoint, which is done when relinquishing a write lock, will wait on the data flush. That could be the cause of the slowdown. Does drbd have any way to show how active it is at that time? If so, monitor that.
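One way to watch drbd activity is /proc/drbd; the ns (KiB sent to the peer) and oos (KiB out of sync) counters are the interesting ones here. A minimal sketch, run against a captured sample so the field extraction is clear — the version string and counter values in the here-doc are made up; on a live node pipe `cat /proc/drbd` into the same awk instead:

```shell
# Pull the ns and oos counters out of /proc/drbd-style output.
# The here-doc below is a fabricated sample for illustration only.
awk '/ ns:/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^(ns|oos):/) print $i
}' <<'EOF'
version: 8.3.7 (api:88/proto:86-91)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:123456 nr:0 dw:123456 dr:6789 al:10 bm:2 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
EOF
```

Sampling that in a loop while the dd/sync reproduction runs would show whether the stalls line up with bursts of replication traffic.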
BTW, read-only does not mean no cache coherency. It only means that userspace cannot write. The fs is fully cache-coherent at all times, so there is no performance advantage.

On 06/03/2010 03:12 AM, Andrew Robert Nicols wrote:
We're using a storage solution involving two SunFire X4500 servers using DRBD to replicate a 15TB partition across the network, with ocfs2 on top. We're sharing the partition from one server over NFS; the other has it mounted read-only at present. The DRBD backing store is software RAID 60 on 40 disks.

We've been seeing periodic issues whereby our NFS clients (Debian Lenny) are very slow to perform simple operations such as catting a 4-character file or performing an ls. This affects all of the NFS clients at the same time and typically lasts from a few seconds to maybe 2 minutes. Operation then continues as normal and service resumes. We've also seen this affecting the read-only server which has the ocfs2 partition mounted.

We've been having trouble finding the cause of the issues, but can reliably reproduce the failures as follows. On each host, check for cats taking longer than 1 second:

while true; do time cat /srv/healthchecks/smallfile > /dev/null; done 2>&1 | awk '/m[1-9]/ {print strftime(), $_}'

To actually reproduce the failure, we then run a dd on the filestore:

dd if=/dev/zero of=/srv/test/dd-test-`date +%s` bs=1M count=1000 && echo "Syncing" && time sync

At the time that the sync finishes, all of the NFS clients and the read-only server show that it took some time to return the cat of an unrelated file - usually the same amount of time it took to run the sync.

What's the best place to start looking for the cause of these hangs? I've attached the dmesg output, which includes some call traces for hung threads. I have stat_sysdir output too, though I suspect it's not so relevant. A scanlocks output doesn't reveal any busy locks that I can see (unless I'm not hitting it at the right time or misreading the output).

For the DRBD replication there's a pair of bonded GBit NICs dedicated to the job. The other two bonded GBit NICs in the boxes are being used for NFS and o2net/ocfs2 communication. We don't believe that the network is at fault.
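(The awk pattern in the monitor above matches any `time` line whose value contains m1 through m9, i.e. a read taking roughly a second or more. A hypothetical equivalent with an explicit threshold — the function name and the 1-second cutoff are made up for illustration:)

```shell
# Hypothetical helper: time one read of a file and report it only
# when it takes a second or more (mirrors the awk /m[1-9]/ filter).
slowread() {
    t0=$(date +%s)
    cat "$1" > /dev/null
    t1=$(date +%s)
    elapsed=$((t1 - t0))
    if [ "$elapsed" -ge 1 ]; then
        echo "$(date): read of $1 took ${elapsed}s"
    fi
}

# e.g.: while true; do slowread /srv/healthchecks/smallfile; done
```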
We're using Debian Lenny with the stock AMD64 kernel from the Lenny repository - 2.6.26-2-amd64. We're using ocfs2 tools version 1.4.4, which we have packaged for Debian ourselves. The ocfs2 version reported in /sys/module/ocfs2/version is 1.5.0. Here's the current o2cb configuration:

r...@thumper5:/srv/test# /etc/init.d/o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster thumperpool: Online
Heartbeat dead threshold = 61
Network idle timeout: 60000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active

We also tried a heartbeat dead threshold of 31 with a network idle timeout of 30000, to the same effect.

Any assistance would be very much appreciated,

Andrew

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
