Hi all. This is my first post to the mailing list, so I apologize in advance if I omit relevant configuration or setup details. Please let me know what additional information is needed and I'll gladly supply it.
We're running a 3-node OCFS2 1.2.9 cluster with a 5-TB iSCSI block device as
the backing store. All machines are running CentOS, with the iSCSI target
running CentOS 5.2 and the initiators running CentOS 4.7. The purpose of the
cluster is to evaluate alternatives to our current solution for replicating
audio files which are generated from multiple PBX servers running Asterisk.
We currently use Unison for file-level replication to and from a dedicated
machine such that there are multiple copies of the audio tree--one per PBX
server. This allows us to quickly and easily move customers among our servers
for load-balancing and disaster recovery purposes. Unfortunately, we're
encountering scalability problems with the Unison-based approach, such as
conflicts and slow propagation times.
The hope was that moving to a clustered filesystem would improve propagation
time, reduce conflicts, and allow us to scale more effectively. I chose OCFS2
because it seemed the simplest solution architecturally and because of its
certification by Oracle for use with the database product. (My thought was
that Oracle's certification requirements would likely supersede those of a
general-purpose filesystem, though please correct me if this was naïve or
misguided.)
Having said all that, this morning around 7:00am EDT we began seeing
OCFS2-related errors in the syslog of one of our servers. Specifically:
--
May 15 07:08:00 cam-c6 kernel: o2net: no longer connected to node cam-p1 (num
1) at 10.10.89.110:7777
May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_broadcast_vote:731 ERROR: status
= -112
May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_do_request_vote:804 ERROR:
status = -112
May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_rename:1207 ERROR: status = -112
May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_broadcast_vote:731 ERROR: status
= -107
May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_do_request_vote:804 ERROR:
status = -107
May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_rename:1103 ERROR: status = -107
[last message repeated many times]
May 15 07:08:30 cam-c6 kernel: (4335,0):o2net_connect_expired:1585 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
...
May 15 09:22:29 cam-c6 kernel: (4335,0):o2net_connect_expired:1585 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
--
This continued until 9:22am EDT, at which point one of our engineers manually
rebooted the machine after Asterisk began complaining of read/write errors in
its voicemail tree.
I was surprised OCFS2 didn't panic the kernel and automatically reboot the
machine after the 30-second timeout. I thought this was the default behavior;
in fact, I forced this exact condition during preliminary testing by manually
stopping the iSCSI daemon. Instead, the kernel complained for over two hours
until someone manually rebooted the machine, at which point the cluster
reconnected and resumed operation. Is this expected?
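For completeness, here are the o2cb timeout settings we're running with, which
I believe are the 1.2 defaults (we haven't tuned anything; the 30.0-second
figure in the log appears to match O2CB_IDLE_TIMEOUT_MS). From
'/etc/sysconfig/o2cb' on the nodes:

```shell
# /etc/sysconfig/o2cb -- values are what I believe to be the 1.2 defaults;
# we have not tuned these.
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=31      # disk heartbeat: (31 - 1) * 2 = 60 seconds
O2CB_IDLE_TIMEOUT_MS=30000       # o2net idle timeout (the "30.0 seconds" in the log?)
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
```

Please correct me if I've misread what these control or what their defaults are.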
According to the relevant switch (a managed Cisco), there was no interruption
in network connectivity between these two machines. Neither server logged
anything related to a network link failure, so the only real information I have
is from OCFS2. Frankly, I'm not sure how to proceed from here, but I obviously
want to address the reliability concerns this problem raises, since we're
considering OCFS2 as a replacement for our existing solution throughout our
datacenters.
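For what it's worth, the o2net port between the nodes is reachable right now;
I checked with a simple TCP probe from cam-c6 (assuming 'nc' behaves the same
on your systems):

```shell
# Probe the peer's o2net listener (port 7777, per cluster.conf) from cam-c6.
# -z: only test that the connection opens, -w 5: five-second timeout.
nc -z -w 5 10.10.89.110 7777 && echo "o2net port open" || echo "o2net port closed"
```

Of course, that only tells me about connectivity after the reboot, not during
the two-hour window.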
I tried to map the numerical error codes -112 and -107 to specific problems
based on the source ('tcp.c' and 'vote.c' in particular), but I was unsuccessful.
In general, I suppose I'm curious if anyone has high-level feedback on the
planned use of OCFS2 in this scenario. Am I overcomplicating things? Assuming
the pilot works, we do plan to roll out a dedicated storage network which will
include redundant switching, NICs, iSCSI targets with multiple paths to the
physical storage, etc. I just need to validate the basic approach at present.
Thanks in advance for any information you can provide. I've attached our
'cluster.conf' file to this message. At present, only nodes 0, 1, and 7 are
connected to the cluster. I included the other nodes in the config file so we
could easily add them if we confirmed reliable operation through the pilot. In
this configuration, 'cam-s1' is the iSCSI target while 'cam-p1' and 'cam-c6'
are the connected nodes in the cluster. Here is output from 'df',
'mounted.ocfs2', and 'iscsi-ls':
[r...@cam-c6 ~]# df -H
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
294G 66G 213G 24% /
/dev/sda1 104M 22M 78M 22% /boot
none 7.5G 0 7.5G 0% /dev/shm
none 8.6G 0 8.6G 0% /mnt/ramdisk
/dev/sdc1 5.0T 82G 4.9T 2% /store1
[r...@cam-c6 ~]# mounted.ocfs2 -d
Device FS UUID Label
/dev/sdc1 ocfs2 52415cf6-22e8-4a2c-a090-0f0448366e63 store1
[r...@cam-c6 ~]# iscsi-ls
*******************************************************************************
SFNet iSCSI Driver Version ...4:0.1.11-7(14-Apr-2008)
*******************************************************************************
TARGET NAME : iqn.2009-01.com.thinkingphones:iscsi-tgt1:store1
TARGET ALIAS :
HOST ID : 3
BUS ID : 0
TARGET ID : 0
TARGET ADDRESS : 10.10.89.105:3260,1
SESSION STATUS : ESTABLISHED AT Fri May 15 09:28:05 EDT 2009
SESSION ID : ISID 00023d000001 TSIH f00
*******************************************************************************
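In case the attachment gets stripped by the list software, the node stanzas in
our cluster.conf follow the standard o2cb layout. For example, the entry for
cam-p1 looks roughly like this (the address is taken from the log above; the
node_count value shown is illustrative, the real numbers are in the attached
file):

```
node:
        ip_port = 7777
        ip_address = 10.10.89.110
        name = cam-p1
        cluster = ocfs2
        number = 1

cluster:
        node_count = 8
        name = ocfs2
```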
Regards,
Damon
cluster.conf
Description: cluster.conf
_______________________________________________ Ocfs2-users mailing list [email protected] http://oss.oracle.com/mailman/listinfo/ocfs2-users
