https://issues.apache.org/bugzilla/show_bug.cgi?id=56828

            Bug ID: 56828
           Summary: Cluster setup stopped working after 3 months in
                    production
           Product: Tomcat 6
           Version: 6.0.39
          Hardware: Other
                OS: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: Cluster
          Assignee: dev@tomcat.apache.org
          Reporter: krishna.saran...@gmail.com

We have a J2EE WAR application deployed in a cluster setup with two nodes.
Tomcat 6.0.39 is installed on both nodes, with an identical WAR deployed on
each. The application runs in the Amazon AWS environment; the two EC2 nodes sit
behind an ELB with session stickiness enabled for JSESSIONID. Session
replication is also enabled between the two Tomcat nodes.
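For context, since the JvmRouteBinderValve and JvmRouteSessionIDBinderListener
(shown in the config below) rely on a per-node route, each node's Engine
normally carries its own jvmRoute; a minimal sketch with placeholder names, not
copied from our actual files:

 <Engine name="Catalina" defaultHost="localhost" jvmRoute="node1">
   <!-- on the second node the only difference would be jvmRoute="node2" -->
   ...
 </Engine>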

The following is the cluster configuration from the updated server.xml file:
=============================================================================
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster"
         channelSendOptions="6" channelStartOptions="3">

  <Manager className="org.apache.catalina.ha.session.DeltaManager"
           expireSessionsOnShutdown="false"
           notifyListenersOnReplication="true" />

  <Channel className="org.apache.catalina.tribes.group.GroupChannel">

    <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
              autoBind="0" selectorTimeout="5000" maxThreads="6"
              address="x.x.x.x" port="4444" />

    <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
      <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"
                 timeout="60000"
                 keepAliveTime="10"
                 keepAliveCount="0" />
    </Sender>

    <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpPingInterceptor"
                 staticOnly="true"/>
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
      <Member className="org.apache.catalina.tribes.membership.StaticMember"
              host="x.x.x.x"
              port="4444"
              uniqueId="{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4}"/>
    </Interceptor>

  </Channel>

  <Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter="" />
  <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve" />

  <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener"/>
  <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener"/>
</Cluster>

==========================================================================

The receiver IP, the static member IP, and the uniqueId differ in the
server.xml of the other node in the cluster, as illustrated below.
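
For illustration, the corresponding fragments on the second node look roughly
like this (the address and uniqueId shown here are placeholders; only the fact
that they differ from the first node's values matters):

 <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
           autoBind="0" selectorTimeout="5000" maxThreads="6"
           address="y.y.y.y" port="4444" />
 ...
 <Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
   <!-- points back at the first node, with that node's own uniqueId -->
   <Member className="org.apache.catalina.tribes.membership.StaticMember"
           host="x.x.x.x"
           port="4444"
           uniqueId="{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8}"/>
 </Interceptor>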

This ran fine in the production environment for 3 months. Then, suddenly, an
exception like the following was logged and kept recurring indefinitely:


==================================================
Aug 6, 2014 12:00:39 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://10.160.40.12:4444,10.160.40.12,4444, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 }, payload={}, command={}, domain={}, ]] message. Will verify.
Aug 6, 2014 12:00:39 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Verification complete. Member still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://10.160.40.12:4444,10.160.40.12,4444, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 }, payload={}, command={}, domain={}, ]]
Aug 6, 2014 12:00:39 AM org.apache.catalina.ha.tcp.SimpleTcpCluster send
SEVERE: Unable to send message through cluster sender.
org.apache.catalina.tribes.ChannelException: Operation has timed out(60000 ms.).; Faulty members:tcp://10.160.40.12:4444;
        at org.apache.catalina.tribes.transport.nio.ParallelNioSender.sendMessage(ParallelNioSender.java:97)
        at org.apache.catalina.tribes.transport.nio.PooledParallelSender.sendMessage(PooledParallelSender.java:53)
        at org.apache.catalina.tribes.transport.ReplicationTransmitter.sendMessage(ReplicationTransmitter.java:80)
        at org.apache.catalina.tribes.group.ChannelCoordinator.sendMessage(ChannelCoordinator.java:76)
        at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
        at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
        at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.sendMessage(TcpFailureDetector.java:88)
        at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
        at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
        at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:216)
        at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:175)
        at org.apache.catalina.ha.tcp.SimpleTcpCluster.send(SimpleTcpCluster.java:817)
        at org.apache.catalina.ha.tcp.SimpleTcpCluster.sendClusterDomain(SimpleTcpCluster.java:791)
        at org.apache.catalina.ha.tcp.ReplicationValve.send(ReplicationValve.java:553)
        at org.apache.catalina.ha.tcp.ReplicationValve.sendMessage(ReplicationValve.java:537)
        at org.apache.catalina.ha.tcp.ReplicationValve.sendSessionReplicationMessage(ReplicationValve.java:519)
        at org.apache.catalina.ha.tcp.ReplicationValve.sendReplicationMessage(ReplicationValve.java:430)
        at org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:363)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:662)
============================================================================


After this, the web application is no longer accessible, and we have to
manually kill the Tomcat process on one node, which effectively disables the
cluster.


We are unsure why this suddenly started occurring and why it disables
application access altogether. If there are any suggestions for a remedy,
please provide them.
