Hi Peter,
Here is the SVN link:
http://svn.apache.org/viewvc?view=rev&revision=567104
Basically, whenever an error happens in the receiver/sender thread,
I increment a counter.
This counter also gets decremented upon success.
After X consecutive failures, I launch a new thread, called a
RecoveryThread.
This thread simply invokes stop->init->start until it succeeds.
The recovery thread is set up as a singleton, i.e., only one can run at
any point in time.
I think you'll find that the solution in 6 is much simpler, as I don't
have to change any code in the existing membership stuff.
I had to pull some initialization out of the constructor into the
init() method, but after that I could use stop/init/start
without changing the sender or receiver threads.
I also changed the logging a little bit: the error is logged only once
(after that it is logged at debug level) to avoid filling up the logs.
The recovery thread will log every 5 seconds.
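The counter and the recovery thread described above could be sketched roughly as follows. This is not the committed code (the SVN link has that); `Restartable`, `FailureTracker`, the threshold and the retry delay are names and parameters made up for illustration, and the counter is reset on success here, which is only one reading of "consecutive".

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the membership service; only the lifecycle calls matter. */
interface Restartable {
    void stop() throws Exception;
    void init() throws Exception;
    void start() throws Exception;
}

/** Counts consecutive failures seen by a sender or receiver loop. */
final class FailureTracker {
    private final int threshold; // the "X" from the description (assumed configurable)
    private int errors = 0;

    FailureTracker(int threshold) { this.threshold = threshold; }

    /** A successful send/receive ends the current run of failures. */
    synchronized void onSuccess() { errors = 0; }

    /** Records a failure; returns true (and starts a new count) once the threshold is hit. */
    synchronized boolean onFailure() {
        if (++errors >= threshold) {
            errors = 0;
            return true;
        }
        return false;
    }
}

/** Restarts the service via stop -> init -> start until it succeeds; at most one runs at a time. */
final class RecoveryThread extends Thread {
    private static final AtomicBoolean running = new AtomicBoolean(false);
    private final Restartable service;
    private final long retryDelayMs;

    private RecoveryThread(Restartable service, long retryDelayMs) {
        this.service = service;
        this.retryDelayMs = retryDelayMs;
        setName("Membership-Recovery");
        setDaemon(true);
    }

    /** Starts a recovery thread, or returns null if one is already active. */
    static RecoveryThread recover(Restartable service, long retryDelayMs) {
        if (!running.compareAndSet(false, true)) {
            return null;
        }
        RecoveryThread t = new RecoveryThread(service, retryDelayMs);
        t.start();
        return t;
    }

    @Override
    public void run() {
        try {
            while (true) {
                try {
                    service.stop();
                } catch (Exception ignore) {
                    // the socket may already be dead; carry on with init/start
                }
                try {
                    service.init();
                    service.start();
                    return; // recovered
                } catch (Exception x) {
                    System.err.println("Recovery attempt failed, will retry: " + x);
                }
                try {
                    Thread.sleep(retryDelayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        } finally {
            running.set(false); // allow a later recovery to be started
        }
    }
}
```

A sender or receiver loop would call `onSuccess()` after each successful operation and, when `onFailure()` returns true, call `RecoveryThread.recover(service, 5000)`. Because the guard is a single compare-and-set, it does not matter whether the sender or the receiver trips first: the second request is simply refused while a recovery is in progress.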
So, to really answer your question after all my blah blah:
yes, the only option is to shut down the socket and start a new one. But
to get it done right, I rely on McastServiceImpl to do the right
thing during stop() and start(),
instead of recoding that into a new method.
Filip
Peter Rossbach wrote:
Hi Filip,
Can you explain your 6.0.x fix
(http://issues.apache.org/bugzilla/show_bug.cgi?id=40042) a little
bit, please?
I think the only chance we have to recover membership after a cluster
membership send failure is to reopen the socket.
Here is my current cluster 5.5 fix:
==
public class SenderThread extends Thread {
    long time;
    McastServiceImpl service;

    public SenderThread(long time, McastServiceImpl service) {
        this.time = time;
        this.service = service;
        setName("Cluster-MembershipSender");
    }

    public void run() {
        long retry = 0; // number of consecutive send failures
        while (doRun) {
            try {
                send();
                retry = 0;
            } catch (Exception x) {
                // FIXME: Only increment when the network is really down:
                // NoRouteToHostException or BindException
                retry++;
                log.warn("Unable to send mcast message.", x);
            }
            if (retry > 0) {
                if (retry * time < timeToExpiration) {
                    // Still within the expiration window: restart after every failure.
                    try {
                        Thread.sleep(time);
                    } catch (Exception ignore) {}
                    restartHeartbeat(retry);
                } else {
                    // Past the expiration window: sleep longer and restart
                    // only on every 10th failure.
                    long recover = retry % 10;
                    try {
                        Thread.sleep((recover + 1) * time);
                    } catch (Exception ignore) {}
                    if (recover == 0) {
                        restartHeartbeat(retry);
                    }
                }
            } else {
                // Normal case: wait one heartbeat interval before the next send.
                try {
                    Thread.sleep(time);
                } catch (Exception ignore) {}
            }
        }
    }

    // Leave the multicast group, set up a fresh socket, and join again.
    private void restartHeartbeat(long retry) {
        try {
            socket.leaveGroup(address);
        } catch (IOException ignore) {}
        try {
            log.warn("Restarting membership heartbeat after send failure (number of recovery " + retry + ")");
            service.setupSocket();
            socket.joinGroup(address);
        } catch (IOException ignore) {}
    }
}//class SenderThread
===
Peter
On 17.08.2007 at 19:56, Filip Hanik - Dev Lists wrote:
Rainer Jung wrote:
Looks like an active weekend then ;)
I'm sorry, I just reread "Friday". Friday next week is totally fine. No
one should have to work on a weekend.
Also, for the mcast problem, I'm implementing a fix in 6.0 and 6.x;
you should be able to copy that one.
Filip
I think that will suffice.
Regards,
Rainer
Filip Hanik - Dev Lists wrote:
Sounds good, let's shoot for Tue or Wed next week then.
Filip
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]