Hi Filip,
yes, the "out of session sync" is the real cluster issue. We must
find a way for a member to see that another member has come back.
We must extend the current membership protocol so that a receiver
can detect that a member has come back after an I/O problem or a restart.
But I have no idea what a good strategy for resyncing sessions would be :(
The other thing I want to change: a normal shutdown/restart of a
member should be signaled to the other members.
Currently the admin must sometimes wait a long time before a normal
restart can be done.
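A minimal sketch of the two ideas above, assuming nothing about the real membership code: a table that tells an explicit shutdown message apart from a silent expiry, and that reports when a previously removed member shows up again. MembershipSketch, Event and all method names are hypothetical and are not part of the existing cluster API; REJOINED only marks the point where a session resync could be hooked in, it does not say how to resync.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical membership table; not the existing cluster classes. */
final class MembershipSketch {
    enum Event { ADDED, REFRESHED, REJOINED }

    private final long timeToExpiration;
    private final Map<String, Long> lastSeen = new HashMap<>();
    private final Set<String> departed = new HashSet<>();

    MembershipSketch(long timeToExpiration) {
        this.timeToExpiration = timeToExpiration;
    }

    /** Records a heartbeat and reports whether the member is new, known, or back again. */
    synchronized Event heartbeat(String member, long now) {
        boolean known = lastSeen.put(member, now) != null;
        if (known) {
            return Event.REFRESHED;
        }
        // A member we removed earlier has returned: possible trigger for a resync.
        return departed.remove(member) ? Event.REJOINED : Event.ADDED;
    }

    /** Handles an explicit "I am shutting down" message; no waiting for expiry. */
    synchronized boolean shutdown(String member) {
        if (lastSeen.remove(member) == null) {
            return false;
        }
        departed.add(member);
        return true;
    }

    /** Removes and returns members whose last heartbeat is older than timeToExpiration. */
    synchronized List<String> checkExpire(long now) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (now - e.getValue() > timeToExpiration) {
                it.remove();
                departed.add(e.getKey());
                expired.add(e.getKey());
            }
        }
        return expired;
    }
}
```

With an explicit shutdown message the other members can drop the node at once, so the admin would not have to wait for the expiration time before restarting it.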
Peter
On 17.08.2007 at 21:39, Filip Hanik - Dev Lists wrote:
Peter Rossbach wrote:
Hi Filip,
OK, but the second is a real problem, and the first you fix ;-)
Can you fix it by calling checkExpire in the RecoveryThread?
I don't know about this one. I could call checkExpire, but if the
datagram socket is down, is the expiration real?
I guess this should be done, to still guarantee correct
notifications according to how it works.
In a situation like this, your cluster will be out of sync, since
once the network card is back up, no state transfer is initiated again.
What are your thoughts?
Filip
Peter
On 17.08.2007 at 21:11, Filip Hanik - Dev Lists wrote:
There are a few drawbacks to my current implementation that I
need to think about. These are:
1. I also reset the membership map; this should probably not be
done at all
2. During a failure, since I invoke stop to reset the thread, I
am no longer sending out "member disappeared" messages, as the
service is not running
Filip
Filip Hanik - Dev Lists wrote:
hi Peter,
here is the SVN link
http://svn.apache.org/viewvc?view=rev&revision=567104
basically, in the receiver/sender thread, if an error happens, I
increment a counter.
this counter also gets decremented upon success.
after X consecutive failures, I launch a new thread, called a
RecoveryThread.
this thread simply invokes stop->init->start until it succeeds.
The recovery thread is set up as a singleton, i.e., only one can
run at any point in time.
I think you'll find that the solution in 6 is much simpler, as
I don't have to change any code in the existing membership stuff.
I had to pull some initialization out of the constructor into
the init() method, but after that I could use stop/init/start
without changing the sender or receiver threads.
I also changed the logging a little bit, logging the error only
once (after that it is logged at debug level) to avoid filling up the logs.
the recovery thread will log every 5 seconds.
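A rough sketch of the counter plus recovery thread idea described above, not the committed code (see the SVN revision for that). Restartable, RecoverySketch, maxFailures and retryDelayMillis are hypothetical names standing in for the membership service and the "X" threshold; the guard flag is just one possible way to make sure only one recovery thread runs at a time.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the membership service; only the lifecycle calls matter. */
interface Restartable {
    void stop() throws Exception;
    void init() throws Exception;
    void start() throws Exception;
}

final class RecoverySketch {
    private final Restartable service;
    private final int maxFailures;          // the "X" from the description (assumed configurable)
    private final long retryDelayMillis;    // pause between recovery attempts (assumption)
    private final AtomicInteger failures = new AtomicInteger();
    private final AtomicBoolean recovering = new AtomicBoolean(false);

    RecoverySketch(Restartable service, int maxFailures, long retryDelayMillis) {
        this.service = service;
        this.maxFailures = maxFailures;
        this.retryDelayMillis = retryDelayMillis;
    }

    int failureCount() {
        return failures.get();
    }

    /** Called by the sender/receiver loop after a successful send or receive. */
    void onSuccess() {
        failures.updateAndGet(n -> n > 0 ? n - 1 : 0);
    }

    /** Called after an error; returns the recovery thread if one was started, else null. */
    Thread onFailure() {
        if (failures.incrementAndGet() < maxFailures) {
            return null;
        }
        if (!recovering.compareAndSet(false, true)) {
            return null; // a recovery thread is already running
        }
        Thread t = new Thread(this::recoverUntilSuccess, "Membership-Recovery");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Repeats stop -> init -> start until one full cycle completes without an exception. */
    private void recoverUntilSuccess() {
        try {
            while (true) {
                try {
                    service.stop();
                    service.init();
                    service.start();
                    failures.set(0);
                    return;
                } catch (Exception e) {
                    try {
                        Thread.sleep(retryDelayMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        } finally {
            recovering.set(false);
        }
    }
}
```

The sender and receiver loops would only call onSuccess() and onFailure(), so their own logic stays untouched, which is the point made above about not changing the existing threads.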
So to really answer your question after all my bla bla,
Yes, the only option is to shut down the socket and start a new
one. But to get it done right, I rely on the McastServiceImpl to
do the right thing during stop() and start(),
instead of recoding that into a new method
Filip
Peter Rossbach wrote:
Hi Filip,
can you explain your 6.0.x fix
(http://issues.apache.org/bugzilla/show_bug.cgi?id=40042) a little bit, please?
I think our only chance to recover membership after a cluster
membership send failure is to reopen the socket.
Here is my current cluster 5.5 fix:
==
public class SenderThread extends Thread {
    long time;
    McastServiceImpl service;

    public SenderThread(long time, McastServiceImpl service) {
        this.time = time;
        this.service = service;
        setName("Cluster-MembershipSender");
    }

    public void run() {
        long retry = 0;
        while (doRun) {
            try {
                send();
                retry = 0;
            } catch (Exception x) {
                // FIXME: Only increment if the network is really down:
                // NoRouteToHostException or BindException
                retry++;
                log.warn("Unable to send mcast message.", x);
            }
            if (retry > 0) {
                if (retry * time < timeToExpiration) {
                    try {
                        Thread.sleep(time);
                    } catch (Exception ignore) {}
                    restartHeartbeat(retry);
                } else {
                    long recover = retry % 10;
                    try {
                        Thread.sleep((recover + 1) * time);
                    } catch (Exception ignore) {}
                    if (recover == 0) {
                        restartHeartbeat(retry);
                    }
                }
            }
        }
    }

    private void restartHeartbeat(long retry) {
        try {
            socket.leaveGroup(address);
        } catch (IOException ignore) {}
        try {
            log.warn("Restarting membership heartbeat after send failure (number of recovery " + retry + ")");
            service.setupSocket();
            socket.joinGroup(address);
        } catch (IOException ignore) {}
    }
} // class SenderThread
===
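To make the retry schedule in the fix above easier to follow, here is a small sketch that isolates only its arithmetic. HeartbeatBackoff and Step are hypothetical names, and the 500 ms / 3000 ms values in the note below are illustrative, not taken from this thread.

```java
/** Hypothetical helper that mirrors the sleep/restart decisions of the run() loop above. */
final class HeartbeatBackoff {

    /** What to do after a failed send: how long to sleep, and whether to reopen the socket. */
    static final class Step {
        final long sleepMillis;
        final boolean restart;

        Step(long sleepMillis, boolean restart) {
            this.sleepMillis = sleepMillis;
            this.restart = restart;
        }
    }

    /**
     * @param retry            number of consecutive failed sends (must be > 0)
     * @param time             heartbeat interval in milliseconds
     * @param timeToExpiration time after which other members drop this one, in milliseconds
     */
    static Step afterFailure(long retry, long time, long timeToExpiration) {
        if (retry * time < timeToExpiration) {
            // Still inside the expiration window: retry quickly and reopen every time.
            return new Step(time, true);
        }
        // Past the window: sleep longer and reopen only on every tenth failure.
        long recover = retry % 10;
        return new Step((recover + 1) * time, recover == 0);
    }

    public static void main(String[] args) {
        for (long retry = 1; retry <= 12; retry++) {
            Step s = afterFailure(retry, 500, 3000);
            System.out.println("retry " + retry + ": sleep " + s.sleepMillis
                    + " ms, restart=" + s.restart);
        }
    }
}
```

With time = 500 and timeToExpiration = 3000, failures 1 to 5 each sleep 500 ms and reopen the socket. From failure 6 on, the sleep is (retry % 10 + 1) * 500 ms and the socket is reopened only when retry is a multiple of 10.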
peter
On 17.08.2007 at 19:56, Filip Hanik - Dev Lists wrote:
Rainer Jung wrote:
Looks like an active weekend then ;)
I'm sorry, I just reread friday. Friday next week is totally
fine. No one should have to work on a weekend.
also, for the mcast problem, I'm implementing a fix in 6.0 and
6.x, you should be able to copy that one
Filip
I think that will suffice.
Regards,
Rainer
Filip Hanik - Dev Lists wrote:
sounds good, let's shoot for Tue or Wed next week then
Filip
----------------------------------------------------------------
-----
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
------------------------------------------------------------------
------
No virus found in this incoming message.
Checked by AVG Free Edition. Version: 7.5.484 / Virus Database:
269.12.0/957 - Release Date: 8/16/2007 1:46 PM