Hi Filip,

Yes, the "out of session sync" is the real cluster issue. We must find a way for a member to see that another member has come back. We must extend the current membership protocol so that a receiver has a chance to notice that a member has returned after an I/O problem or a restart.
But I have no idea for a good strategy to resync the sessions :(
The other thing I want to change: a normal shutdown/restart of a member should be signaled to the other members. Currently the admin sometimes has to wait a long time before a normal restart can be done.
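
The two protocol changes Peter asks for (noticing that a member has come back, and announcing a normal shutdown) could look roughly like the sketch below. It is only an illustration under stated assumptions, not Tomcat code: the class `MembershipSketch`, the `incarnation` value (for example the member's start time) and the explicit shutdown message are all hypothetical. It also leaves the open question of how to resync the sessions untouched; it only produces the event that a resync could hook into.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical receiver-side membership table; none of these names come from Tomcat.
class MembershipSketch {

    enum Event { ADDED, ALIVE, RETURNED, REMOVED, UNKNOWN }

    private static final class Entry {
        long incarnation; // assumed to change on every restart, e.g. the start time
        long lastSeen;
        boolean expired;
    }

    private final Map<String, Entry> members = new HashMap<>();
    private final long timeToExpiration;

    MembershipSketch(long timeToExpiration) {
        this.timeToExpiration = timeToExpiration;
    }

    /** Processes a heartbeat and reports whether the member is new, alive, or back. */
    synchronized Event heartbeat(String id, long incarnation, long now) {
        Entry e = members.get(id);
        if (e == null) {
            e = new Entry();
            e.incarnation = incarnation;
            e.lastSeen = now;
            members.put(id, e);
            return Event.ADDED;
        }
        // Back after an expiry (I/O problem) or with a new incarnation (restart).
        boolean returned = e.expired || e.incarnation != incarnation;
        e.incarnation = incarnation;
        e.lastSeen = now;
        e.expired = false;
        return returned ? Event.RETURNED : Event.ALIVE;
    }

    /** Processes an explicit shutdown announcement: no waiting for the timeout. */
    synchronized Event shutdown(String id) {
        return members.remove(id) != null ? Event.REMOVED : Event.UNKNOWN;
    }

    /** Marks silent members as expired but keeps them, so a return can be detected. */
    synchronized int expire(long now) {
        int newlyExpired = 0;
        for (Entry e : members.values()) {
            if (!e.expired && now - e.lastSeen > timeToExpiration) {
                e.expired = true;
                newlyExpired++;
            }
        }
        return newlyExpired;
    }
}
```

A RETURNED event would be the natural place to trigger a new state transfer, and a REMOVED event lets an admin restart a node without waiting for the expiration timeout.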

Peter



On 17.08.2007 at 21:39, Filip Hanik - Dev Lists wrote:

Peter Rossbach wrote:
Hi Filip,

OK, but the second one is a real problem, and the first one you can fix ;-)
Can you fix it by calling checkExpire in the RecoveryThread?
I don't know about this one, I could call checkExpire, but if the datagram socket is down, then is the expiration real? I guess this should be done, to still guarantee correct notifications according to how it works.

In a situation like this, your cluster will be out of sync, since once the network card is back up, no state transfer is initiated again.
What are your thoughts?
Filip


Peter


On 17.08.2007 at 21:11, Filip Hanik - Dev Lists wrote:

There are a few drawbacks to my current implementation that I need to think about:

1. I also reset the membership map; this should probably not be done at all.
2. During a failure, since I invoke stop to reset the thread, I am no longer sending out "member disappeared" messages, as the service is not running.

Filip

Filip Hanik - Dev Lists wrote:
hi Peter,
here is the SVN link
http://svn.apache.org/viewvc?view=rev&revision=567104

Basically, in the receiver/sender thread, I increment a counter whenever an error happens.
This counter also gets decremented upon success.
After X consecutive failures, I launch a new thread, called a RecoveryThread.
This thread simply invokes stop->init->start until it succeeds.

The recovery thread is set up as a singleton, i.e., only one can run at any point in time.
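
The counter-and-recovery scheme described above can be sketched roughly as follows. This is not the code committed in revision 567104: `RecoverySketch`, `Lifecycle` and `FailureCounter` are hypothetical names, and the sketch resets the counter to zero on success so that it counts consecutive failures, which is only one possible reading of "decremented upon success".

```java
import java.util.concurrent.atomic.AtomicBoolean;

class RecoverySketch {

    /** Hypothetical stand-in for the membership service's stop/init/start lifecycle. */
    interface Lifecycle {
        void stop() throws Exception;
        void init() throws Exception;
        void start() throws Exception;
    }

    /** Counts consecutive failures; a success resets the count (an assumption). */
    static final class FailureCounter {
        private final int threshold;
        private int failures;

        FailureCounter(int threshold) {
            this.threshold = threshold;
        }

        synchronized void success() {
            failures = 0;
        }

        /** Returns true once the threshold is reached, then starts counting afresh. */
        synchronized boolean failure() {
            if (++failures < threshold) {
                return false;
            }
            failures = 0;
            return true;
        }
    }

    // Guard that makes the recovery thread a singleton.
    private static final AtomicBoolean RUNNING = new AtomicBoolean(false);

    /** Repeats stop -> init -> start until it succeeds; returns the number of attempts. */
    static int runRecovery(Lifecycle service, long sleepMillis) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                service.stop();
            } catch (Exception ignore) {
                // The service may already be down; carry on with init/start.
            }
            try {
                service.init();
                service.start();
                return attempts;
            } catch (Exception x) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return attempts; // give up when interrupted
                }
            }
        }
    }

    /** Launches the recovery thread unless one is already running. */
    static boolean recover(Lifecycle service, long sleepMillis) {
        if (!RUNNING.compareAndSet(false, true)) {
            return false;
        }
        Thread t = new Thread(() -> {
            try {
                runRecovery(service, sleepMillis);
            } finally {
                RUNNING.set(false);
            }
        }, "Membership-Recovery-Sketch");
        t.setDaemon(true);
        t.start();
        return true;
    }

    public static void main(String[] args) {
        FailureCounter counter = new FailureCounter(3);
        for (int i = 1; i <= 3; i++) {
            System.out.println("failure " + i + " -> recover? " + counter.failure());
        }
    }
}
```

In a sender or receiver loop, the catch block would call `failure()` and, when it returns true, `recover(...)`; the compare-and-set guard means that a second request is simply ignored while a recovery is in progress.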

I think you'll find that the solution in 6 is much simpler, as I don't have to change any code in the existing membership stuff. I had to pull some initialization out of the constructor into the init() method, but after that I could use stop/init/start without changing the sender or receiver threads.

I also changed the logging a little bit, logging the error only once (after that it is logged at debug level) to avoid filling up the logs.
The recovery thread will log every 5 seconds.
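
The "log once, then debug" behaviour can be sketched as below. This is a minimal illustration, not the committed change: `ErrorLogThrottle` is a hypothetical name, and it uses java.util.logging (WARNING and FINE) only to stay self-contained, whereas the Tomcat code logs through its own `log` object.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: the first error in a run is loud, the following ones are quiet.
class ErrorLogThrottle {

    private static final Logger LOG = Logger.getLogger(ErrorLogThrottle.class.getName());

    private boolean failing;

    /** Returns WARNING for the first error in a run of failures, FINE afterwards. */
    synchronized Level levelForError() {
        Level level = failing ? Level.FINE : Level.WARNING;
        failing = true;
        return level;
    }

    /** A success ends the run, so the next error is logged loudly again. */
    synchronized void success() {
        failing = false;
    }

    void logError(String message, Throwable cause) {
        LOG.log(levelForError(), message, cause);
    }
}
```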

So, to really answer your question after all my bla bla: yes, the only option is to shut down the socket and start a new one. But to get it done right, I rely on McastServiceImpl to do the right thing during stop() and start(), instead of recoding that into a new method.

Filip

Peter Rossbach wrote:
Hi Filip,

can you explain your 6.0.x fix (http://issues.apache.org/bugzilla/show_bug.cgi?id=40042) a little bit, please? I think our only chance to recover membership after a cluster membership send failure is to reopen the socket.

Here is my current cluster 5.5 fix:

==
    public class SenderThread extends Thread {
        long time;
        McastServiceImpl service ;
        public SenderThread(long time, McastServiceImpl service) {
            this.time = time;
            this.service = service ;
            setName("Cluster-MembershipSender");

        }
        public void run() {
            long retry = 0 ;
            while ( doRun ) {
                try {
                    send();
                    retry = 0;
                } catch ( Exception x ) {
                    // FIXME: only increment when the network is really down: NoRouteToHostException or BindException
                    retry++ ;
                    log.warn("Unable to send mcast message.",x);
                }

                if(retry > 0) {
                    if(retry * time < timeToExpiration ) {
                        try {
                            Thread.sleep(time);
                        } catch ( Exception ignore ) {}
                        restartHeartbeat(retry);
                    } else {
                        long recover = retry % 10 ;
                        try {
                            Thread.sleep((recover+1)*time);
                        } catch ( Exception ignore ) {}
                        if( recover == 0) {
                            restartHeartbeat(retry) ;
                        }
                    }
                }
            }
        }

        private void restartHeartbeat(long retry) {
            try {
                socket.leaveGroup(address);
            } catch (IOException ignore) {}
            try {
                log.warn("Restarting membership heartbeat after send failure (number of recovery " + retry + ")");
                service.setupSocket();
                socket.joinGroup(address);
            } catch (IOException ignore) {}
        }

    }//class SenderThread
===
peter



On 17.08.2007 at 19:56, Filip Hanik - Dev Lists wrote:

Rainer Jung wrote:
Looks like an active weekend then ;)
I'm sorry, I just reread "Friday". Friday next week is totally fine. No one should have to work on a weekend. Also, for the mcast problem, I'm implementing a fix in 6.0 and 6.x; you should be able to copy that one.

Filip


I think that will suffice.

Regards,

Rainer

Filip Hanik - Dev Lists wrote:
Sounds good, let's shoot for Tue or Wed next week then.

Filip

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




