Re: Rolling 5.5.25?

Peter Rossbach Fri, 17 Aug 2007 13:29:01 -0700

Hi Filip,

yes, the "out of session sync" is the real cluster issue. We must find a way for a member to see that another member has come back. We must extend the current membership protocol to give the receiver a chance to notice that a member has come back after an I/O problem or restart.

But I have no idea for a good strategy to resync sessions :(

The other thing that I want to change: a normal shutdown/restart of a member should be signaled to the other members. Currently the admin sometimes must wait a long time before a normal restart can be done.


Peter



On 17.08.2007 at 21:39, Filip Hanik - Dev Lists wrote:

Peter Rossbach wrote:
Hi Filip,

OK, but the second is a real problem, and the first you can fix ;-)
Can you fix it by calling checkExpire in the RecoveryThread?
I don't know about this one. I could call checkExpire, but if the datagram socket is down, is the expiration real? I guess this should be done, to still guarantee correct notifications according to how it works.
In a situation like this, your cluster will be out of sync, since once the network card is back up, no state transfer is initiated again.
what are your thoughts?
Filip
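
To make the checkExpire question above concrete, here is a minimal sketch of heartbeat-based expiration. It is not the Tomcat code: the class MembershipSketch, its method signatures, and the use of plain strings as member ids are assumptions made only for illustration. It shows why expiration during a local socket outage is ambiguous: the map only knows that no heartbeat arrived, not whether the remote member or the local receiver was at fault.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical membership map keyed by member id, storing the last heartbeat time. */
final class MembershipSketch {
    private final Map<String, Long> lastSeen = new HashMap<>();
    private final long timeToExpiration;

    MembershipSketch(long timeToExpiration) {
        this.timeToExpiration = timeToExpiration;
    }

    /** Records a heartbeat; returns true if the member was unknown (new, or back after expiring). */
    synchronized boolean heartbeat(String member, long now) {
        return lastSeen.put(member, now) == null;
    }

    /** Removes and returns every member whose last heartbeat is older than timeToExpiration. */
    synchronized List<String> checkExpire(long now) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (now - e.getValue() > timeToExpiration) {
                expired.add(e.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

In this sketch a member that expired and then sends a heartbeat again is reported as new by heartbeat(), which is the point where a state transfer could be triggered again.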
Peter


On 17.08.2007 at 21:11, Filip Hanik - Dev Lists wrote:
There are a few drawbacks to my current implementation that I need to think about. These are:
1. I also reset the membership map; this should probably not be done at all.
2. During a failure, since I invoked stop to reset the thread, I am no longer sending out "member disappeared" messages, as the service is not running.
Filip

Filip Hanik - Dev Lists wrote:
hi Peter,
here is the SVN link
http://svn.apache.org/viewvc?view=rev&revision=567104
basically what I do: in the receiver/sender thread, if an error happens, I increment a counter.
this counter also gets decremented upon success.
after X number of consecutive failures, I launch a new thread, called a RecoveryThread.
this thread simply invokes stop->init->start until it succeeds.
The recovery thread is set up as a singleton, i.e., only one can run at any point in time.
I think you'll find that the solution in 6 is much simpler, as I don't have to change any code in the existing membership stuff. I had to pull some initialization out of the constructor into the init() method, but after that I could use stop/init/start
without changing the sender or receiver threads.
I also changed the logging a little bit, only logging the error once (after that, logging at debug) to avoid filling up the logs.
the recovery thread will log every 5 seconds.
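
The counter-plus-recovery-thread idea described above can be sketched roughly as follows. This is not the code from the SVN revision: the Restartable interface, the RecoverySketch class, the threshold and sleep parameters, and the choice to reset the counter after a successful restart are all assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a service that supports stop/init/start. */
interface Restartable {
    void stop() throws Exception;
    void init() throws Exception;
    void start() throws Exception;
}

/** Sketch: count failures and run at most one recovery thread at a time. */
class RecoverySketch {
    private final Restartable service;
    private final int maxFailures;   // the "X" failures before recovery starts
    private final long sleepMillis;  // pause between recovery attempts
    private int failures = 0;
    private final AtomicBoolean recovering = new AtomicBoolean(false);

    RecoverySketch(Restartable service, int maxFailures, long sleepMillis) {
        this.service = service;
        this.maxFailures = maxFailures;
        this.sleepMillis = sleepMillis;
    }

    /** Called by the sender/receiver after a successful operation. */
    synchronized void onSuccess() {
        if (failures > 0) failures--;
    }

    /** Called after an error; returns the recovery thread if one was started, else null. */
    Thread onFailure() {
        synchronized (this) {
            failures++;
            if (failures < maxFailures) return null;
        }
        // Singleton guard: only one recovery thread may run at any point in time.
        if (!recovering.compareAndSet(false, true)) return null;
        Thread t = new Thread(this::recover, "Membership-Recovery-Sketch");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Invokes stop -> init -> start until the whole sequence succeeds. */
    private void recover() {
        try {
            while (true) {
                try {
                    service.stop();
                    service.init();
                    service.start();
                    synchronized (this) { failures = 0; } // assumption: reset after recovery
                    return;
                } catch (Exception x) {
                    try {
                        Thread.sleep(sleepMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        } finally {
            recovering.set(false);
        }
    }
}
```

The sender and receiver threads would only need to call onSuccess() and onFailure(); the restart logic itself stays in the service's own stop(), init() and start().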

So to really answer your question after all my bla bla,
Yes, the only option is to shut down the socket and start a new one. But to get it done right, I rely on the McastServiceImpl to do the right thing during stop() and start(),
instead of recoding that into a new method

Filip

Peter Rossbach wrote:
Hi Filip,
can you explain your 6.0.x fix (http://issues.apache.org/bugzilla/show_bug.cgi?id=40042) a little bit, please? I think our only chance to recover membership after a cluster membership send failure is to reopen the socket.
Here is my current cluster 5.5 fix:

==
    public class SenderThread extends Thread {
        long time;
        McastServiceImpl service ;
        public SenderThread(long time, McastServiceImpl service) {
            this.time = time;
            this.service = service ;
            setName("Cluster-MembershipSender");

        }
        public void run() {
            long retry = 0 ;
            while ( doRun ) {
                try {
                    send();
                    retry = 0;
                } catch ( Exception x ) {
                    // FIXME: Only increment if the network is really down: NoRouteToHostException or BindException
                    retry++ ;
                    log.warn("Unable to send mcast message.",x);
                }

                if(retry > 0) {
                    if(retry * time < timeToExpiration ) {
                        try {
                            Thread.sleep(time);
                        } catch ( Exception ignore ) {}
                        restartHeartbeat(retry);
                    } else {
                        long recover = retry % 10 ;
                        try {
                            Thread.sleep((recover+1)*time);
                        } catch ( Exception ignore ) {}
                        if( recover == 0) {
                            restartHeartbeat(retry) ;
                        }
                    }
                }
            }
        }

        private void restartHeartbeat(long retry) {
            try {
                socket.leaveGroup(address);
            } catch (IOException ignore) {}
            try {
                log.warn("Restarting membership heartbeat after send failure (number of recovery " + retry + ")");
                service.setupSocket();
                socket.joinGroup(address);
            } catch (IOException ignore) {}
        }

    }//class SenderThread
===
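
To see what schedule the retry logic in run() above produces, the two branches can be pulled out into pure functions. The class name RetrySchedule and its method names are hypothetical; the arithmetic is copied from the loop above, and the values 500 ms and 3000 ms used below are example values only.

```java
/** Sketch: the sleep/restart decisions of the SenderThread loop as pure functions. */
final class RetrySchedule {
    private RetrySchedule() {}

    /** Extra sleep in ms after the given number of consecutive failures (0 if none). */
    static long sleepMillis(long retry, long time, long timeToExpiration) {
        if (retry <= 0) return 0;
        if (retry * time < timeToExpiration) return time; // still inside the expiration window
        return ((retry % 10) + 1) * time;                 // afterwards: wait 1x to 10x the interval
    }

    /** Whether the socket is re-created after the given number of consecutive failures. */
    static boolean restartsSocket(long retry, long time, long timeToExpiration) {
        if (retry <= 0) return false;
        if (retry * time < timeToExpiration) return true; // restart on every early failure
        return retry % 10 == 0;                           // later only on every 10th failure
    }
}
```

With time = 500 and timeToExpiration = 3000, failures 1 to 5 each sleep 500 ms and restart the socket. Failure 6 sleeps 3500 ms without a restart, and failure 10 sleeps 500 ms and restarts again.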
peter



On 17.08.2007 at 19:56, Filip Hanik - Dev Lists wrote:
Rainer Jung wrote:
Looks like an active weekend then ;)
I'm sorry, I just reread "Friday". Friday next week is totally fine. No one should have to work on a weekend. Also, for the mcast problem, I'm implementing a fix in 6.0 and 6.x; you should be able to copy that one.
Filip
I think that will suffice.

Regards,

Rainer

Filip Hanik - Dev Lists wrote:
sounds good, let's shoot for Tue or Wed next week then

Filip
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]