Hi Filip,

OK, but the second one is a real problem, and the first one you can fix ;-)
Could you fix it by calling checkExpire from the RecoveryThread?

Peter


On 17.08.2007 at 21:11, Filip Hanik - Dev Lists wrote:

There are a few drawbacks to my current implementation that I need to think about:

1. I also reset the membership map; this should probably not be done at all.
2. During a failure, since I invoke stop to reset the thread, I am no longer sending out "member disappeared" messages, as the service is not running.
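To illustrate the second drawback, here is a rough sketch of an expiry check that is independent of the sender/receiver threads, so it could still be called while the service is being restarted. MembershipTable, heartbeat(), and expire() are hypothetical names for illustration only; this is not the checkExpire code in Tomcat.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical membership table: member name -> time of last heartbeat (ms). */
class MembershipTable {
    private final Map<String, Long> lastHeard = new ConcurrentHashMap<>();
    private final long timeToExpirationMillis;

    MembershipTable(long timeToExpirationMillis) {
        this.timeToExpirationMillis = timeToExpirationMillis;
    }

    /** Records that a heartbeat from the given member arrived at nowMillis. */
    void heartbeat(String member, long nowMillis) {
        lastHeard.put(member, nowMillis);
    }

    /**
     * Removes every member whose last heartbeat is older than the expiration
     * time and returns their names, so the caller can send one
     * "member disappeared" notification per returned entry.
     */
    List<String> expire(long nowMillis) {
        List<String> gone = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeard.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > timeToExpirationMillis) {
                it.remove();
                gone.add(entry.getKey());
            }
        }
        return gone;
    }

    int size() {
        return lastHeard.size();
    }
}
```

Because the table is kept rather than reset, a recovery loop could call expire() on each pass and members that vanished during the outage would still be reported.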

Filip

Filip Hanik - Dev Lists wrote:
hi Peter,
here is the SVN link
http://svn.apache.org/viewvc?view=rev&revision=567104

Basically, in the receiver/sender thread, I increment a counter whenever an error happens.
This counter also gets decremented upon success.
After X consecutive failures, I launch a new thread, called a RecoveryThread.
This thread simply invokes stop->init->start until it succeeds.

The recovery thread is set up as a singleton, i.e., only one can run at any point in time.
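A minimal sketch of the counter-plus-recovery idea described above, not the actual McastServiceImpl code. The Restartable interface, the FailureTracker class, the threshold, and the retry delay are hypothetical names and values chosen for illustration. The sketch resets the counter on success, which is one reading of "consecutive failures"; the committed code may handle the counter differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the membership service lifecycle. */
interface Restartable {
    void stop() throws Exception;
    void init() throws Exception;
    void start() throws Exception;
}

/** Counts consecutive failures and allows at most one recovery thread at a time. */
class FailureTracker {
    private final Restartable service;
    private final int threshold;           // the "X" above; the value is an assumption
    private final long retryDelayMillis;   // pause between recovery attempts (assumption)
    private final AtomicBoolean recovering = new AtomicBoolean(false);
    private int consecutiveFailures = 0;

    FailureTracker(Restartable service, int threshold, long retryDelayMillis) {
        this.service = service;
        this.threshold = threshold;
        this.retryDelayMillis = retryDelayMillis;
    }

    /** Called by the sender/receiver loop after a successful send or receive. */
    synchronized void onSuccess() {
        consecutiveFailures = 0;
    }

    /** Returns the recovery thread if this failure started one, otherwise null. */
    synchronized Thread onFailure() {
        if (++consecutiveFailures < threshold) {
            return null;
        }
        consecutiveFailures = 0;
        // Singleton guard: only the caller that flips the flag may start a thread.
        if (!recovering.compareAndSet(false, true)) {
            return null;
        }
        Thread t = new Thread(this::recover, "RecoveryThread-sketch");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Repeats stop -> init -> start until all three succeed. */
    private void recover() {
        try {
            while (true) {
                try {
                    service.stop();
                    service.init();
                    service.start();
                    return;
                } catch (Exception e) {
                    try {
                        Thread.sleep(retryDelayMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        } finally {
            recovering.set(false);
        }
    }
}
```

The sender or receiver loop would call onSuccess() after each successful operation and onFailure() in its catch block. The compareAndSet guard is what keeps a second recovery thread from starting while one is still looping.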

I think you'll find that the solution in 6 is much simpler, as I don't have to change any code in the existing membership stuff. I had to pull some initialization out of the constructor into the init() method, but after that I could use stop/init/start without changing the sender or receiver threads.

I also changed the logging a little bit: the error is logged only once (after that it is logged at debug level) to avoid filling up the logs.
The recovery thread will log every 5 seconds.
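The log-once behaviour could look roughly like the sketch below. It uses java.util.logging only to stay self-contained (WARNING and FINE stand in for the warn and debug levels); ThrottledErrorLog is a hypothetical name, and resetting the flag after a success is an assumption, not something stated above.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Logs the first failure of a streak at WARNING and the rest at FINE. */
class ThrottledErrorLog {
    private final Logger log;
    private boolean alreadyWarned = false;

    ThrottledErrorLog(Logger log) {
        this.log = log;
    }

    /** Logs the failure and returns the level that was used. */
    synchronized Level failure(String message, Throwable cause) {
        Level level = alreadyWarned ? Level.FINE : Level.WARNING;
        alreadyWarned = true;
        log.log(level, message, cause);
        return level;
    }

    /** Assumption: a success ends the streak, so the next failure warns again. */
    synchronized void success() {
        alreadyWarned = false;
    }
}
```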

So, to really answer your question after all my bla bla:
Yes, the only option is to shut down the socket and start a new one. But to get it done right, I rely on McastServiceImpl to do the right thing during stop() and start(), instead of recoding that into a new method.

Filip

Peter Rossbach wrote:
Hi Filip,

Can you explain your 6.0.x fix (http://issues.apache.org/bugzilla/show_bug.cgi?id=40042) a little bit, please? I think our only chance to recover membership after a cluster membership send failure is to reopen the socket.

Here is my current cluster 5.5 fix:

==
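    public class SenderThread extends Thread {
        long time;                  // send interval, passed to Thread.sleep (milliseconds)
        McastServiceImpl service;

        public SenderThread(long time, McastServiceImpl service) {
            this.time = time;
            this.service = service;
            setName("Cluster-MembershipSender");
        }

        public void run() {
            long retry = 0;         // number of consecutive send failures
            while (doRun) {
                try {
                    send();
                    retry = 0;
                } catch (Exception x) {
                    // FIXME: Only increment if the network is really down:
                    // NoRouteToHostException or BindException
                    retry++;
                    log.warn("Unable to send mcast message.", x);
                }

                if (retry == 0) {
                    // Send succeeded: wait one interval before the next heartbeat.
                    try {
                        Thread.sleep(time);
                    } catch (InterruptedException ignore) {}
                } else if (retry * time < timeToExpiration) {
                    // Still inside the expiration window: reopen the socket after every interval.
                    try {
                        Thread.sleep(time);
                    } catch (InterruptedException ignore) {}
                    restartHeartbeat(retry);
                } else {
                    // Past the expiration window: back off and reopen only on every 10th failure.
                    long recover = retry % 10;
                    try {
                        Thread.sleep((recover + 1) * time);
                    } catch (InterruptedException ignore) {}
                    if (recover == 0) {
                        restartHeartbeat(retry);
                    }
                }
            }
        }

        // Leaves the multicast group, sets up the socket again and rejoins the group.
        private void restartHeartbeat(long retry) {
            try {
                socket.leaveGroup(address);
            } catch (IOException ignore) {}
            try {
                log.warn("Restarting membership heartbeat after send failure (recovery attempt " + retry + ")");
                service.setupSocket();
                socket.joinGroup(address);
            } catch (IOException ignore) {
                // The next failed send() will trigger another restart attempt.
            }
        }

    }//class SenderThread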
===
peter



On 17.08.2007 at 19:56, Filip Hanik - Dev Lists wrote:

Rainer Jung wrote:
Looks like an active weekend then ;)
I'm sorry, I just reread it: Friday. Friday next week is totally fine. No one should have to work on a weekend. Also, for the mcast problem, I'm implementing a fix in 6.0 and 6.x; you should be able to copy that one.

Filip


I think that will suffice.

Regards,

Rainer

Filip Hanik - Dev Lists wrote:
Sounds good, let's shoot for Tue or Wed next week then.

Filip

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




