https://issues.apache.org/bugzilla/show_bug.cgi?id=46808
--- Comment #14 from Rainer Jung <rainer.j...@kippdata.de> 2009-03-09 20:01:13 PST ---

I discussed this with Mladen and we both committed a few changes. But let's first state what we expect to happen.

We don't want the system to behave overly nervously. If the load balancer marks a node as in ERROR, no more requests will be sent there for some time. Most applications need stickiness, and in many cases people do not use session replication. So marking a node as being in ERROR has serious implications for everyone whose sessions live on that node and who tries to access it while it is in error.

On the other hand, sending traffic to a node that really is broken obviously also has serious implications.

What everyone needs to do is use socket_connect_timeout and CPing/CPong to check that the node has some basic connectivity available. With these features each request can make sure for itself that it will fail over to another node in a relatively timely and robust manner. I think that worked in your situation.

What we don't want to do is take a node out of service (mark it as in ERROR) as soon as it fails to respond to one CPing or one connection attempt. Instead, we now look at the timestamp of when such bad behaviour was previously seen; if the node has been failing for longer than recover_time/2 without an intervening success, we mark it as ERROR.

In your case this should mean: directly after the cable breaks (let's assume you didn't unplug the cable but it magically broke), the system will behave as it does in 1.2.27. All requests for that node will take a little longer, because they first have to go through their connect or CPing/CPong timeouts. But after (default) 30 seconds, the node should be put into the global ERROR state, so no more requests will be sent there, unless every now and then a request succeeds, which means the node isn't totally broken.

I think that should be a good compromise.
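To make the escalation rule concrete, here is a minimal sketch in C of the behaviour described above. It is not the code committed to mod_jk: the type and function names (node, node_failure, node_success, default_escalation_time) are hypothetical, and the default recover_time of 60 seconds is an assumption inferred from the "(default) 30 seconds" mentioned above.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical names throughout; this is an illustration, not mod_jk source. */
typedef enum { NODE_OK, NODE_LOCAL_ERROR, NODE_ERROR } node_state;

typedef struct {
    node_state state;
    time_t first_failure;   /* start of the current failure streak;
                               only meaningful while state != NODE_OK */
} node;

/* Default escalation delay: half of recover_time
   (an assumed recover_time of 60 s gives 30 s). */
static int default_escalation_time(int recover_time)
{
    return recover_time / 2;
}

/* A successful request shows the node is not totally broken,
   so the failure streak is reset. */
static void node_success(node *n)
{
    assert(n != NULL);
    n->state = NODE_OK;
    n->first_failure = 0;
}

/* Called when a connect attempt or CPing/CPong times out.
   The request itself fails over to another node; the node is only
   escalated to the global ERROR state once the streak of failures
   is at least escalation_time seconds old. */
static node_state node_failure(node *n, time_t now, int escalation_time)
{
    assert(n != NULL);
    if (n->state == NODE_OK) {
        /* First failure after a healthy period: remember when it began. */
        n->first_failure = now;
        n->state = NODE_LOCAL_ERROR;
    }
    if (difftime(now, n->first_failure) >= escalation_time) {
        n->state = NODE_ERROR;
    }
    return n->state;
}
```

In this sketch a failure 29 seconds into a streak leaves the node in the local error state, a failure at 30 seconds or later escalates it to ERROR, and an escalation time of 0 escalates on the very first failure.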
This approach limits how long an initial problem, like the broken cable, negatively influences the system, but it also limits the negative influence a temporarily overloaded node would have if we put it into ERROR immediately.

The magic limit (recover_time/2) can be configured independently, so you can get close to the 1.2.22 behaviour by setting error_escalation_time to 0, but I don't recommend doing that.

This is new code, and I hope you find a chance to test it. I will put it into

http://people.apache.org/~rjung/mod_jk-dev/source/jk-1.2.28-dev/

in a minute. It might not yet be the latest revision for 1.2.28, but I think it's close :)