https://issues.apache.org/bugzilla/show_bug.cgi?id=46808





--- Comment #14 from Rainer Jung <rainer.j...@kippdata.de>  2009-03-09 20:01:13 PST ---
I discussed this with Mladen, and we both committed a few changes.

But let me first state what we expect to happen.

We don't want the system to behave overly nervously. If the load balancer marks
a node as in ERROR, no more requests will be sent there for some time. Most
applications need stickiness, and in many cases people do not use session
replication. So marking a node as being in ERROR has serious implications for
everyone whose session lives on that node and who tries to access it while
it is in ERROR.

On the other hand sending traffic to a node that really is broken obviously
also has serious implications.

What everyone needs to do is use socket_connect_timeout and CPing/CPong to
check that the node has basic connectivity. With these features each request
can make sure for itself that it fails over to another node in a relatively
timely and robust manner. I think that worked in your situation.
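
As a rough illustration of the settings mentioned above, a workers.properties
fragment might look like the following. The worker name, host, and all values
are made up for the example, and ping_mode/ping_timeout are assumed to be
available in your version (1.2.27 or later); please check the workers.properties
reference before copying anything.

```properties
# Hypothetical AJP worker; name, host and values are illustrative only.
worker.node1.type=ajp13
worker.node1.host=node1.example.com
worker.node1.port=8009

# Give up on a TCP connect attempt after 5 seconds (milliseconds).
worker.node1.socket_connect_timeout=5000

# Probe with CPing/CPong in all supported situations and wait up to
# 10 seconds (milliseconds) for the CPong reply.
worker.node1.ping_mode=A
worker.node1.ping_timeout=10000
```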

What we don't want to do is take a node out of service (mark it as in ERROR)
as soon as it fails to react to a single CPing or a single connection attempt.

Instead, we now look at the timestamp of when we last saw such bad behaviour.
If that is longer ago than recover_time/2, we mark the node as ERROR.

In your case this should mean: directly after the cable breaks (let's assume
you didn't unplug the cable but it magically broke), the system will behave as
it does in 1.2.27. All requests for that node will take a little longer,
because they first have to run into their connect or CPing/CPong timeouts.
But after (by default) 30 seconds, the node should be put into the global ERROR
state, so no more requests will be sent there, unless every now and then a
request succeeds, which means the node isn't totally broken.
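
To make the intended behaviour a bit more concrete, here is a small Python
sketch of the escalation rule. It is not the mod_jk code, only a model of one
plausible reading of the description above: the clock starts at the first
failed request since the last success, and a further failure escalates to
ERROR once error_escalation_time seconds have passed. The class and method
names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeState:
    """Toy model of one balanced node (hypothetical, not mod_jk internals)."""
    error_escalation_time: float = 30.0      # seconds; assumed default recover_time/2
    first_local_error: Optional[float] = None  # start of the current failure streak
    in_error: bool = False                   # global ERROR state

    def record_success(self, now: float) -> None:
        # A successful request shows the node is not totally broken,
        # so the failure streak is forgotten.
        self.first_local_error = None

    def record_failure(self, now: float) -> None:
        # Remember when the current failure streak began.
        if self.first_local_error is None:
            self.first_local_error = now
        # Escalate only if the streak has lasted long enough.
        if now - self.first_local_error >= self.error_escalation_time:
            self.in_error = True


if __name__ == "__main__":
    node = NodeState()
    # Cable breaks at t=0: every request fails after its own timeout.
    for t in (0.0, 5.0, 12.0, 31.0):
        node.record_failure(t)
        print(t, "ERROR" if node.in_error else "still in service")
```

In this model a single success anywhere in the streak resets the clock, which
is the "every now and then a request succeeds" case, and setting
error_escalation_time to 0 escalates on the very first failure.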

I think that should be a good compromise. It limits the amount of time an
initial problem, like the broken cable, negatively influences the system, but
it also limits the negative influence a temporarily overloaded node could have
if we put it into ERROR immediately.

The magical limit (recover_time/2) is configurable independently, so you can
get close to the 1.2.22 behaviour by setting error_escalation_time to 0, but I
don't recommend doing that.
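
For illustration only, setting the limit explicitly could look roughly like
this. The balancer and node names are hypothetical, and I am assuming here that
both attributes are set on the load balancer worker and are given in seconds;
please verify against the documentation shipped with 1.2.28.

```properties
# Hypothetical load balancer; worker names are illustrative only.
worker.list=balancer
worker.balancer.type=lb
worker.balancer.balance_workers=node1,node2

# Seconds a node stays in ERROR before the balancer retries it.
worker.balancer.recover_time=60

# Seconds of failures before escalating to the global ERROR state.
# 30 matches the default of recover_time/2; 0 would escalate immediately.
worker.balancer.error_escalation_time=30
```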

This is new code, and I hope you find a chance to test it. I will put it
into

http://people.apache.org/~rjung/mod_jk-dev/source/jk-1.2.28-dev/

in a minute.

It might not yet be the latest revision for 1.2.28, but I think it's close :)
