DO NOT REPLY [Bug 52334] New: recover_time is not properly used

bugzilla Wed, 14 Dec 2011 13:47:41 -0800

https://issues.apache.org/bugzilla/show_bug.cgi?id=52334


             Bug #: 52334
           Summary: recover_time is not properly used
           Product: Tomcat Connectors
           Version: 1.2.32
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: Common
        AssignedTo: dev@tomcat.apache.org
        ReportedBy: aogb...@redhat.com
    Classification: Unclassified


The recover_time applies correctly the first time, but only the first time. 
This becomes evident when setting recover_time larger than 60 seconds and
larger than worker.maintain.  After the first recover attempt passes and the
worker errors back out to its global error state, the worker will be placed in
recovery mode again on the next invocation of the worker.maintain method.

Testing with recover_time at 120 and error_escalation_time at 0 for simplicity
and worker.maintain at the default 60, I see the worker is in global error upon
a failure and then goes to recovery properly the first time after >120 seconds,
but each subsequent time, it is placed in recovery after ~60 seconds when the
maintenance method fires.  I then tested this with a drastically lowered
worker.maintain at 5 seconds.  Again, I see the first time the 2 minute
recover_time is correctly used.  But here, the worker is retried and placed
into ERR/REC after just 5 seconds.  I tested again with worker.maintain at 15
seconds.  Again, I see the first time the 2 minute recover_time is correctly
used.  But here, the worker is retried and placed into ERR/REC after 15
seconds.

With these tests, I see that the LE (last error time) reported for the worker
by jkstatus never changes after it is set on the initial error, which may be
due to the following generic error block in jk_lb_worker.c's service method as
it only sets error_time when it is 0:

                else {
                    /*
                     * Various unspecific error cases.
                     * Keep previous global state, if we are not in local error
since to long.
                     * Do not try to reuse the same node for the same request.
                     * Failing over to another node could help.
                     */
                    time_t now = time(NULL);
                    rec->s->errors++;
                    if (rec->s->busy == 0 ||
                        p->worker->error_escalation_time == 0 ||
                        (rec->s->error_time > 0 &&
                         (int)difftime(now, rec->s->error_time) >=
p->worker->error_escalation_time)) {
                        if (JK_IS_DEBUG_LEVEL(l))
                            jk_log(l, JK_LOG_DEBUG,
                                   "worker %s escalating local error to global
error",
                                   rec->name);
                        rec->s->state = JK_LB_STATE_ERROR;
                    }
                    p->states[rec->i] = JK_LB_STATE_ERROR;
                    if (rec->s->error_time == 0) {
                        rec->s->error_time = now;
                    }
                    rc = JK_FALSE;
                }

And the unchanged error time appears to be the root cause of the improper
recoveries after checking code from recover_workers in jk_lb_worker.c:

        if (w->s->state == JK_LB_STATE_ERROR) {
            elapsed = (int)difftime(now, w->s->error_time);
            if (elapsed <= p->recover_wait_time) {
                if (JK_IS_DEBUG_LEVEL(l))
                    jk_log(l, JK_LOG_DEBUG,
                           "worker %s will recover in %d seconds",
                           w->name, p->recover_wait_time - elapsed);
            }
            else {
                if (JK_IS_DEBUG_LEVEL(l))
                    jk_log(l, JK_LOG_DEBUG,
                           "worker %s is marked for recovery",
                           w->name);
                if (p->lbmethod != JK_LB_METHOD_BUSYNESS)
                    w->s->lb_value = curmax;
                aw->s->reply_timeouts = 0;
                w->s->state = JK_LB_STATE_RECOVER;
                non_error++;
            }
        }


So after the first pass, error_time never updates again and so elapsed is
always > recover_wait_time, resulting in the worker being placed in recovery on
each call of the worker maintenance.

But I see too from the service method that escalation to global error is also
based upon this error_time, which might could cause issues if error_time is
updated for the sake of the recover_time check.  A node should be sent back to
global error if it fails in recovery mode, but if error_time is updated for
recover_time to function properly, would this still resolve to a global error
state?

Perhaps, the best route would be two separate variables: the error_time used
for the sake of error_escalation and only set/updated on error when error_time
is 0 and then a last_error_time for the sake of recover wait time that is
updated on any error occurrence.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org

DO NOT REPLY [Bug 52334] New: recover_time is not properly used

Reply via email to