Rainer,
On 5/26/22 16:46, Rainer Jung wrote:
Hi Chris,
Am 16.05.2022 um 19:48 schrieb Christopher Schultz:
I've been looking into this a little more in my production environment.
These errors are not super common, but there seems to be a steady
trickle of errors from my two services that have human users. I see 0
errors for my API-based services, which makes me think that I don't
have any performance issues... I probably have human users
disappearing for random reasons.
Could be unstable (mobile) client connections. Or people already
clicking on the next frontend action before they received the complete
response. But that is speculating. So it is right, you try to identify
some individual reasons to understand more.
The errors in mod_jk.log look like this:
[Sun May 15 04:19:15.643 2022] [5859:139625025315392] [info]
ajp_process_callback::jk_ajp_common.c (2077): (myworker) Writing to
client aborted or client network problems
[Sun May 15 04:19:15.644 2022] [5859:139625025315392] [info]
ajp_service::jk_ajp_common.c (2773): (myworker) sending request to
tomcat failed (unrecoverable), because of client write error (attempt=1)
[Sun May 15 04:19:15.646 2022] [5859:139625025315392] [info]
service::jk_lb_worker.c (1595): service failed, worker myworker is in
local error state
[Sun May 15 04:19:15.646 2022] [5859:139625025315392] [info]
service::jk_lb_worker.c (1614): unrecoverable error 200, request
failed. Client failed in the middle of request, we can't recover to
another instance.
[Sun May 15 04:19:15.646 2022] [5859:139625025315392] [info]
jk_handler::mod_jk.c (2984): Aborting connection for worker=myworker
(Note that the last message "Aborting connection for worker=myworker"
may have a bug; my actual worker name has a name containing a hyphen
(-) and only the text before the hyphen is being emitted in that error
message.)
Strange, never observed that, but maybe never used a hyphen. Docs say,
hypens are allowed. Would be interesting to do a server startup with
trace-Logging and see where things corresponding to the name start to go
wrong. But of course not related to sporadic client failures.
Right. I may investigate that separately, as I have a setup already with
everything in place.
Anyway, when researching these errors, it would be helpful to me to
know which requests are failing. By looking at the access_log, I only
see a single request logged for 04:19:15 on that server so it's
probably the right one, but httpd says that the response code is 302
instead of e.g. 50x for an error.
What I typically do:
- log "%P:%{tid}P" in your Apache httpd custom LogFormat used for the
access log.
- make sure, I log in in the Apache httpd access log the request
timestamp including milli or microseconds (not default but
configurable). Can be done by using the %{format}t syntax in the
LogFormat and adding "usec_frac" to the format.
- adding %D to the access log format (duration in microseconds)
- remember that Apache logs start of request as default time stamp, but
mod_jk logs at the moment of error, so later than start of request.
Finding the right access log line for a mod_jk error log line now means:
- filter the access log according to the PID:TID logged in the mod_jk
error log. In your case 5859:139625025315392. We know, that the requests
handled by one thread in one process are run strictly sequentially.
- look for the last request in this filtered list, that by access log
line timestamp started before (or unlikely exactly at) the point in time
given by the mod_jk access log. If you find one exactly add, it might be
also the one directly before.
- look at the request durations of these one or two requests to double
check, whether the times fit.
If you can spare the additional log line bytes, you can additionally log
the end of request timestamp in the apache access log (prefix "format"
by "begin:").
Especially by adding this "enhanced logging", it was very easy to find
the failing requests.
Fortunately for me, the JK log is "now" and the request_time is the
start-of-request, so I can see the delay between the two. In the cases
I've seen since I started watching the log, it's typically a very short
tie like 1-2 sec which shouldn't be something a user gets tired waiting for.
I was more worried like "mod_jk waited 35 seconds for a response from
upstream and the user went away" and that's not the case.
So I'm happy to find that the reason for these errors is pathological
user behavior and not some performance problem on my end.
Thanks,
-chris
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org