Charles Loots created LOGBACK-1402:
--------------------------------------
Summary: SocketAppender not detecting dirty TCP disconnect. Log events go missing.
Key: LOGBACK-1402
URL: https://jira.qos.ch/browse/LOGBACK-1402
Project: logback
Issue Type: Bug
Components: logback-core
Affects Versions: 1.2.3
Environment: I have set up several process instances, each using logback. The
instances are meant to be containerized in Docker and scaled out. The
instances (senders) use SocketAppender to centralize log events in a
different container, where a process (receiver) runs logback with a
ServerSocketReceiver. At this single instance I am able to further process
the log events in a central place.
The setup works perfectly, with one exception: when the receiver container is
shut down, the senders are not aware of the disconnect and take about 10
minutes to become aware of the problem with the socket, despite nearly constant
logging taking place. Only at this point do they reconnect.
Reporter: Charles Loots
Assignee: Logback dev list
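For reference, here is a minimal sketch of the setup described under
Environment (host name and port are illustrative, not taken from my actual
configuration):

Sender logback.xml:
{code:xml}
<configuration>
  <appender name="SOCKET" class="ch.qos.logback.classic.net.SocketAppender">
    <!-- illustrative endpoint of the receiver container -->
    <remoteHost>receiver-host</remoteHost>
    <port>6000</port>
    <reconnectionDelay>10 seconds</reconnectionDelay>
  </appender>
  <root level="INFO">
    <appender-ref ref="SOCKET"/>
  </root>
</configuration>
{code}

Receiver logback.xml:
{code:xml}
<configuration>
  <receiver class="ch.qos.logback.classic.net.server.ServerSocketReceiver">
    <port>6000</port>
  </receiver>
  <!-- received events are handled by the local appenders -->
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d %-5level [%thread] %logger - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="DEBUG">
    <appender-ref ref="CONSOLE"/>
  </root>
</configuration>
{code}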
It would appear that a remote container shutdown results in a dirty TCP
disconnect: the remote TCP server disappears abruptly, leaving the client side
with the socket in the CLOSE_WAIT state.
While this is the case, the client side is happily appending events to the
SocketAppender, which in turn keeps writing them to the socket (the Send-Q
on the socket keeps growing) without detecting any errors.
This keeps going for several minutes (between 7 and 12 in my tests), during
which log events go missing, presumably because the kernel keeps buffering and
retransmitting the unacknowledged data. Only at this very late point is an
error raised and the reconnect logic triggered.
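The behaviour should be reproducible outside logback with a plain socket
(hypothetical sketch; host, port and timing below are illustrative):
{code:java}
import java.io.OutputStream;
import java.net.Socket;

// Hypothetical stand-alone repro: connect, kill the receiver process,
// and watch write() keep succeeding while the socket's Send-Q grows.
public class DirtyDisconnectRepro {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("receiver-host", 6000); // illustrative endpoint
        OutputStream out = socket.getOutputStream();
        byte[] payload = new byte[1024];
        for (int i = 0; ; i++) {
            // write() merely copies into the kernel send buffer; no error is
            // raised until TCP retransmission gives up many minutes later.
            out.write(payload);
            out.flush();
            System.out.println("batch " + i + " written without error");
            Thread.sleep(100);
        }
    }
}
{code}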
I found that the logstash-logback-encoder project encountered and fixed this
same problem in its *AbstractLogstashTcpSocketAppender*.
[https://github.com/logstash/logstash-logback-encoder/issues/76]
[https://github.com/logstash/logstash-logback-encoder/issues/85]
They effectively fixed the problem by attempting a read from the socket (with
a short read timeout), which results in much more timely detection of the
disconnect. This read could take place on a separate watchdog thread dedicated
to detecting the problem.
[https://github.com/logstash/logstash-logback-encoder/blob/master/src/main/java/net/logstash/logback/appender/AbstractLogstashTcpSocketAppender.java#L385]
The call() method constantly attempts to read from the socket. While the
connection is healthy, each read simply times out after the short read timeout
period; once the remote end is gone, the read instead returns -1 (end of
stream) or fails, so an exception is triggered much more reliably and promptly
when the socket is no longer available.
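A minimal sketch of the same idea, assuming the appender hands its active
Socket to a watchdog thread (the class and names below are illustrative, not
the logstash or logback API):
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Illustrative watchdog: probe the socket with short-timeout reads and
// close it as soon as the peer disappears, so that the writer fails fast
// and the appender's reconnect logic kicks in.
class DisconnectWatchdog implements Runnable {
    private final Socket socket;

    DisconnectWatchdog(Socket socket) {
        this.socket = socket;
    }

    @Override
    public void run() {
        try {
            socket.setSoTimeout(1000); // short read timeout
            InputStream in = socket.getInputStream();
            while (!socket.isClosed()) {
                try {
                    // The receiver never sends data, so a timeout is the
                    // normal, healthy outcome of this read.
                    if (in.read() == -1) {
                        break; // end of stream: the peer has closed
                    }
                } catch (SocketTimeoutException stillConnected) {
                    // connection looks healthy; keep probing
                }
            }
        } catch (IOException dead) {
            // any other I/O error also means the connection is gone
        }
        try {
            socket.close(); // makes the next write fail, triggering reconnect
        } catch (IOException ignored) {
        }
    }
}
{code}
Running such a watchdog alongside the writer should bring detection down from
minutes to roughly the configured read timeout.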