gharris1727 commented on code in PR #15305:
URL: https://github.com/apache/kafka/pull/15305#discussion_r1591278453
##########
connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/WorkerCoordinator.java:
##########
@@ -267,6 +267,18 @@ public String memberId() {
return JoinGroupRequest.UNKNOWN_MEMBER_ID;
}
+ @Override
+ protected void handlePollTimeoutExpiry() {
+ log.warn("worker poll timeout has expired. This means the time between subsequent calls to poll() " +
+ "in DistributedHerder tick() method was longer than the configured rebalance.timeout.ms. " +
+ "If you see this happening consistently, then it can be addressed by either adding more workers " +
+ "to the connect cluster or by increasing the rebalance.timeout.ms configuration value. Please note that " +
Review Comment:
I think this is decent advice when requests are small and can be distributed
around the cluster, but since REST requests are rather infrequent, I think that
covers only a minority of cases.
I think this timeout is most often going to be triggered by an excessively
slow connector start, stop, or validation. In those cases, adding more workers
does nothing but move the error to a different worker. I think we can keep the
"adding more workers" comment if we include another piece of advice for
debugging excessively blocking tasks. Without that other piece of advice,
advising users to add workers is misleading.
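As a rough illustration of the kind of rewording I have in mind (this is a hypothetical message, not text from the PR), the warning could lead with the slow-operation diagnosis and keep the "add workers" advice as the secondary case:

```java
// Hypothetical sketch of a reworded poll-timeout warning. The wording and the
// buildWarning helper are illustrative only; they are not part of the PR.
public class PollTimeoutWarning {

    static String buildWarning(long rebalanceTimeoutMs) {
        return "worker poll timeout has expired. This means the time between subsequent "
            + "calls to poll() in the DistributedHerder tick() method exceeded "
            + "rebalance.timeout.ms (" + rebalanceTimeoutMs + " ms). This is often caused "
            + "by a slow connector start, stop, or validation blocking the tick thread; "
            + "check which operation was in progress when the timeout fired. If the "
            + "cluster is instead overloaded with many requests, adding more workers "
            + "or increasing rebalance.timeout.ms may help.";
    }

    public static void main(String[] args) {
        // Print the message for an assumed 60-second rebalance timeout.
        System.out.println(buildWarning(60000));
    }
}
```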
##########
connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/WorkerCoordinator.java:
##########
@@ -267,6 +267,18 @@ public String memberId() {
return JoinGroupRequest.UNKNOWN_MEMBER_ID;
}
+ @Override
+ protected void handlePollTimeoutExpiry() {
Review Comment:
Since we (as maintainers) don't have good insight into what commonly causes
the herder tick thread to block and the poll timeout to fire, we recently added
https://issues.apache.org/jira/browse/KAFKA-15563 to help users debug it
themselves.
It would be nice to integrate with this system to have the heartbeat thread
report what the herder tick thread was blocked on at the time that the poll
timeout happened, as this would report stalling that isn't caused by REST
requests.
The integration is tricky, though, because the WorkerCoordinator is (and
should be) unaware of the DistributedHerder. And currently, I think the
WorkerCoordinator hides these internal disconnects and reconnects inside the
poll method. Perhaps we can extend the WorkerRebalanceListener, or add a new
error listener, to allow the herder to be informed about these errors.
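To make the listener idea concrete, here is a minimal sketch of how the coordinator could notify the herder without knowing about it directly. The `PollTimeoutListener` interface, `onPollTimeoutExpiry` method, and `Coordinator` stand-in are all hypothetical names for illustration; none of them are existing Kafka APIs:

```java
// Hypothetical sketch: a callback-based design where the coordinator reports a
// poll-timeout expiry to a listener the herder registers. All names here are
// illustrative, not part of the Kafka Connect codebase.
public class PollTimeoutSketch {

    // Minimal listener the heartbeat thread could invoke when the poll timeout
    // fires, carrying whatever the tick thread was blocked on at that moment.
    interface PollTimeoutListener {
        void onPollTimeoutExpiry(String blockedStage);
    }

    // Stand-in for the coordinator side: it only knows the listener interface,
    // so it stays unaware of the DistributedHerder.
    static class Coordinator {
        private final PollTimeoutListener listener;

        Coordinator(PollTimeoutListener listener) {
            this.listener = listener;
        }

        // Called when the heartbeat thread detects the poll timeout expired.
        void handlePollTimeoutExpiry(String currentStage) {
            listener.onPollTimeoutExpiry(currentStage);
        }
    }

    public static void main(String[] args) {
        StringBuilder reported = new StringBuilder();
        // The herder side registers a listener; the coordinator never sees the herder.
        Coordinator coordinator = new Coordinator(
            stage -> reported.append("tick thread blocked on: ").append(stage));
        coordinator.handlePollTimeoutExpiry("connector validation");
        System.out.println(reported);
    }
}
```

The point of the indirection is that the dependency arrow still runs from the herder toward the coordinator, matching the existing WorkerRebalanceListener pattern.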
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]