C0urante commented on PR #15915: URL: https://github.com/apache/kafka/pull/15915#issuecomment-2148092416
Even under heavy load, ten seconds is an extremely long time for loggers to be set. I think something else may be wrong. It looks like the startup check for distributed workers could be insufficient. By default, we wait to see if a worker's REST API is initialized, which is done by querying the `/connectors` endpoint (see [here](https://github.com/apache/kafka/blob/55d38efcc5505a5a1bddb08ba05f4d923f8050f9/tests/kafkatest/services/connect.py#L115)). However, as was noted in https://github.com/apache/kafka/pull/15249, that check barely does anything aside from ensure that a worker has a valid config and has initialized its REST server. Is it possible that the failures you've seen were caused because workers in the cluster were still starting by the time we issued the logging level adjustment request and waited for it to take effect? If so, we can try first to change the startup mode for this test from `STARTUP_MODE_LISTEN` (the default) to `STARTUP_MODE_JOIN`, which should give stronger guarantees about worker readiness. And as a follow-up, there's [KIP-1017](https://cwiki.apache.org/confluence/display/KAFKA/KIP-1017%3A+Health+check+endpoint+for+Kafka+Connect), which can be used in situations like this to avoid having to use hacks like parsing log files or checking for nonexistent connectors to determine a worker's health and readiness. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
