[
https://issues.apache.org/jira/browse/KAFKA-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jennifer Thompson updated KAFKA-7845:
-------------------------------------
Description:
When one of our Kafka brokers dies and a new one replaces it (via an AWS auto-scaling group), the clients that publish to Kafka keep trying to publish to the old broker.
We see errors like
{code:java}
2019-01-18 20:16:16 WARN NetworkClient:721 - [Producer clientId=producer-1]
Connection to node 2 (/10.130.98.111:9092) could not be established. Broker may
not be available.
2019-01-18 20:19:09 WARN Sender:596 - [Producer clientId=producer-1] Got error
produce response with correlation id 3414 on topic-partition aa.pga-2, retrying
(4 attempts left). Error: NOT_LEADER_FOR_PARTITION
2019-01-18 20:19:09 WARN Sender:641 - [Producer clientId=producer-1] Received
invalid metadata error in produce request on partition aa.pga-2 due to
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition.. Going to request metadata update now
2019-01-18 20:21:19 WARN NetworkClient:721 - [Producer clientId=producer-1]
Connection to node 2 (/10.130.98.111:9092) could not be established. Broker may
not be available.
2019-01-18 20:21:50 ERROR ProducerBatch:233 - Error executing user-provided
callback on message for topic-partition 'aa.test-liz-0'{code}
and
{code:java}
[2019-01-18 20:28:47,732] ERROR WorkerSourceTask{id=rabbit-vpc-2-kafka-1}
Failed to flush, timed out while waiting for producer to flush outstanding 27
messages (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2019-01-18 20:28:47,732] ERROR WorkerSourceTask{id=rabbit-vpc-2-kafka-1}
Failed to commit offsets
(org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter)
{code}
The IP address referenced belongs to the broker that died. We also run Kafka Manager, and it does pick up the new broker.
We have already set {{networkaddress.cache.ttl}} in {{jre/lib/security/java.security}}.
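For reference, the JVM-level DNS cache setting we changed looks like the following in {{jre/lib/security/java.security}} (the TTL value shown is illustrative, not necessarily the one we use; any small positive TTL forces periodic re-resolution instead of caching forever):
{code:java}
# Example value only: cache successful hostname lookups for at most 30 seconds
networkaddress.cache.ttl=30
{code}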
Our Java version is "Java(TM) SE Runtime Environment (build 1.8.0_192-b12)".
This started happening after we upgraded to 2.1. When we had Kafka 1.1, brokers could fail over without a problem.
One possibly unusual aspect of our deployment is that we reuse the same broker id and EBS volume for the replacement broker, so that partitions do not have to be reassigned.
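One thing we have not tried yet, noting it here only as a guess rather than a confirmed fix, is the {{client.dns.lookup}} client setting introduced in 2.1 (KIP-302), e.g. in the producer config (broker hostnames below are placeholders, not our real hosts):
{code:java}
# Illustrative producer config, not our actual settings
bootstrap.servers=kafka-1.example.internal:9092,kafka-2.example.internal:9092
# "use_all_dns_ips" makes the client try every A record returned for a hostname
client.dns.lookup=use_all_dns_ips
{code}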
> Kafka clients do not re-resolve ips when a broker is replaced.
> --------------------------------------------------------------
>
> Key: KAFKA-7845
> URL: https://issues.apache.org/jira/browse/KAFKA-7845
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 2.1.0
> Reporter: Jennifer Thompson
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)