[
https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yannick updated KAFKA-9212:
---------------------------
Description:
When running the Kafka Connect S3 sink connector (Confluent 5.3.0), after one
broker was restarted (the leaderEpoch was updated at that point), the Connect
worker crashed with the following error:
[2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, groupId=connect-ls]
Uncaught exception in herder work thread, exiting:
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by
times in 30003ms
After investigation, it appears the consumer got fenced while sending
ListOffsetRequest in a loop and eventually timed out, as follows:
[2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3,
groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest,
replicaId=-1, partitionTimestamps={connect_ls_config-0={timestamp: -1,
maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}},
isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4
rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)
[2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3,
groupId=connect-ls] Attempt to fetch offsets for partition connect_ls_config-0
failed due to FENCED_LEADER_EPOCH, retrying.
(org.apache.kafka.clients.consumer.internals.Fetcher:985)
The above repeats multiple times until the timeout is reached.
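For context, the client-side behavior can be modeled as a simple retry loop: the Fetcher retries the ListOffsets call on FENCED_LEADER_EPOCH, refreshing metadata between attempts, and only gives up once the request timeout expires. A minimal sketch under that assumption (hypothetical helper names, not the actual Kafka client code):

```python
import time

def fetch_offsets(send_list_offsets, refresh_metadata, timeout_s=30.0):
    """Sketch of the Fetcher retry loop: retry on FENCED_LEADER_EPOCH,
    refreshing metadata between attempts, until the timeout expires."""
    deadline = time.monotonic() + timeout_s
    epoch = refresh_metadata()
    while time.monotonic() < deadline:
        if send_list_offsets(epoch) == "OK":
            return "offsets"
        # In this bug the refreshed metadata still reports the stale epoch,
        # so every subsequent attempt is fenced again.
        epoch = refresh_metadata()
    raise TimeoutError(f"Failed to get offsets by times in {int(timeout_s * 1000)}ms")
```

With metadata stuck at epoch 1 and a broker that fences anything below epoch 2, this loop can never succeed and surfaces exactly as the TimeoutException above.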
According to the debug logs, the consumer always gets a leaderEpoch of 1 for
this topic when starting up:
[2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3,
groupId=connect-ls] Updating last seen epoch from null to 1 for partition
connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
But according to our broker logs, the leaderEpoch should be 2, as follows:
[2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4]
connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader
Epoch was: 1 (kafka.cluster.Partition)
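The fencing follows the KIP-320 leader-epoch check: the broker compares the currentLeaderEpoch carried in the request against its own epoch for the partition. A simplified sketch of that comparison (illustrative only, not the actual broker code):

```python
from typing import Optional

def validate_leader_epoch(request_epoch: Optional[int], broker_epoch: int) -> str:
    """KIP-320-style check applied to ListOffsets/Fetch requests."""
    if request_epoch is None:
        return "OK"                    # clients that omit the epoch skip the check
    if request_epoch < broker_epoch:
        return "FENCED_LEADER_EPOCH"   # the client's metadata is stale
    if request_epoch > broker_epoch:
        return "UNKNOWN_LEADER_EPOCH"  # the broker's metadata is stale
    return "OK"
```

Here the client keeps sending currentLeaderEpoch=1 while the broker is at epoch 2, so every request lands in the FENCED_LEADER_EPOCH branch; if kafkacat omits the epoch field entirely, that would also explain why it is unaffected.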
This makes it impossible to restart the worker, as it will always get fenced
and eventually time out.
It is also impossible to consume with a 2.3 kafka-console-consumer, as follows:
kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic
connect_ls_config --from-beginning
The above just hangs forever (which is not expected, since there is data), and
we can see these debug messages:
[2019-11-19 22:17:59,124] DEBUG [Consumer clientId=consumer-1,
groupId=console-consumer-3844] Attempt to fetch offsets for partition
connect_ls_config-0 failed due to FENCED_LEADER_EPOCH, retrying.
(org.apache.kafka.clients.consumer.internals.Fetcher)
Interestingly, if we subscribe the same way with kafkacat (1.5.0), we can
consume without problems (the way kafkacat consumes must somehow differ):
kafkacat -b BOOTSTRAPSERVER:9092 -t connect_ls_config -o beginning
> Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
> ------------------------------------------------------------------
>
> Key: KAFKA-9212
> URL: https://issues.apache.org/jira/browse/KAFKA-9212
> Project: Kafka
> Issue Type: Bug
> Components: consumer, offset manager
> Affects Versions: 2.3.0
> Environment: Linux
> Reporter: Yannick
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)