[
https://issues.apache.org/jira/browse/KAFKA-9803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Guozhang Wang updated KAFKA-9803:
---------------------------------
Component/s: (was: consumer)
producer
> Allow producers to recover gracefully from transaction timeouts
> ---------------------------------------------------------------
>
> Key: KAFKA-9803
> URL: https://issues.apache.org/jira/browse/KAFKA-9803
> Project: Kafka
> Issue Type: Improvement
> Components: producer, streams
> Reporter: Jason Gustafson
> Assignee: Boyang Chen
> Priority: Major
> Labels: needs-kip
>
> Transaction timeouts are detected by the transaction coordinator. When the
> coordinator detects a timeout, it bumps the producer epoch and aborts the
> transaction. The epoch bump is necessary to prevent the current producer
> from writing to a new transaction that was not started through the
> coordinator.
> Transactions may also be aborted if a new producer with the same
> `transactional.id` starts up. Similarly, this results in an epoch bump.
> Currently the coordinator does not distinguish these two cases. Both end up
> as a `ProducerFencedException`, which means the producer needs to shut itself
> down.
> We can improve this with the new APIs from KIP-360. When the coordinator
> times out a transaction, it can remember that fact and allow the existing
> producer to claim the bumped epoch and continue. Roughly the logic would work
> like this:
> 1. When a transaction times out, set lastProducerEpoch to the current epoch
> and do the normal bump.
> 2. Any transactional requests from the old epoch result in a new
> TRANSACTION_TIMED_OUT error code, which is propagated to the application.
> 3. The producer recovers by sending InitProducerId with the current epoch.
> The coordinator returns the bumped epoch.
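The three steps above can be sketched as a toy model of the coordinator's per-`transactional.id` state. This is only an illustration under stated assumptions, not Kafka's implementation: the class, the method names, the `-1` sentinel, and the choice to clear `lastProducerEpoch` when a new producer instance starts are all hypothetical, and an `int` stands in for the real epoch type.

```java
import java.util.OptionalInt;

/** Toy model of the proposed coordinator logic; all names here are hypothetical. */
final class TimeoutRecoverySketch {

    /** Outcomes of the epoch check; TRANSACTION_TIMED_OUT is the proposed new error. */
    enum TxnError { NONE, TRANSACTION_TIMED_OUT, PRODUCER_FENCED }

    /** State kept for one transactional.id. */
    static final class Coordinator {
        static final int NO_EPOCH = -1; // assumed sentinel: no timed-out epoch remembered
        int producerEpoch = 0;
        int lastProducerEpoch = NO_EPOCH;

        /** Step 1: remember the timed-out epoch, then do the normal bump. */
        void onTransactionTimeout() {
            lastProducerEpoch = producerEpoch;
            producerEpoch++;
        }

        /** Assumed behaviour when a new producer instance starts: bump, remember nothing. */
        void onNewProducerInstance() {
            lastProducerEpoch = NO_EPOCH;
            producerEpoch++;
        }

        /** True only if this epoch is the one that was bumped by a timeout. */
        boolean isTimedOutEpoch(int requestEpoch) {
            return lastProducerEpoch != NO_EPOCH && requestEpoch == lastProducerEpoch;
        }

        /** Step 2: classify a transactional request by the epoch it carries. */
        TxnError check(int requestEpoch) {
            if (requestEpoch == producerEpoch) return TxnError.NONE;
            if (isTimedOutEpoch(requestEpoch)) return TxnError.TRANSACTION_TIMED_OUT;
            return TxnError.PRODUCER_FENCED;
        }

        /** Step 3: a timed-out producer presents its old epoch and gets the bumped one. */
        OptionalInt claimBumpedEpoch(int requestEpoch) {
            return isTimedOutEpoch(requestEpoch)
                    ? OptionalInt.of(producerEpoch)
                    : OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator();
        int myEpoch = coordinator.producerEpoch;         // producer starts at epoch 0
        coordinator.onTransactionTimeout();              // transaction times out
        System.out.println(coordinator.check(myEpoch));  // TRANSACTION_TIMED_OUT
        myEpoch = coordinator.claimBumpedEpoch(myEpoch).getAsInt();
        System.out.println(coordinator.check(myEpoch));  // NONE
        coordinator.onNewProducerInstance();             // a second instance starts up
        System.out.println(coordinator.check(myEpoch));  // PRODUCER_FENCED
    }
}
```

The point of the sketch is that a single remembered epoch is enough to tell the two cases apart: a request carrying the remembered epoch was timed out, while any other stale epoch is treated as fenced.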
> One issue that needs to be addressed is how to handle INVALID_PRODUCER_EPOCH
> from Produce requests. Partition leaders generally will not know whether a
> bumped epoch was the result of a timed-out transaction or a fenced producer.
> Possibly the producer can treat these errors as abortable when they come from
> Produce responses. In that case, the user would try to abort the transaction,
> and then we can see whether it was due to a timeout or something else.
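The producer-side idea in the last paragraph might look roughly like the state machine below. It is a sketch under assumptions rather than the real client code: the states, the `AbortResponse` enum, and the method names are hypothetical, and the abort and the follow-up InitProducerId are collapsed into one step for brevity.

```java
/** Toy producer-side state machine; the states and method names are hypothetical. */
final class ProduceEpochErrorSketch {

    enum State { IN_TRANSACTION, ABORTABLE_ERROR, READY, FATAL_ERROR }

    /** What the coordinator is assumed to answer when the producer tries to abort. */
    enum AbortResponse { TRANSACTION_TIMED_OUT, PRODUCER_FENCED }

    static final class Producer {
        State state = State.IN_TRANSACTION;
        int epoch;

        Producer(int epoch) {
            this.epoch = epoch;
        }

        /** A partition leader rejected a Produce request with INVALID_PRODUCER_EPOCH. */
        void onInvalidProducerEpochFromProduce() {
            // The leader cannot tell a timeout from fencing, so do not fail fatally yet.
            if (state == State.IN_TRANSACTION) {
                state = State.ABORTABLE_ERROR;
            }
        }

        /** The user aborts; the coordinator's answer decides between recovery and shutdown. */
        void onAbortResponse(AbortResponse response, int bumpedEpoch) {
            if (state != State.ABORTABLE_ERROR) {
                throw new IllegalStateException("no abort pending");
            }
            if (response == AbortResponse.TRANSACTION_TIMED_OUT) {
                epoch = bumpedEpoch;        // claim the bumped epoch and carry on
                state = State.READY;
            } else {
                state = State.FATAL_ERROR;  // fenced by a newer producer instance
            }
        }
    }

    public static void main(String[] args) {
        Producer producer = new Producer(0);
        producer.onInvalidProducerEpochFromProduce();
        System.out.println(producer.state);  // ABORTABLE_ERROR
        producer.onAbortResponse(AbortResponse.TRANSACTION_TIMED_OUT, 1);
        System.out.println(producer.state + " at epoch " + producer.epoch);  // READY at epoch 1
    }
}
```

Deferring the decision to the abort keeps the Produce path simple: only the coordinator, which tracks why the epoch was bumped, has to classify the failure.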