[
https://issues.apache.org/jira/browse/KAFKA-18957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17948960#comment-17948960
]
Travis Bischel commented on KAFKA-18957:
----------------------------------------
As a point of consolation, I have not ran into this in Kafka 4 with KIP-890
fully implemented.
> Impossible to recover hanging transactions in Kafka 3.9 in KRaft mode using
> EOS
> -------------------------------------------------------------------------------
>
> Key: KAFKA-18957
> URL: https://issues.apache.org/jira/browse/KAFKA-18957
> Project: Kafka
> Issue Type: Bug
> Components: kraft
> Affects Versions: 3.8.1, 3.9.0
> Reporter: Travis Bischel
> Priority: Major
> Attachments: docker-compose.yml, failures.tar.gz, log4j.properties,
> script
>
>
> h1. The problem
> I am able to semi-reliably reproduce hanging unrecoverable transactions on
> Kafka 3.8 using the bitnami container in KRaft mode with a cluster size of
> both 1 and 3. In the failure scenario, a transaction enters and permanently
> hangs in either PrepareCommit or PrepareAbort. To try to fix the hanging
> transaction, I have tried to:
> * Wait until after the transactional timeout. Even after waiting hours just
> to see, the transaction hung.
> * Send a InitProducerID request with the LastProducerID and
> LastProducerEpoch fields set. I receive the CONCURRENT_TRANSACTIONS error
> code back, permanently (i.e. even after retrying this request for minutes)
> * I have tried to initialize a _new_ transaction – that is, send
> InitProducerID with LastProducerID=-1 and LastProducerEpoch=-1. Again, I
> permanently receive CONCURRENT_TRANSACTIONS.
> * I have tried to use kafka-transactions.sh to try to forcefully abort the
> transaction. Nothing happens; the transaction still shows as PrepareCommit or
> PrepareAbort.
> I encountered this error first while trying to add support for KIP-848 to my
> franz-go client. I first thought this was an error with the branch, but to
> double check, I looped my integration tests on the main branch against Kafka
> 3.9 and periodically encountered the same permanent-hang failure. This is a
> new failure; my tests have reliably passed against Kafka for years. I *do*
> encounter this failure with eager consumer rebalancers, but it is less
> common. Where the cooperative-sticky test suite may fail after three or four
> runs, the range test suite (same tests, just using a different balancer)
> fails after ten or so.
> h1. Description of the test suite
> franz-go has an integration test that
> * Starts one transactional producer that produces 500K records to t1 (the
> name in the tests is random), ending and beginning transactions every 10K
> records
> ([here|https://github.com/twmb/franz-go/blob/master/pkg/kgo/txn_test.go#L29])
> * Starts a chain of EOS consumer/producers. The first level (c1) consumes
> from t1 and produces to t2, the second level (c2) consumes from t2 and
> produces to t3, and the third level (c3) consumes from t3 and produces to t4.
> * Starts a second set of EOS consumer/producers. One tries to join c1, but
> aborts the first batch consumed and then leaves the group. Another also joins
> c1 and aborts and leaves after two successful commits. The third joins c2 and
> also aborts and leaves after two successful commits.
> * After another 5 seconds, a second consumer that survives until the end
> joins each of c1, c2, and c3 (to be clear: one new consumer joins c1, another
> new consumer joins c2, a third new consumer joins c3).
> * Another consumer joins c2 and aborts on the first transaction and leaves
> the group. Another consumer joins c1 and aborts after one successful
> transaction and leaves the group.
> * The test waits. All groups stop processing once they have consumed 500K
> records _and then_ timeout after 100ms of waiting for more records.
> * Finally, the test verifies a bunch of things about what was produced &
> seen. For example, all records must be seen in order, they cannot be seen
> twice, and they must be modified at each step in the chain appropriately.
> The chaining setup can be seen
> [here|https://github.com/twmb/franz-go/blob/f30c518d6b727b9169a90b8c10e2127301822a3a/pkg/kgo/helpers_test.go#L482-L513].
> The helpers_test.go file and txn_test.go file are the two to look at to read
> everything that is happening.
> The point is, the test is _somewhat_ rigorous in trying to trigger a few odd
> edge conditions.
> h1. Logs
> I have captured both client and server logs to show the client is sending
> requests in the proper order and to show that the failure is happening
> entirely broker side, not client side (i.e., if my understanding is correct,
> I am not sending requests improperly that would cause this).
> I have attached client and container logs in the .tar.gz attachment below,
> with a README in each directory giving a high level overview of what we're
> seeing in each directory. I'm including the README for the single broker
> setup here:
> * client_15_logs: these are the isolated logs for the client in question
> that entered a hanging state. You can see EndTxn sent on l290 and we receive
> a successful reply. We then send ListTransactions to verify the client has
> exited the Prepare state. We loop issuing ListTransactions for >10s, until we
> stop trying to verify on l400. We try to force bumping the epoch by one to
> fix it in InitProducerID, but that permanently loops receiving a
> CONCURRENT_TRANSACTIONS error. This continues for the rest of the file.
> * client_raw_logs: these are raw logs for all clients combined in the
> TestTxn/cooperative-sticky output.
> * container_logs: these are raw logs for the container, the output of
> `docker container logs`
> * container_logs_for_client: this is container_logs but with all lines that
> do not mention the hung transactional ID deleted. The hung transactional ID
> is 314f54cdee329433ec598bd4584fd7f689b501fc29b3c172d9819be1d3d8ee60. On l89,
> we can see __consumer_offsets being brought into the transaction, and then
> you see an ADD_OFFSETS_TO_TXN request being finished on l92 and then a
> TXN_OFFSET_COMMIT request being finished on l93. On l94, the broker prepares
> to transition from Ongoing to PrepareCommit. On l97, the broker prepares to
> transition from PrepareCommit to CompleteCommit, and on l98 the broker
> replies to the client's END_TXN request with a success. However, immediately
> after it, you can see LIST_TRANSACTIONS being looped with state filters
> PrepareCommit or PrepareAbort, and you can see the transactions never
> transitions out of PrepareCommit. LIST_TRANSACTIONS loops for a while, then
> you see the INIT_PRODUCER_ID loop for a long time, permanently receiving
> CONCURRENT_TRANSACTIONS.
> h1. Files
> * I have attached the docker-compose.yml I used for both a single broker
> setup and a three broker setup. The single broker setup is commented out; the
> setups can be easily swapped.
> * I've attached the log4j.properties that I used to opt into debug logging
> on the request logger as well as the transaction logger.
> * I've attached the script I used to run the tests in a loop. The script
> contains my local directory paths but should be easy enough to modify.
> Essentially, it takes down the cluster and restarts it to have a clean slate
> of logs, changes into my kgo test directory, and then runs the test piping
> client log output to `logs`. *NOTE* The following environment variables are
> required to be set: `KGO_TEST_STABLE_FETCH=true`, `KGO_LOG_LEVEL=debug`, and
> when running with a single broker, `KGO_TEST_RF=1`. The test setup does not
> always hang, but for me, it hangs quite frequently (every few runs).
> * I've attached a .tar.gz file containing two directories, one containing
> client & broker logs for a single broker setup, another containing broker
> only logs for the three broker setup.
> h1. franz-go branch
> The ListTransactions verification is not in franz-go main; I have added it to
> a branch and to not plan to merge it to main. If you are trying to run this
> setup locally, the code is here:
> [https://github.com/twmb/franz-go/pull/928/commits/474557f754df27abe0d417fdfb3b4857d135ded8]
> I can also change stuff on my side to help debug what's going on here.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)