Travis Bischel created KAFKA-18957:
--------------------------------------
Summary: Impossible to recover hanging transactions in Kafka 3.9
in KRaft mode using EOS
Key: KAFKA-18957
URL: https://issues.apache.org/jira/browse/KAFKA-18957
Project: Kafka
Issue Type: Bug
Components: kraft
Affects Versions: 3.8.1, 3.9.0
Reporter: Travis Bischel
Attachments: docker-compose.yml, failures.tar.gz, log4j.properties,
script
h1. The problem
I am able to semi-reliably reproduce hanging unrecoverable transactions on
Kafka 3.8 using the bitnami container in KRaft mode with a cluster size of both
1 and 3. In the failure scenario, a transaction enters and permanently hangs in
either PrepareCommit or PrepareAbort. To recover the hanging transaction, I
have tried the following:
* Waiting past the transactional timeout. Even after waiting hours just to
see, the transaction remained hung.
* Sending an InitProducerID request with the LastProducerID and
LastProducerEpoch fields set. I permanently receive the CONCURRENT_TRANSACTIONS
error code (i.e., even after retrying the request for minutes).
* Initializing a _new_ transaction, that is, sending InitProducerID with
LastProducerID=-1 and LastProducerEpoch=-1. Again, I permanently receive
CONCURRENT_TRANSACTIONS.
* Using kafka-transactions.sh to forcefully abort the transaction. Nothing
happens; the transaction still shows as PrepareCommit or PrepareAbort.
I encountered this error first while trying to add support for KIP-848 to my
franz-go client. I first thought this was an error with the branch, but to
double check, I looped my integration tests on the main branch against Kafka
3.9 and periodically encountered the same permanent-hang failure. This is a new
failure; my tests have reliably passed against Kafka for years. I *do*
encounter this failure with eager consumer rebalancers, but it is less common.
Where the cooperative-sticky test suite may fail after three or four runs, the
range test suite (same tests, just using a different balancer) fails after ten
or so.
h1. Description of the test suite
franz-go has an integration test that:
* Starts one transactional producer that produces 500K records to t1 (the name
in the tests is random), ending and beginning transactions every 10K records
([here|https://github.com/twmb/franz-go/blob/master/pkg/kgo/txn_test.go#L29])
* Starts a chain of EOS consumer/producers. The first level (c1) consumes from
t1 and produces to t2, the second level (c2) consumes from t2 and produces to
t3, and the third level (c3) consumes from t3 and produces to t4.
* Starts a second set of EOS consumer/producers. One tries to join c1, but
aborts the first batch consumed and then leaves the group. Another also joins
c1 and aborts and leaves after two successful commits. The third joins c2 and
also aborts and leaves after two successful commits.
* After another 5 seconds, a second consumer that survives until the end joins
each of c1, c2, and c3 (to be clear: one new consumer joins c1, another new
consumer joins c2, a third new consumer joins c3).
* Another consumer joins c2 and aborts on the first transaction and leaves the
group. Another consumer joins c1 and aborts after one successful transaction
and leaves the group.
* The test waits. All groups stop processing once they have consumed 500K
records _and then_ time out after 100ms of waiting for more records.
* Finally, the test verifies several properties of what was produced and seen.
For example, all records must be seen in order, none may be seen twice, and
each must be modified appropriately at every step in the chain.
The chaining setup can be seen
[here|https://github.com/twmb/franz-go/blob/f30c518d6b727b9169a90b8c10e2127301822a3a/pkg/kgo/helpers_test.go#L482-L513].
The helpers_test.go file and txn_test.go file are the two to look at to read
everything that is happening.
The point is, the test is _somewhat_ rigorous in trying to trigger a few odd
edge conditions.
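The final verification step can be sketched as an exactly-once, in-order check. The flat `[]int` input here is a hypothetical simplification of the real per-topic bookkeeping in helpers_test.go:

```go
package main

import "fmt"

// verifySeen models the invariant the test suite asserts at the end of a
// run: record keys 0..n-1 must each be consumed exactly once and in order.
func verifySeen(seen []int, n int) error {
	if len(seen) != n {
		return fmt.Errorf("saw %d records, want %d", len(seen), n)
	}
	for i, v := range seen {
		if v != i {
			return fmt.Errorf("record %d at position %d: duplicated, skipped, or out of order", v, i)
		}
	}
	return nil
}

func main() {
	fmt.Println(verifySeen([]int{0, 1, 2}, 3)) // passes: <nil>
	fmt.Println(verifySeen([]int{0, 2, 1}, 3)) // fails: out of order
}
```

When a transaction hangs, downstream groups stall short of 500K records and a check like this never gets a complete, ordered sequence.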
h1. Logs
I have captured both client and server logs to show the client is sending
requests in the proper order and to show that the failure is happening entirely
broker side, not client side (i.e., if my understanding is correct, I am not
sending requests improperly that would cause this).
I have attached client and container logs in the .tar.gz attachment below;
each directory contains a README giving a high-level overview of what we're
seeing. I'm including the README for the single-broker setup here:
* client_15_logs: these are the isolated logs for the client in question that
entered a hanging state. You can see EndTxn sent on l290 and we receive a
successful reply. We then send ListTransactions to verify the client has exited
the Prepare state. We loop issuing ListTransactions for >10s, until we stop
trying to verify on l400. We try to force bumping the epoch by one to fix it in
InitProducerID, but that permanently loops receiving a CONCURRENT_TRANSACTIONS
error. This continues for the rest of the file.
* client_raw_logs: these are raw logs for all clients combined in the
TestTxn/cooperative-sticky output.
* container_logs: these are raw logs for the container, the output of `docker
container logs`
* container_logs_for_client: this is container_logs but with all lines that do
not mention the hung transactional ID deleted. The hung transactional ID is
314f54cdee329433ec598bd4584fd7f689b501fc29b3c172d9819be1d3d8ee60. On l89, we
can see __consumer_offsets being brought into the transaction, and then you see
an ADD_OFFSETS_TO_TXN request being finished on l92 and then a
TXN_OFFSET_COMMIT request being finished on l93. On l94, the broker prepares to
transition from Ongoing to PrepareCommit. On l97, the broker prepares to
transition from PrepareCommit to CompleteCommit, and on l98 the broker replies
to the client's END_TXN request with a success. However, immediately after it,
you can see LIST_TRANSACTIONS being looped with state filters PrepareCommit or
PrepareAbort, and you can see the transaction never transitions out of
PrepareCommit. LIST_TRANSACTIONS loops for a while, then you see the
INIT_PRODUCER_ID loop for a long time, permanently receiving
CONCURRENT_TRANSACTIONS.
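For reference, the transitions visible in those log lines can be modeled as a small state table. This is a partial sketch of the coordinator's state machine as I understand it (the real broker has more states, e.g. Empty and Dead):

```go
package main

import "fmt"

// TxnState names the coordinator states that appear in the broker logs above.
type TxnState string

const (
	Ongoing        TxnState = "Ongoing"
	PrepareCommit  TxnState = "PrepareCommit"
	PrepareAbort   TxnState = "PrepareAbort"
	CompleteCommit TxnState = "CompleteCommit"
	CompleteAbort  TxnState = "CompleteAbort"
)

// validNext is a partial sketch of the coordinator's transition table.
var validNext = map[TxnState][]TxnState{
	Ongoing:       {PrepareCommit, PrepareAbort},
	PrepareCommit: {CompleteCommit},
	PrepareAbort:  {CompleteAbort},
}

func canTransition(from, to TxnState) bool {
	for _, next := range validNext[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	// The logs show the broker *preparing* both transitions, yet
	// ListTransactions keeps reporting PrepareCommit, i.e. the second
	// transition appears never to be applied.
	fmt.Println(canTransition(Ongoing, PrepareCommit))        // true
	fmt.Println(canTransition(PrepareCommit, CompleteCommit)) // true
}
```

Both transitions are legal, so the stall looks like the PrepareCommit to CompleteCommit step being logged as prepared but never taking effect.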
h1. Files
* I have attached the docker-compose.yml I used for both a single broker setup
and a three broker setup. The single broker setup is commented out; the setups
can be easily swapped.
* I've attached the log4j.properties that I used to opt into debug logging on
the request logger as well as the transaction logger.
* I've attached the script I used to run the tests in a loop. The script
contains my local directory paths but should be easy enough to modify.
Essentially, it takes down the cluster and restarts it for a clean slate of
logs, changes into my kgo test directory, and then runs the test, piping client
log output to `logs`. *NOTE* The following environment variables must be set:
`KGO_TEST_STABLE_FETCH=true`, `KGO_LOG_LEVEL=debug`, and, when running with a
single broker, `KGO_TEST_RF=1`. The test does not always hang, but for me it
hangs quite frequently (every few runs).
* I've attached a .tar.gz file containing two directories, one containing
client & broker logs for a single broker setup, another containing broker only
logs for the three broker setup.
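To save digging through the script, the environment required for a single-broker run looks like this (my local directory paths are omitted; substitute your own checkout):

```shell
# Required by the franz-go test suite, per the script attached above.
export KGO_TEST_STABLE_FETCH=true
export KGO_LOG_LEVEL=debug
# Only when running against the single-broker compose setup:
export KGO_TEST_RF=1
```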
h1. franz-go branch
The ListTransactions verification is not in franz-go main; I have added it on a
branch and do not plan to merge it into main. If you are trying to run this
setup locally, the code is here:
[https://github.com/twmb/franz-go/pull/928/commits/474557f754df27abe0d417fdfb3b4857d135ded8]
I can also change stuff on my side to help debug what's going on here.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)