Shekhar Prasad Rajak created KAFKA-20381:
--------------------------------------------

             Summary: Producer Transaction Session Heartbeat
                 Key: KAFKA-20381
                 URL: https://issues.apache.org/jira/browse/KAFKA-20381
             Project: Kafka
          Issue Type: New Feature
          Components: group-coordinator, offset manager, producer 
            Reporter: Shekhar Prasad Rajak


When a streaming processing engine (Flink, Spark Structured Streaming, Beam, or 
any custom exactly-once Kafka producer) opens a Kafka transaction and then 
crashes without recovering, the transaction remains in ONGOING state on the 
broker. This blocks the Last Stable Offset (LSO) for all read_committed 
consumers until the broker's transaction.timeout.ms expires.

The current timeout mechanism has a fundamental design flaw: 
transaction.timeout.ms serves two conflicting purposes:

* Correctness bound — "abort if this transaction runs longer than X" (data 
safety)
* Liveness signal — "abort if the producer is dead" (operational safety)
* 
Operators set transaction.timeout.ms high (often 15+ minutes) to accommodate 
long checkpoint intervals and GC pauses. This means a dead producer can block 
all downstream consumers for up to 15 minutes.

Things can go worse when streaming engine is responsible to end transaction but 
crashed: 

{{Phase 1: WRITE
  engine.beginTransaction()
  engine.send(records)       → AddPartitionsToTxn RPC → ONGOING on broker
                               txnStartTimestamp = T0

Phase 2: BARRIER (checkpoint/micro-batch boundary)  
  engine.prepareCommit()     → NO RPC to Kafka (client-side flag only)
                               Kafka broker: still ONGOING, no signal

Phase 3: COMMIT
  engine.commitTransaction() → EndTxn(COMMIT) → broker completes transaction
}}

  [ENGINE CRASH between Phase 1 and Phase 3]
  → No EndTxn ever arrives
  → Broker sees silence, interprets it as... nothing
  → LSO blocked for up to transaction.timeout.ms


After AddPartitionsToTxn is sent and data writing completes, no more RPCs are 
sent until EndTxn. The broker has no way to distinguish between:

* A healthy producer preparing to commit (checkpoint in progress — could take 5 
min)
* A dead producer that will never send EndTxn.

and this situation will block all downstream consumers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to