Fabian Bell created KAFKA-20416:
-----------------------------------

             Summary: RocksDB loses entries during broker patches.
                 Key: KAFKA-20416
                 URL: https://issues.apache.org/jira/browse/KAFKA-20416
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 3.9.1
         Environment: MSK with kafka.m7g.2xlarge instances
            Reporter: Fabian Bell


h2. Problem:

We observed unexpected behaviour in our production environment. We use a 
KTable to look up data from a topic we write to.

 
{code:java}
builder.table(topicName, Consumed.with(keySerde, valueSerde),
              Materialized.as(storeName))
{code}
 
After an MSK security patch, when we access the store in the processor, the 
store returns null for keys that have non-null entries in the topic that backs 
the KTable. We never write tombstones to this topic, nor do we have delete 
retention enabled.
This only happens on some of our instances.
We see the following Kafka Streams logs:
 
{code:java}
Committing task(s) 0_14 failed.
Detected the states of tasks [0_14] are corrupted. Will close the task as dirty and re-create and bootstrap from scratch.
Active task(s) got corrupted. Triggering a rebalance.
End offset for changelog our-topic-14 initialized as 16596290.
Restoration in progress for 1 partitions. {our-topic-14: position=0, end=16596290, totalRestored=0}
State transition from RUNNING to PARTITIONS_REVOKED
No followup rebalance was requested, resetting the rebalance schedule.
partition revocation took 80 ms.
State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED
State transition from PARTITIONS_ASSIGNED to RUNNING
{code}
 
This all happens within a few seconds, and the `Restoration in progress ...` 
line is the only restoration log we see. A full restoration usually takes 
about 30 minutes. The commit failure reports the following error:

{code:java}
o.a.k.c.e.TimeoutException: Timeout expired after 60000ms while awaiting AddOffsetsToTxn
{code}
We can fix this situation by clearing the state directory and forcing a full 
restoration.
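
For reference, a minimal sketch of what clearing the state directory can look like when done by hand. This is an illustration under assumptions, not necessarily the exact procedure we ran: the class name {{StateDirCleaner}} and the method {{clearStateDir}} are hypothetical, and the path passed in is whatever {{state.dir}} resolves to for the instance. It must only run while the Streams instance is stopped. Kafka Streams also offers {{KafkaStreams#cleanUp()}}, which removes the local state for the application and likewise has to be called while the instance is not running.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: wipes a local state directory so that the next start
// has to restore every store from its source/changelog topic.
public class StateDirCleaner {

    // Recursively deletes stateDir and everything below it.
    // Assumption: the Streams instance owning this directory is stopped.
    static void clearStateDir(Path stateDir) throws IOException {
        if (!Files.exists(stateDir)) {
            return; // nothing to clear
        }
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(stateDir)) {
            // Reverse order puts children before their parent directories.
            deepestFirst = paths.sorted(Comparator.reverseOrder())
                                .collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: StateDirCleaner <state-dir>");
            return;
        }
        clearStateDir(Path.of(args[0]));
    }
}
{code}

After the directory is removed, the next start finds no local store or checkpoint and restores from offset 0.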

h2. Context:

Each instance has its own persistent state directory. The configured state 
directory does not change.
Processing guarantee: exactly_once_v2
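
A sketch of the relevant settings, written as plain string keys so it does not depend on the Kafka client classes. The keys {{application.id}}, {{bootstrap.servers}}, {{processing.guarantee}} and {{state.dir}} are the standard Kafka Streams configuration names; the class name {{StreamsConfigSketch}} and the values passed in {{main}} (application ID, broker address, directory path) are placeholders, not our real values.

{code:java}
import java.util.Properties;

// Hypothetical illustration of the configuration described above.
public class StreamsConfigSketch {

    // Builds the Streams settings relevant to this report.
    static Properties buildConfig(String applicationId,
                                  String bootstrapServers,
                                  String stateDir) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        // Processing guarantee as stated in this report.
        props.put("processing.guarantee", "exactly_once_v2");
        // Persistent, per-instance directory that stays the same across restarts.
        props.put("state.dir", stateDir);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        Properties props = buildConfig("our-app", "broker:9092", "/var/lib/kafka-streams");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
{code}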


