I started looking at the backlog of critical errors in Jira. One of them contains a fully working example of the issue. While it was reported under version 3.11.3, it appears to be present under 4.0.5 as well. I don't know the Go language, but my reading of the script is that, in a single Cassandra configuration, it inserts a value and then immediately updates it. At the end of the test the table is read, and any record that was not changed is noted as abnormal. Most of the time there are no abnormal entries, but every now and then it fails. Running with 10 inserts, it generates at least one abnormal result in 8 out of 10 runs.
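To make the shape of the test concrete, here is a minimal sketch of my reading of it, not the script from the ticket. Everything in it is hypothetical: the Store interface, memStore, lossyStore and runTest are names I made up, and an in-memory map stands in for the Cassandra session so the sketch runs without a cluster. lossyStore models a store that acknowledges an update and then loses it, which is one way an unchanged record could show up in the final read.

```go
package main

import (
	"fmt"
	"sort"
)

// Store is a hypothetical stand-in for the Cassandra session in the real script.
type Store interface {
	Insert(key int, value string)
	Update(key int, value string) bool // true means the store acknowledged the update
	Scan() map[int]string
}

// memStore is an in-memory Store that never loses a write.
type memStore struct{ rows map[int]string }

func newMemStore() *memStore { return &memStore{rows: make(map[int]string)} }

func (m *memStore) Insert(key int, value string) { m.rows[key] = value }

func (m *memStore) Update(key int, value string) bool {
	m.rows[key] = value
	return true
}

func (m *memStore) Scan() map[int]string {
	out := make(map[int]string, len(m.rows))
	for k, v := range m.rows {
		out[k] = v
	}
	return out
}

// lossyStore acknowledges every update but silently drops the one for key
// `drop`. This is an assumed failure model, not observed Cassandra behaviour.
type lossyStore struct {
	*memStore
	drop int
}

func (l lossyStore) Update(key int, value string) bool {
	if key == l.drop {
		return true // reports success without applying the change
	}
	return l.memStore.Update(key, value)
}

// runTest inserts n rows, updates each one immediately, then reads the table
// back and returns the sorted keys whose value was never changed.
func runTest(s Store, n int) []int {
	for i := 0; i < n; i++ {
		s.Insert(i, "inserted")
		s.Update(i, "updated")
	}
	var abnormal []int
	for k, v := range s.Scan() {
		if v != "updated" {
			abnormal = append(abnormal, k)
		}
	}
	sort.Ints(abnormal)
	return abnormal
}

func main() {
	fmt.Println("reliable store, abnormal keys:", len(runTest(newMemStore(), 10)))
	fmt.Println("lossy store, abnormal keys:", len(runTest(lossyStore{memStore: newMemStore(), drop: 3}, 10)))
}
```

The point of the sketch is only that the final read is the sole check: an update that was acknowledged but not applied looks exactly like an update that was never sent.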
This test uses the SimpleStrategy replication class with a replication factor of 1. I ran the test with <logger name="org.apache.cassandra.service.paxos" level="TRACE"> in the logback.xml. I ran it performing 50 inserts and updates, as 10 did not fail consistently when debugging was enabled (which sounds like a timing issue). There were no significant differences between the logs of the successful and the abnormal runs. The ordering of the MutationStage-1 and MutationStage-2 execution was slightly different: the good run had 28 calls from MutationStage-1 and 22 from MutationStage-2, while the bad run had 30 and 20 respectively. It looks like the system reports that the update succeeded but manages to lose the update later.

Does anyone have any idea how to approach debugging this?

Claude