I started looking at the backlog of critical errors in Jira.  One ticket
contains a fully working example of the issue.  While it was reported under
version 3.11.3, it appears to be present under 4.0.5 as well.  I don't know
the Go language, but my reading of the script is that, in a single Cassandra
configuration, it inserts a value and then immediately updates it.  At the
end of the test the table is read and any record that was not changed is
noted as abnormal.  Most of the time there are no abnormal entries, but
every now and then it fails.  With 10 inserts per run, at least one abnormal
result shows up in 8 out of 10 runs.
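
To make my reading of the script concrete, here is a minimal sketch of the
final check in Python (not the Go code from the ticket).  The function name,
the row layout and the placeholder values are my own assumptions for
illustration; the real script reads the rows back from Cassandra.

```python
def find_abnormal(rows, inserted_value):
    """Return the keys whose stored value still equals the inserted value.

    `rows` is a hypothetical {key: value} snapshot of the table read at the
    end of the test. A row is "abnormal" when the update that followed the
    insert is not visible, i.e. the original value is still there.
    """
    return sorted(key for key, value in rows.items() if value == inserted_value)


# Made-up snapshot after three insert-then-update pairs; key 2 lost its update.
rows = {1: "updated", 2: "inserted", 3: "updated"}
print(find_abnormal(rows, inserted_value="inserted"))  # prints [2]
```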

This test uses the SimpleStrategy replication class with a replication
factor of 1.

I ran the test with <logger name="org.apache.cassandra.service.paxos"
level="TRACE"/> in logback.xml.  I ran it performing 50 inserts and
updates, as 10 did not fail consistently when debugging was enabled (which
sounds like a timing issue).  There were no significant differences
between the logs of the successful and the abnormal runs.

The ordering of the MutationStage-1 and MutationStage-2 execution was
slightly different.  The good run had 28 calls from MutationStage-1 and 22
from MutationStage-2; the bad run had 30 and 20 respectively.
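
For anyone who wants to repeat the comparison, per-thread counts like these
can be tallied with a small script.  This is only a sketch: it assumes the
thread name appears in square brackets on each log line (as with
"[%thread]" in the logback pattern), and the function name and the
command-line handling are my own.

```python
import re
import sys
from collections import Counter

# Assumption: each log line carries the thread name as "[MutationStage-N]".
THREAD_RE = re.compile(r"\[(MutationStage-\d+)\]")


def count_by_thread(lines):
    """Count log lines per MutationStage thread."""
    counts = Counter()
    for line in lines:
        match = THREAD_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_threads.py good-run.log bad-run.log
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as log_file:
            print(path, dict(count_by_thread(log_file)))
```

Comparing the order in which the two threads appear, not just the totals,
would probably be more telling if this is a timing issue.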

It looks like the system reports update success but manages to lose the
update later.

Does anyone have any idea how to approach debugging this?

Claude
