benwtrent commented on issue #13127:
URL: https://github.com/apache/lucene/issues/13127#issuecomment-2264272195
OK, I added a bunch of logging and it seems like the issue is around
`DWPTP#getAndLock`.
I can see the following occurring, new DWPTs being created, each with the
first generation deleteQueue.
```
1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_0,
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds]
1> DWDQ: [ generation: 0 ] id: 245403753 tid: t0 20 getNextSequenceNumber 1
called from stack:...
1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_1,
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds]
1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_5,
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds]
1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_2,
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds]
1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_6,
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds]
1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_3,
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds]
1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_4,
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds]
```
Then reset of the deleteQueue occurs to bump its generation, and correctly,
it sees 7 pending actions, and a base seqNo of 1. All ending up with a maxSeqNo
of 9.
```
1> DWDQ: [ generation: 0 ] id: 245403753 tid: t020 advanceQueue called
from stack with maxSeq 9 lastSeqNo: 1 maxNumPendingOps: 7:
```
Now, we get some more `getNextSequenceNumber` calls from all the other
threads that were waiting. Each called on the zeroth generation:
```
1> DWDQ: [ generation: 0 ] id: 245403753 tid: t525 getNextSequenceNumber 2
called from stack:
1> DWDQ: [ generation: 0 ] id: 245403753 tid: t828 getNextSequenceNumber 3
called from stack:
1> DWDQ: [ generation: 0 ] id: 245403753 tid: t727 getNextSequenceNumber 4
called from stack:
1> DWDQ: [ generation: 0 ] id: 245403753 tid: t626 getNextSequenceNumber 5
called from stack:
1> DWDQ: [ generation: 0 ] id: 245403753 tid: t424 getNextSequenceNumber 6
called from stack:
1> DWDQ: [ generation: 0 ] id: 245403753 tid: t323 getNextSequenceNumber 7
called from stack:
```
However, then `getAndLock` is called again, and returns one of those
previously created threads. Meaning, they haven't been removed yet via flush
and consequently still reference the generation 0 deleteQueue
```
1> getAndLock: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_0,
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds]
1> getAndLock: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_3,
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds]
```
Both of these increment `getNextSequenceNumber` and consequently push us
over the maxSeqNo and this trips the assertion.
Later, we see flush finishing and the threads being removed:
```
1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_2,
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds] checkout to remove
1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_3,
aborted=false, numDocsInRAM=2, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds] checkout to remove
1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_6,
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds] checkout to remove
1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_4,
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds] checkout to remove
1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_5,
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds] checkout to remove
1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_1,
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds] checkout to remove
1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_0,
aborted=false, numDocsInRAM=2, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted
docIds] checkout to remove
```
Here is a branch with my debug logging & this issues replication:
https://github.com/apache/lucene/compare/main...benwtrent:lucene:test-replicate-and-debug-13127?expand=1
I honestly don't know how big a deal it is for a previous deleteQueue
generation returning seqNo from the next generation. But this is definitely a
tricky concurrency issue.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]