benwtrent commented on issue #13127:
URL: https://github.com/apache/lucene/issues/13127#issuecomment-2264272195

   OK, I added a bunch of logging and it seems like the issue is around 
`DWPTP#getAndLock`. 
   
   I can see the following occurring, new DWPTs being created, each with the 
first generation deleteQueue. 
   
   ```
     1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_0, 
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds]
   1> DWDQ: [ generation: 0 ] id: 245403753 tid: t0 20 getNextSequenceNumber 1 
called from stack:...
   1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_1, 
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds]
     1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_5, 
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds]
     1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_2, 
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds]
     1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_6, 
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds]
     1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_3, 
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds]
     1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_4, 
aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds]
   ```
   
   Then reset of the deleteQueue occurs to bump its generation, and correctly, 
it sees 7 pending actions, and a base seqNo of 1. All ending up with a maxSeqNo 
of 9.
   
   ```
     1> DWDQ: [ generation: 0 ] id: 245403753 tid: t020 advanceQueue called 
from stack with maxSeq 9 lastSeqNo: 1 maxNumPendingOps: 7:
   ```
   
   Now, we get some more `getNextSequenceNumber` calls from all the other 
threads that were waiting. Each called on the zeroth generation:
   
   ```
     1> DWDQ: [ generation: 0 ] id: 245403753 tid: t525 getNextSequenceNumber 2 
called from stack:
     1> DWDQ: [ generation: 0 ] id: 245403753 tid: t828 getNextSequenceNumber 3 
called from stack:
     1> DWDQ: [ generation: 0 ] id: 245403753 tid: t727 getNextSequenceNumber 4 
called from stack:
     1> DWDQ: [ generation: 0 ] id: 245403753 tid: t626 getNextSequenceNumber 5 
called from stack:
     1> DWDQ: [ generation: 0 ] id: 245403753 tid: t424 getNextSequenceNumber 6 
called from stack:
     1> DWDQ: [ generation: 0 ] id: 245403753 tid: t323 getNextSequenceNumber 7 
called from stack:
   ```
   
   However, then `getAndLock` is called again, and returns one of those 
previously created threads. Meaning, they haven't been removed yet via flush 
and consequently still reference the generation 0 deleteQueue
   
   ```
     1> getAndLock: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_0, 
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds]
     1> getAndLock: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_3, 
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds]
   ```
   Both of these increment `getNextSequenceNumber` and consequently push us 
over the maxSeqNo and this trips the assertion.
   
   Later, we see flush finishing and the threads being removed:
   ```
   1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_2, 
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds] checkout to remove
     1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_3, 
aborted=false, numDocsInRAM=2, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds] checkout to remove
     1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_6, 
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds] checkout to remove
     1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_4, 
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds] checkout to remove
     1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_5, 
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds] checkout to remove
     1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_1, 
aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds] checkout to remove
     1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_0, 
aborted=false, numDocsInRAM=2, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted 
docIds] checkout to remove
   ```
   
   Here is a branch with my debug logging & this issues replication: 
https://github.com/apache/lucene/compare/main...benwtrent:lucene:test-replicate-and-debug-13127?expand=1
   
   I honestly don't know how big a deal it is for a previous deleteQueue 
generation returning seqNo from the next generation. But this is definitely a 
tricky concurrency issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to