benwtrent commented on issue #13127: URL: https://github.com/apache/lucene/issues/13127#issuecomment-2264272195
OK, I added a bunch of logging and it seems like the issue is around `DWPTP#getAndLock`. I can see the following occurring, new DWPTs being created, each with the first generation deleteQueue. ``` 1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_0, aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] 1> DWDQ: [ generation: 0 ] id: 245403753 tid: t0 20 getNextSequenceNumber 1 called from stack:... 1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_1, aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] 1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_5, aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] 1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_2, aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] 1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_6, aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] 1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_3, aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] 1> new Writer: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_4, aborted=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] ``` Then reset of the deleteQueue occurs to bump its generation, and correctly, it sees 7 pending actions, and a base seqNo of 1. All ending up with a maxSeqNo of 9. ``` 1> DWDQ: [ generation: 0 ] id: 245403753 tid: t020 advanceQueue called from stack with maxSeq 9 lastSeqNo: 1 maxNumPendingOps: 7: ``` Now, we get some more `getNextSequenceNumber` calls from all the other threads that were waiting. Each called on the zeroth generation: ``` 1> DWDQ: [ generation: 0 ] id: 245403753 tid: t525 getNextSequenceNumber 2 called from stack: 1> DWDQ: [ generation: 0 ] id: 245403753 tid: t828 getNextSequenceNumber 3 called from stack: 1> DWDQ: [ generation: 0 ] id: 245403753 tid: t727 getNextSequenceNumber 4 called from stack: 1> DWDQ: [ generation: 0 ] id: 245403753 tid: t626 getNextSequenceNumber 5 called from stack: 1> DWDQ: [ generation: 0 ] id: 245403753 tid: t424 getNextSequenceNumber 6 called from stack: 1> DWDQ: [ generation: 0 ] id: 245403753 tid: t323 getNextSequenceNumber 7 called from stack: ``` However, then `getAndLock` is called again, and returns one of those previously created threads. Meaning, they haven't been removed yet via flush and consequently still reference the generation 0 deleteQueue ``` 1> getAndLock: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_0, aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] 1> getAndLock: DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_3, aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] ``` Both of these increment `getNextSequenceNumber` and consequently push us over the maxSeqNo and this trips the assertion. Later, we see flush finishing and the threads being removed: ``` 1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_2, aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] checkout to remove 1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_3, aborted=false, numDocsInRAM=2, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] checkout to remove 1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_6, aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] checkout to remove 1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_4, aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] checkout to remove 1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_5, aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] checkout to remove 1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_1, aborted=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] checkout to remove 1> DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_0, aborted=false, numDocsInRAM=2, deleteQueue=DWDQ: [ generation: 0 ], 0 deleted docIds] checkout to remove ``` Here is a branch with my debug logging & this issues replication: https://github.com/apache/lucene/compare/main...benwtrent:lucene:test-replicate-and-debug-13127?expand=1 I honestly don't know how big a deal it is for a previous deleteQueue generation returning seqNo from the next generation. But this is definitely a tricky concurrency issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org