[ https://issues.apache.org/jira/browse/LUCENE-9337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089085#comment-17089085 ]
Simon Willnauer commented on LUCENE-9337:
-----------------------------------------

Here is a PR: https://github.com/apache/lucene-solr/pull/1443/

> CMS might miss to pickup pending merges when maxMergeCount changes while merges are running
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9337
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9337
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Simon Willnauer
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We found a test hanging in IW#forceMerge on Elastic's CI, in an
> innocent-looking test:
> {noformat}
> 14:52:06 [junit4] 2> at java.base@11.0.2/java.lang.Object.wait(Native Method)
> 14:52:06 [junit4] 2> at app//org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4722)
> 14:52:06 [junit4] 2> at app//org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2034)
> 14:52:06 [junit4] 2> at app//org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1960)
> 14:52:06 [junit4] 2> at app//org.apache.lucene.index.RandomIndexWriter.forceMerge(RandomIndexWriter.java:500)
> 14:52:06 [junit4] 2> at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.doTestNumericsVsStoredFields(BaseDocValuesFormatTestCase.java:1301)
> 14:52:06 [junit4] 2> at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.doTestNumericsVsStoredFields(BaseDocValuesFormatTestCase.java:1258)
> 14:52:06 [junit4] 2> at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.testZeroOrMin(BaseDocValuesFormatTestCase.java:2423)
> {noformat}
> After spending quite some time trying to reproduce this without any luck, I
> reviewed all the involved code again to understand possible threading issues.
> What I found is that if maxMergeCount is changed on CMS while merges are
> running, and the forceMerge is kicked off at the same time the running
> merges return, we might fail to pick up the final pending merges, which
> causes the forceMerge to hang. I was able to build a test case that is very
> likely to fail on every run without the fix. While I don't think this is a
> critical bug, given how unlikely it is to happen in practice, if it does
> happen it's basically a deadlock unless the IW sees some other change that
> kicks off a merge.
> Let me walk through the issue. Let's say we have 1 pending merge and 2 merge
> threads running on CMS. The forceMerge is already waiting for merges to
> finish. Once the first merge thread finishes, we check whether we need to
> stall it
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L580],
> but since it's a merge thread we return
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L596]
> and don't pick up another merge
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L526].
>
> Now the second running merge thread checks the condition
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L580]
> while the first one is finishing up.
> But before the first thread can actually update the internal data
> structures
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L688],
> it releases the CMS lock, so the stall method's calculation of how many
> threads are running is off. This causes the second thread to also step out
> of the maybeStall method without picking up the pending merge.
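
The interleaving described above can be replayed deterministically in a small toy model. This is only a sketch, not Lucene's ConcurrentMergeScheduler: the class `ToyScheduler`, its fields, and the starting values (a limit of 2, two registered merge threads, one pending merge) are hypothetical and chosen for illustration. The "atomic" variant shows one plausible way to avoid the stale count, namely doing the check and the bookkeeping update in a single critical section; it is an assumption made here and is not taken from the linked PR.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy replay of the stale-thread-count interleaving; NOT Lucene code. */
final class StaleMergeCountSketch {

    /** Hypothetical stand-in for the scheduler's bookkeeping. */
    static final class ToyScheduler {
        final int maxMergeCount;      // limit after it was changed (assumed value)
        int registeredThreads;        // merge threads still listed as running
        final Deque<String> pending = new ArrayDeque<>();

        ToyScheduler(int maxMergeCount, int registeredThreads, String pendingMerge) {
            this.maxMergeCount = maxMergeCount;
            this.registeredThreads = registeredThreads;
            this.pending.add(pendingMerge);
        }

        /**
         * Check made by a finishing merge thread. A merge thread is never
         * stalled: if the limit appears to be reached it simply returns
         * without taking a merge (null). Otherwise it takes the next one.
         */
        String checkAndMaybePickUp() {
            if (registeredThreads >= maxMergeCount) {
                return null;
            }
            return pending.poll();
        }

        /** Bookkeeping update: the finishing thread removes itself. */
        void deregister() {
            registeredThreads--;
        }
    }

    /** Check and deregistration happen in separate lock holds. */
    public static int racyInterleaving() {
        ToyScheduler s = new ToyScheduler(2, 2, "pending-merge");
        s.checkAndMaybePickUp(); // thread 1: 2 >= 2, leaves, releases the lock
        s.checkAndMaybePickUp(); // thread 2: count is stale, still 2 >= 2, leaves
        s.deregister();          // thread 1 updates the bookkeeping too late
        s.deregister();          // thread 2 does the same
        return s.pending.size(); // nobody is left to pick up the merge
    }

    /** Check and deregistration happen in one critical section (assumed remedy). */
    public static int atomicInterleaving() {
        ToyScheduler s = new ToyScheduler(2, 2, "pending-merge");
        s.checkAndMaybePickUp(); // thread 1: 2 >= 2, leaves ...
        s.deregister();          // ... and deregisters before giving up the lock
        s.checkAndMaybePickUp(); // thread 2: 1 < 2, picks up the pending merge
        s.deregister();
        return s.pending.size();
    }

    public static void main(String[] args) {
        System.out.println("racy: pending merges left = " + racyInterleaving());
        System.out.println("atomic: pending merges left = " + atomicInterleaving());
    }
}
```

In the racy replay one pending merge is left behind with no thread to run it, which corresponds to the hanging forceMerge; in the atomic replay the second thread sees the updated count and takes the merge.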