[ https://issues.apache.org/jira/browse/SOLR-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206859#comment-17206859 ]
Erick Erickson commented on SOLR-14861: --------------------------------------- Some notes from the dev list: Erick Erickson: Generically, we need a mechanism that, when we shut Solr down we 1> stop any new requests from being processed. IMO they should be rejected immediately 2> wait for all in-flight operations to complete. This could get tricky if one of the operations is, say, optimize. 3> then shut down. Then perhaps rework the test harness to use that mechanism rather than call CoreContainer.shutdown() directly. That said, I don’t have a clue how to make that happen. David Smiley's reply: Definitely makes sense to me -- those 3 phases of shutdown. That is what I think CoreContainer.shutdown() should itself do. Flag that shutdown has been requested, then wait sufficiently aided by a Semaphore perhaps (note all requests would need to take/release this), then actually do shutdown. > CoreContainer shutdown needs to be aware of other ongoing operations and wait > until they're complete > ---------------------------------------------------------------------------------------------------- > > Key: SOLR-14861 > URL: https://issues.apache.org/jira/browse/SOLR-14861 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Reporter: Erick Erickson > Assignee: Erick Erickson > Priority: Major > Attachments: SOLR-14861.patch > > > Noble and I are trying to get to the bottom of the TestBulkSchemaConcurrent > failures and found what looks like a glaring gap in how > CoreContainer.shutdown operates. I don't know the impact on production since > we're shutting down anyway, but I think this is responsible for the errors in > TestBulkSchemaConcurrent and likely behind others, especially any other test > that fails intermittently that involves core reloads, including and > especially any tests that exercise managed schema. > We have clear evidence of this sequence: > 1> some CoreContainer.reloads come in and get _partway_ through, in > particular past the test at the top where CoreContainer.reload() throws an > AlreadyClosed exception if (isShutdown). > 2> Some CoreContainer.shutdown() threads get some processing time before the > reloads in <1> are finished. > 3> the threads in <1> pick back up and go wonky. I suspect that there are a > number of different things that could be going wrong here depending on how > far through CoreContainer.shutdown() gets that pop out in different ways. > Since it's my shift (Noble has to sleep sometime), I put some crude locking > in just to test the idea; incrementing an AtomicInteger on entry to > CoreContainer.reload then decrementing it at the end, and spinning in > CoreContainer.shutdown() until the AtomicInteger was back to zero. With that > in place, 100 runs and no errors whereas before I could never get even 10 > runs to finish without an error. This is not a proper fix at all, and the way > it's currently running there are still possible race conditions, just much > smaller windows. And I suspect it risks spinning forever. But it's enough to > make me believe I finally understand what's happening. > I also suspect that reload is more sensitive than most operations on a core > due to the fact that it runs for a long time, but I assume other operations > have the same potential. Shouldn't CoreContainer.shutDown() wait until no > other operations are in flight? > On a quick scan of CoreContainer, there are actually few places where we even > check for isShutdown, I suspect the places we do are ad-hoc that we've found > by trial-and-error when tests fail. We need a design rather than hit-or-miss > hacking. > I think that isShutdown should be replaced with something more robust. What > that is IDK quite yet because I've been hammering at this long enough and I > need a break. > This is consistent with another observation about this particular test. If > there's sleep at the end, it wouldn't fail; all the reloads get a chance to > finish before anything was shut down. > An open question how much this matters to production systems. In the testing > case, bunches of these reloads are issued then we immediately end the test > and start shutting things down. It needs to be fixed if we're going to cut > down on test failures though. Besides, it's just wrong ;) > Assigning to myself to track. I'd be perfectly happy, now that Noble and I > have done the hard work, for someone to swoop in and take the credit for > fixing it ;) > gradlew beast -Ptests.dups=10 --tests TestBulkSchemaConcurrent > always fails for me on current code without my hack... -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org