[
https://issues.apache.org/jira/browse/SOLR-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206859#comment-17206859
]
Erick Erickson commented on SOLR-14861:
---------------------------------------
Some notes from the dev list:
Erick Erickson:
Generically, we need a mechanism so that, when we shut Solr down, we:
1> stop any new requests from being processed. IMO they should be rejected
immediately.
2> wait for all in-flight operations to complete. This could get tricky if one
of the operations is, say, optimize.
3> then shut down.
Then perhaps rework the test harness to use that mechanism rather than call
CoreContainer.shutdown() directly.
That said, I don’t have a clue how to make that happen.
David Smiley's reply:
Definitely makes sense to me -- those 3 phases of shutdown. That is what I
think CoreContainer.shutdown() should itself do: flag that shutdown has been
requested, then wait until in-flight operations complete, aided perhaps by a
Semaphore (note all requests would need to take/release it), then actually do
the shutdown.
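The flag-then-wait scheme described above might look something like the sketch below. This is a hypothetical illustration, not Solr code: the class name OperationGate is invented, and it uses a ReentrantReadWriteLock in place of the Semaphore mentioned (each in-flight operation holds the read lock; shutdown sets the flag and then acquires the write lock, which only succeeds once every operation has released):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch (not Solr code). Each in-flight operation holds the
// read lock; shutdown flips the flag so new requests are rejected, then
// takes the write lock, which only succeeds once all readers have released.
public class OperationGate {
    private final AtomicBoolean shutdownRequested = new AtomicBoolean(false);
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Called at the start of every operation (e.g. reload). */
    public void enter() {
        if (shutdownRequested.get()) {
            throw new IllegalStateException("Already closed");
        }
        lock.readLock().lock();
        // Re-check after acquiring: shutdown may have been requested
        // between the flag test and the lock acquisition.
        if (shutdownRequested.get()) {
            lock.readLock().unlock();
            throw new IllegalStateException("Already closed");
        }
    }

    /** Called in a finally block when the operation completes. */
    public void exit() {
        lock.readLock().unlock();
    }

    /**
     * Phase 1: reject new requests. Phase 2: wait for in-flight operations.
     * Phase 3 (the actual shutdown) is the caller's job once this returns true.
     */
    public boolean awaitQuiescence(long timeout, TimeUnit unit)
            throws InterruptedException {
        shutdownRequested.set(true);
        if (lock.writeLock().tryLock(timeout, unit)) {
            lock.writeLock().unlock();
            return true;
        }
        return false; // a long operation (say, optimize) did not finish in time
    }
}
```

A plain Semaphore would work equivalently: each request acquires one of N permits and shutdown acquires all N, which blocks until every in-flight request has released its permit.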
> CoreContainer shutdown needs to be aware of other ongoing operations and wait
> until they're complete
> ----------------------------------------------------------------------------------------------------
>
> Key: SOLR-14861
> URL: https://issues.apache.org/jira/browse/SOLR-14861
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Major
> Attachments: SOLR-14861.patch
>
>
> Noble and I are trying to get to the bottom of the TestBulkSchemaConcurrent
> failures and found what looks like a glaring gap in how
> CoreContainer.shutdown operates. I don't know the impact on production since
> we're shutting down anyway, but I think this is responsible for the errors in
> TestBulkSchemaConcurrent and likely behind others, especially any other test
> that fails intermittently that involves core reloads, including and
> especially any tests that exercise managed schema.
> We have clear evidence of this sequence:
> 1> some CoreContainer.reload() calls come in and get _partway_ through, in
> particular past the check at the top where CoreContainer.reload() throws an
> AlreadyClosed exception if (isShutdown).
> 2> Some CoreContainer.shutdown() threads get some processing time before the
> reloads in <1> are finished.
> 3> the threads in <1> pick back up and go wonky. I suspect there are a
> number of different things that could go wrong here, depending on how far
> through CoreContainer.shutdown() gets, and they pop out in different ways.
> Since it's my shift (Noble has to sleep sometime), I put some crude locking
> in just to test the idea: incrementing an AtomicInteger on entry to
> CoreContainer.reload, then decrementing it at the end, and spinning in
> CoreContainer.shutdown() until the AtomicInteger was back to zero. With that
> in place, 100 runs and no errors, whereas before I could never get even 10
> runs to finish without an error. This is not a proper fix at all; as it
> currently stands there are still race conditions, just much smaller windows,
> and I suspect it risks spinning forever. But it's enough to make me believe
> I finally understand what's happening.
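The crude counting hack described above could be reconstructed roughly as follows. This is a hypothetical sketch, not the attached patch: the class name and doReload() are invented stand-ins for the real CoreContainer methods.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Rough reconstruction (not the actual patch) of the crude hack described
// above: count in-flight reloads and spin in shutdown until they drain.
public class ReloadCounterHack {
    private final AtomicInteger inFlightReloads = new AtomicInteger(0);

    public void reload(String coreName) {
        inFlightReloads.incrementAndGet(); // on entry to CoreContainer.reload
        try {
            doReload(coreName); // stand-in for the real reload work
        } finally {
            inFlightReloads.decrementAndGet(); // at the end of reload
        }
    }

    public void shutdown() throws InterruptedException {
        // Spin until no reloads are in flight. As noted above, this still
        // leaves race windows (a reload can start right after the check
        // passes) and risks spinning forever if a reload never completes.
        while (inFlightReloads.get() > 0) {
            Thread.sleep(10);
        }
        // ... proceed with the real shutdown ...
    }

    private void doReload(String coreName) {
        // placeholder for the actual core reload
    }
}
```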
> I also suspect that reload is more sensitive than most operations on a core
> because it runs for a long time, but I assume other operations have the same
> potential. Shouldn't CoreContainer.shutdown() wait until no other operations
> are in flight?
> On a quick scan of CoreContainer, there are actually few places where we even
> check isShutdown, and I suspect the places where we do are ad hoc, found by
> trial and error when tests fail. We need a design rather than hit-or-miss
> hacking.
> I think that isShutdown should be replaced with something more robust. What
> that is IDK quite yet because I've been hammering at this long enough and I
> need a break.
> This is consistent with another observation about this particular test: if
> there was a sleep at the end, it wouldn't fail; all the reloads got a chance
> to finish before anything was shut down.
> An open question is how much this matters to production systems. In the
> testing case, bunches of these reloads are issued and then we immediately end
> the test and start shutting things down. It needs to be fixed if we're going
> to cut down on test failures though. Besides, it's just wrong ;)
> Assigning to myself to track. I'd be perfectly happy, now that Noble and I
> have done the hard work, for someone to swoop in and take the credit for
> fixing it ;)
> gradlew beast -Ptests.dups=10 --tests TestBulkSchemaConcurrent
> always fails for me on current code without my hack...
--
This message was sent by Atlassian Jira
(v8.3.4#803005)