[ 
https://issues.apache.org/jira/browse/SOLR-14750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178243#comment-17178243
 ] 

Erick Erickson commented on SOLR-14750:
---------------------------------------

This is getting weird. The patch (yeah, David, I'll put up a PR sometime, but 
see below why there's no point now) runs the test fine, but then a bunch of 
other tests fail. After a lot of messing around I found:

Up near the top of CoreContainer.reload() around line 1596 are these lines:
{code:java}
      if(coreId != null && core.uniqueId != coreId) {
        //trying to reload an already unloaded core
        return;
      }
{code}
I misread "coreId" as "core", and since the rearranged code made "core == null" 
impossible, I took the null check out:
{code:java}
      if(core.uniqueId != coreId) {
        //trying to reload an already unloaded core
        return;
      }
{code}
So this test ran fine, but other tests failed. Even in the rearranged code, 
when I put the check back other tests succeeded but this test failed. I still 
think there's a possible race condition here and the code should be rearranged, 
but until I find out why this check causes the test to fail sometimes but is 
necessary for other tests to succeed, making any changes would potentially just 
be masking the real problem...
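
To make the difference between the two guards concrete, here is a minimal 
standalone sketch. It is not Solr code: FakeCore, the two helper methods, and 
the use of java.util.UUID for the id type are assumptions made only for 
illustration. It shows just the boolean logic of the two checks. Whether that 
difference is what makes the other tests fail is not established here.

```java
import java.util.UUID;

public class ReloadGuardSketch {
    // Hypothetical stand-in for a core; only the field the guard reads is modeled.
    static final class FakeCore {
        final UUID uniqueId = UUID.randomUUID();
    }

    // Guard as quoted from reload(): a null coreId never causes an early return.
    static boolean skipsWithNullCheck(FakeCore core, UUID coreId) {
        return coreId != null && core.uniqueId != coreId;
    }

    // Guard with the null check removed: a null coreId always causes an early
    // return, because core.uniqueId (non-null here) is never the same reference as null.
    static boolean skipsWithoutNullCheck(FakeCore core, UUID coreId) {
        return core.uniqueId != coreId;
    }

    public static void main(String[] args) {
        FakeCore core = new FakeCore();
        System.out.println("null coreId, with null check:    skip=" + skipsWithNullCheck(core, null));
        System.out.println("null coreId, without null check: skip=" + skipsWithoutNullCheck(core, null));
        System.out.println("matching coreId, either guard:   skip=" + skipsWithoutNullCheck(core, core.uniqueId));
    }
}
```

In this sketch, a caller that passes a null coreId proceeds with the reload 
under the original guard but returns early once the null check is removed. 
The two guards agree whenever coreId is non-null.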

Right now I'm seeing if taking out that check with no other changes causes this 
test to beast successfully. If it does, I can pursue finding the root cause 
independently of rearranging the code, _then_ change the organization of reload 
if the root cause seems orthogonal.

Digging, but slowly...

> Harden TestBulkSchemaConcurrent
> -------------------------------
>
>                 Key: SOLR-14750
>                 URL: https://issues.apache.org/jira/browse/SOLR-14750
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Tests
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: SOLR-14750.patch
>
>
> This test has been failing quite often lately. I poked around a bit and see 
> what I _think_ is evidence of a race condition in CoreContainer.reload where 
> a reload on the same core is happening from two places in close succession. 
> I'll attach a preliminary patch soon.
> Without this patch I had 25 failures out of 1,000 runs, with it 0.
> I consider this patch a WIP, putting up for comment. Well, it has nocommits 
> so... But in particular, I have to review some changes I made about which 
> name we're using for PendingCoreOps. I also want to back out my changes and 
> beast it again with some more logging to see if I can nail down that multiple 
> reloads are happening before declaring victory.
> What this does is put the name of the core we're reloading in pendingCoreOps 
> earlier in the reload process. Then the second call to reload will wait until 
> the first is completed. I also restructured it a bit because I don't like if 
> clauses that go on forever and a small else clause way down the code. I 
> inverted the test and bailed out of the method rather than fall off the end 
> after the else clause.
> One thing I don't like about this is two reloads in such rapid succession 
> seems wasteful. Even so, I can imagine that one reload gets through far 
> enough to load the schema, then a schema update changes the schema _then_ 
> calls reload. So I don't think just returning if there's a reload happening 
> on that core already is valid.
> More to come.
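
The approach in the description above (register the core name as pending 
early, so a second reload of the same core waits for the first) can be 
sketched roughly as follows. This is an illustration only, not the attached 
patch: the class and method names are hypothetical, and the real 
pendingCoreOps bookkeeping in Solr may differ.

```java
import java.util.HashSet;
import java.util.Set;

public class PendingOpsSketch {
    // Names of cores that currently have an operation in flight.
    private final Set<String> pending = new HashSet<>();

    // Blocks until no other operation is pending for this name, then claims it.
    public synchronized void waitAndAdd(String name) throws InterruptedException {
        while (pending.contains(name)) {
            wait();
        }
        pending.add(name);
    }

    // Releases the name and wakes any waiters so they can re-check.
    public synchronized void remove(String name) {
        pending.remove(name);
        notifyAll();
    }

    // Hypothetical reload: claim the name before doing any work, and release
    // it even if the work throws.
    public void reload(String name, Runnable work) throws InterruptedException {
        waitAndAdd(name);
        try {
            work.run();
        } finally {
            remove(name);
        }
    }
}
```

Note that the second caller waits and then runs its own reload rather than 
returning early, which matches the concern in the description that a schema 
update may land after the first reload has already read the schema.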


