On 8/26/2017 9:53 AM, Erick Erickson wrote:
> Setting loadOnStartup=false won't work for you in the long run,
> although it does provide something of a hint. Setting this to false
> means the core at that location simply has its coreDescriptor read and
> stashed away in memory. The first time you _use_ that core an attempt
> will be made to load it and that should fail with the write.lock
> problem.
>
> There is extensive locking of core loading to prevent two threads from
> trying to open the same core at the same time, if it were
> fundamentally broken you wouldn't be the only person seeing this error
> I'd guess.

I had originally thought that i had loadonStartup enabled, but on second
glance, turns out that it was disabled on all my cores.

I set it to true and restarted again, hoping that would get rid of the
issue and we would have some concrete information about triggering it. 
It didn't help -- the same problem still happens.

The cores named "s1live" and "spark5live" have the "error opening new
searcher" message in the admin UI for this run.  I see these lines in
the log for s1live:

2017-08-29 21:58:22.467 INFO  (coreLoadExecutor-6-thread-2) [   ]
o.a.s.c.CoreContainer Creating SolrCore 's1live' using configuration
from instancedir /index/solr6/data/cores/s1_0, trusted=true
2017-08-29 21:58:23.863 INFO  (qtp1394336709-212) [   x:s1live]
o.a.s.c.CoreContainer Creating SolrCore 's1live' using configuration
from instancedir /index/solr6/data/cores/s1_0, trusted=true

The first one is the coreLoadExecutor thread, no real surprise there. 
The second one starts with qtp, which I think makes it a query thread.

Through several restarts, I have never seen a "build" core have this
problem, it's always live cores.  I have some aggregation cores that
have shards parameters in the request handlers.  Only live cores are
mentioned there, and all queries (including the every-five-seconds
health check ping queries used by haproxy) utilize those aggregation
cores.  No requests are typically sent to "build" cores unless a full
index rebuild is underway, which is fairly rare.

My best guess for what's gone wrong is that there is some kind of race
condition between the time when a loading core creates its searcher and
the time when the core is actually fully loaded, and if requests come in
for that core during that time, Solr will try to initialize another new
searcher, instead of returning the "still loading" message that I also
commonly see during Solr startup.  It is possible that this race
condition only happens with distributed queries, but I'm not sure about
that part.

This idea also accounts for the fact that it is different cores with the
problem every time -- restart timing versus query timing will rarely
ever match up perfectly.

Here is the full startup log from Solr 6.6 for the most recent run,
which contains the two log lines I quoted above:

https://www.dropbox.com/s/k1b6g0ldp9vces2/solr6_6-startup.log?dl=0

With confirmation that another user is having the same problem, I've
opened an issue.

https://issues.apache.org/jira/browse/SOLR-11297

Thanks,
Shawn

Reply via email to