On 8/26/2017 9:53 AM, Erick Erickson wrote:
> Setting loadOnStartup=false won't work for you in the long run,
> although it does provide something of a hint. Setting this to false
> means the core at that location simply has its coreDescriptor read and
> stashed away in memory. The first time you _use_ that core an attempt
> will be made to load it and that should fail with the write.lock
> problem.
>
> There is extensive locking of core loading to prevent two threads from
> trying to open the same core at the same time, if it were
> fundamentally broken you wouldn't be the only person seeing this error
> I'd guess.

I had originally thought that I had loadOnStartup enabled, but on second glance it turns out that it was disabled on all my cores. I set it to true and restarted again, hoping that would get rid of the issue and give us some concrete information about what triggers it. It didn't help; the same problem still happens.

The cores named "s1live" and "spark5live" have the "error opening new searcher" message in the admin UI for this run. I see these lines in the log for s1live:

2017-08-29 21:58:22.467 INFO (coreLoadExecutor-6-thread-2) [ ] o.a.s.c.CoreContainer Creating SolrCore 's1live' using configuration from instancedir /index/solr6/data/cores/s1_0, trusted=true
2017-08-29 21:58:23.863 INFO (qtp1394336709-212) [ x:s1live] o.a.s.c.CoreContainer Creating SolrCore 's1live' using configuration from instancedir /index/solr6/data/cores/s1_0, trusted=true

The first one is the coreLoadExecutor thread, no real surprise there. The second one starts with qtp, which I think makes it a query thread.

Through several restarts, I have never seen a "build" core have this problem; it's always live cores. I have some aggregation cores that have shards parameters in the request handlers. Only live cores are mentioned there, and all queries (including the every-five-seconds health check ping queries used by haproxy) go through those aggregation cores. No requests are typically sent to "build" cores unless a full index rebuild is underway, which is fairly rare.

My best guess for what's gone wrong is that there is some kind of race condition between the time when a loading core creates its searcher and the time when the core is actually fully loaded. If requests come in for that core during that window, Solr tries to initialize another new searcher instead of returning the "still loading" message that I also commonly see during Solr startup. It is possible that this race condition only happens with distributed queries, but I'm not sure about that part.
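
To make that guess concrete, here is a toy model of the suspected window. It is only a sketch and is not Solr's CoreContainer code: the class, the enum, and every method name below are invented for illustration. It assumes that a request arriving while a core is still loading either gets a "still loading" answer (when the in-progress set is consulted) or falls through to a second create call, which is what the two "Creating SolrCore" log lines look like to me.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the suspected race. None of these names exist in Solr.
class CoreLoadRaceSketch {
    enum Lookup { LOADED, STILL_LOADING, CREATED_BY_REQUEST }

    private final Set<String> loading = ConcurrentHashMap.newKeySet();
    private final Map<String, Object> loaded = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> createCalls = new ConcurrentHashMap<>();
    private final boolean checkLoadingSet;

    // checkLoadingSet=false models the suspected bug; true models the guard.
    CoreLoadRaceSketch(boolean checkLoadingSet) {
        this.checkLoadingSet = checkLoadingSet;
    }

    // Startup thread: mark the core as in progress, then start creating it.
    void beginStartupLoad(String name) {
        loading.add(name);
        create(name);
    }

    // Startup thread: publish the finished core before clearing the marker,
    // so there is never a moment when the core is in neither collection.
    void finishStartupLoad(String name) {
        loaded.put(name, new Object());
        loading.remove(name);
    }

    // Request thread: decide what to do with a request for the named core.
    Lookup handleRequest(String name) {
        if (loaded.containsKey(name)) {
            return Lookup.LOADED;
        }
        if (checkLoadingSet && loading.contains(name)) {
            return Lookup.STILL_LOADING;
        }
        // Second create for the same core. On a real index this is where
        // the second opener would collide with the first one's write.lock.
        create(name);
        return Lookup.CREATED_BY_REQUEST;
    }

    private void create(String name) {
        createCalls.computeIfAbsent(name, k -> new AtomicInteger()).incrementAndGet();
    }

    int createCount(String name) {
        AtomicInteger count = createCalls.get(name);
        return count == null ? 0 : count.get();
    }

    public static void main(String[] args) {
        CoreLoadRaceSketch racy = new CoreLoadRaceSketch(false);
        racy.beginStartupLoad("s1live");
        System.out.println("unguarded: " + racy.handleRequest("s1live")
                + ", creates=" + racy.createCount("s1live"));

        CoreLoadRaceSketch guarded = new CoreLoadRaceSketch(true);
        guarded.beginStartupLoad("s1live");
        System.out.println("guarded: " + guarded.handleRequest("s1live")
                + ", creates=" + guarded.createCount("s1live"));
    }
}
```

In the unguarded case the request that lands between beginStartupLoad and finishStartupLoad causes a second create for the same core; in the guarded case it is told the core is still loading and only one create ever happens. Whether Solr's real code path has a gap like this is exactly what I don't know yet.
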
This idea also accounts for the fact that it is different cores with the problem every time -- restart timing versus query timing will rarely ever match up perfectly.

Here is the full startup log from Solr 6.6 for the most recent run, which contains the two log lines I quoted above:

https://www.dropbox.com/s/k1b6g0ldp9vces2/solr6_6-startup.log?dl=0

With confirmation that another user is having the same problem, I've opened an issue.

https://issues.apache.org/jira/browse/SOLR-11297

Thanks,
Shawn