I need to look at this a bit. If either of you have the time/inclination, I have some anecdotal evidence that if you specify
<writeLockTimeout>5000</writeLockTimeout> in solrconfig.xml you won't see this problem. This is not a proper fix of the root issue, but if you test this and don't see the error it'd be a great help in me tracking this down. Erick On Tue, Aug 29, 2017 at 3:37 PM, Shawn Heisey <apa...@elyograg.org> wrote: > On 8/26/2017 9:53 AM, Erick Erickson wrote: >> Setting loadOnStartup=false won't work for you in the long run, >> although it does provide something of a hint. Setting this to false >> means the core at that location simply has its coreDescriptor read and >> stashed away in memory. The first time you _use_ that core an attempt >> will be made to load it and that should fail with the write.lock >> problem. >> >> There is extensive locking of core loading to prevent two threads from >> trying to open the same core at the same time, if it were >> fundamentally broken you wouldn't be the only person seeing this error >> I'd guess. > > I had originally thought that i had loadonStartup enabled, but on second > glance, turns out that it was disabled on all my cores. > > I set it to true and restarted again, hoping that would get rid of the > issue and we would have some concrete information about triggering it. > It didn't help -- the same problem still happens. > > The cores named "s1live" and "spark5live" have the "error opening new > searcher" message in the admin UI for this run. I see these lines in > the log for s1live: > > 2017-08-29 21:58:22.467 INFO (coreLoadExecutor-6-thread-2) [ ] > o.a.s.c.CoreContainer Creating SolrCore 's1live' using configuration > from instancedir /index/solr6/data/cores/s1_0, trusted=true > 2017-08-29 21:58:23.863 INFO (qtp1394336709-212) [ x:s1live] > o.a.s.c.CoreContainer Creating SolrCore 's1live' using configuration > from instancedir /index/solr6/data/cores/s1_0, trusted=true > > The first one is the coreLoadExecutor thread, no real surprise there. > The second one starts with qtp, which I think makes it a query thread. > > Through several restarts, I have never seen a "build" core have this > problem, it's always live cores. I have some aggregation cores that > have shards parameters in the request handlers. Only live cores are > mentioned there, and all queries (including the every-five-seconds > health check ping queries used by haproxy) utilize those aggregation > cores. No requests are typically sent to "build" cores unless a full > index rebuild is underway, which is fairly rare. > > My best guess for what's gone wrong is that there is some kind of race > condition between the time when a loading core creates its searcher and > the time when the core is actually fully loaded, and if requests come in > for that core during that time, Solr will try to initialize another new > searcher, instead of returning the "still loading" message that I also > commonly see during Solr startup. It is possible that this race > condition only happens with distributed queries, but I'm not sure about > that part. > > This idea also accounts for the fact that it is different cores with the > problem every time -- restart timing versus query timing will rarely > ever match up perfectly. > > Here is the full startup log from Solr 6.6 for the most recent run, > which contains the two log lines I quoted above: > > https://www.dropbox.com/s/k1b6g0ldp9vces2/solr6_6-startup.log?dl=0 > > With confirmation that another user is having the same problem, I've > opened an issue. > > https://issues.apache.org/jira/browse/SOLR-11297 > > Thanks, > Shawn >