I need to look at this a bit.

If either of you have the time/inclination, I have some anecdotal
evidence that if you specify

<writeLockTimeout>5000</writeLockTimeout>

in solrconfig.xml you won't see this problem. This is not a proper fix
of the root issue, but if you test this and don't see the error it'd
be a great help in me tracking this down.

Erick

On Tue, Aug 29, 2017 at 3:37 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 8/26/2017 9:53 AM, Erick Erickson wrote:
>> Setting loadOnStartup=false won't work for you in the long run,
>> although it does provide something of a hint. Setting this to false
>> means the core at that location simply has its coreDescriptor read and
>> stashed away in memory. The first time you _use_ that core an attempt
>> will be made to load it and that should fail with the write.lock
>> problem.
>>
>> There is extensive locking of core loading to prevent two threads from
>> trying to open the same core at the same time, if it were
>> fundamentally broken you wouldn't be the only person seeing this error
>> I'd guess.
>
> I had originally thought that i had loadonStartup enabled, but on second
> glance, turns out that it was disabled on all my cores.
>
> I set it to true and restarted again, hoping that would get rid of the
> issue and we would have some concrete information about triggering it.
> It didn't help -- the same problem still happens.
>
> The cores named "s1live" and "spark5live" have the "error opening new
> searcher" message in the admin UI for this run.  I see these lines in
> the log for s1live:
>
> 2017-08-29 21:58:22.467 INFO  (coreLoadExecutor-6-thread-2) [   ]
> o.a.s.c.CoreContainer Creating SolrCore 's1live' using configuration
> from instancedir /index/solr6/data/cores/s1_0, trusted=true
> 2017-08-29 21:58:23.863 INFO  (qtp1394336709-212) [   x:s1live]
> o.a.s.c.CoreContainer Creating SolrCore 's1live' using configuration
> from instancedir /index/solr6/data/cores/s1_0, trusted=true
>
> The first one is the coreLoadExecutor thread, no real surprise there.
> The second one starts with qtp, which I think makes it a query thread.
>
> Through several restarts, I have never seen a "build" core have this
> problem, it's always live cores.  I have some aggregation cores that
> have shards parameters in the request handlers.  Only live cores are
> mentioned there, and all queries (including the every-five-seconds
> health check ping queries used by haproxy) utilize those aggregation
> cores.  No requests are typically sent to "build" cores unless a full
> index rebuild is underway, which is fairly rare.
>
> My best guess for what's gone wrong is that there is some kind of race
> condition between the time when a loading core creates its searcher and
> the time when the core is actually fully loaded, and if requests come in
> for that core during that time, Solr will try to initialize another new
> searcher, instead of returning the "still loading" message that I also
> commonly see during Solr startup.  It is possible that this race
> condition only happens with distributed queries, but I'm not sure about
> that part.
>
> This idea also accounts for the fact that it is different cores with the
> problem every time -- restart timing versus query timing will rarely
> ever match up perfectly.
>
> Here is the full startup log from Solr 6.6 for the most recent run,
> which contains the two log lines I quoted above:
>
> https://www.dropbox.com/s/k1b6g0ldp9vces2/solr6_6-startup.log?dl=0
>
> With confirmation that another user is having the same problem, I've
> opened an issue.
>
> https://issues.apache.org/jira/browse/SOLR-11297
>
> Thanks,
> Shawn
>

Reply via email to