Thanks Erick,
We're pretty much stuck with delete-by-query, as a single run can delete a million docs.

I'll work through what you have said and also try to find the root cause of the 
recovery.




Regards

Russell Taylor



-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 22 May 2019 20:17
To: solr-user@lucene.apache.org
Subject: Re: CloudSolrClient (any version). Find the node your query has 
connected to.

You have to be a little careful here, one thing I learned relatively recently 
is that there are in-memory structures that hold pointers to _all_ 
un-searchable docs (i.e. no new searchers have been opened since the doc was 
added/updated) to support real-time get. So if you’re indexing a _lot_ of docs, 
that internal structure can grow quite large.

FWIW, delete-by-query is painful. Each one has to lock all indexing on all 
replicas while it completes. If you can use delete-by-id it’d be better.
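
For illustration, here is a minimal sketch of the delete-by-id route. It assumes you can first collect the matching ids yourself (for example by running the same query as an ordinary search that returns only the id field), and it relies on Solr's JSON update syntax, where {"delete": [...]} takes a list of ids. The helper names, the batch size of 1000 and the base URL layout are illustrative assumptions, not something from this thread.

```python
import json
import urllib.request
from typing import Iterable, Iterator, List


def batched(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists holding at most `size` ids."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[str] = []
    for doc_id in ids:
        batch.append(doc_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def delete_by_id_bodies(ids: Iterable[str], batch_size: int = 1000) -> Iterator[str]:
    """Yield one JSON update body per batch, each deleting that batch by id."""
    for batch in batched(ids, batch_size):
        yield json.dumps({"delete": batch})


def send_update(base_url: str, collection: str, body: str) -> None:
    """POST one update body to the collection's /update handler.

    `base_url` is assumed to look like http://host:8983/solr (hypothetical).
    No commit is requested here; commits are left to autoCommit or the loader.
    """
    request = urllib.request.Request(
        f"{base_url}/{collection}/update",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        response.read()
```

Sending many small delete-by-id batches keeps each request short, which is the point of avoiding one huge delete-by-query; the right batch size is something you would have to tune.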

Let’s back up a bit and look at _why_ your nodes go into recovery. Leave the 
replicas on if you can and look for “Leader Initiated Recovery” (not sure 
that’s the exact phrase, but you’ll see something very like that). If that’s 
the case, then one situation we’ve seen is that a request takes too long to 
return from a follower. So the sequence looks like this:

- leader gets update
- leader indexes locally _and_ forwards to follower
- follower is busy (and the delete-by-query could be why) and takes too long to 
respond so the request times out
- leader says “hmmm, I don’t know what happened so I’ll tell the follower to 
recover”.

Given your heavy update rate, there’ll be no chance for “peer sync” to fully 
recover so it’ll go into full recovery. That can sometimes be fixed by simply 
lengthening the timeout.

Otherwise also take a look at the logs and see if you can find a root cause for 
the replica going into recovery and we should see if we can fix that.
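
Since the exact wording of the recovery message is uncertain, one way to start is a loose, case-insensitive scan of the Solr log for anything containing "recover". This is only a rough sketch; the function names are made up for illustration, and the log path in the usage note is an assumption about your install.

```python
import re
from typing import Iterable, List, Tuple

# Deliberately loose: matches "recovery", "Recovering", "RecoveryStrategy", etc.
RECOVERY_PATTERN = re.compile(r"recover", re.IGNORECASE)


def find_recovery_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that mentions recovery."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if RECOVERY_PATTERN.search(line)
    ]


def scan_log(path: str) -> List[Tuple[int, str]]:
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, encoding="utf-8", errors="replace") as log_file:
        return find_recovery_lines(log_file)
```

Something like scan_log("/var/solr/logs/solr.log") on both the leader and the follower would give you line numbers to read around; the lines just before the first hit are usually the interesting ones.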

I didn’t ask what version of Solr you’re using, but in the 7.x code line (7.3 
IIRC) significant work was done to make recovery less likely.

Best,
Erick

> On May 22, 2019, at 10:27 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 5/22/2019 10:47 AM, Russell Taylor wrote:
>> I will add that we have set commits to be only called by the loading 
>> program. We have turned off soft and autoCommits in the solrconfig.xml.
>
> Don't turn off autoCommit.  Regular hard commits, typically with openSearcher 
> set to false so they don't interfere with change visibility, are extremely 
> important for good Solr operation.  Without them, the transaction logs will 
> grow out of control.  In addition to taking a lot of disk space, that will 
> cause a Solr restart to happen VERY slowly.  Note that a hard commit with 
> openSearcher set to false will be VERY fast -- doing them frequently is 
> usually not a problem for performance.  Sample configs in recent Solr 
> versions ship with autoCommit set to 15 seconds and openSearcher set to false.
>
> Not using autoSoftCommit is a reasonable thing to do if you do not need that 
> functionality ... but don't disable autoCommit.
>
> Thanks,
> Shawn
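
As a sketch of how the autoCommit settings described above (15 seconds, openSearcher set to false) might be turned back on without hand-editing solrconfig.xml, the following builds a Config API set-property command. It assumes the collection's /config endpoint is reachable and that both properties can be set in one command; the function names and the base URL layout are illustrative assumptions.

```python
import json
import urllib.request
from typing import Any, Dict


def autocommit_command(max_time_ms: int = 15000, open_searcher: bool = False) -> Dict[str, Any]:
    """Build a Config API command that enables hard autoCommit."""
    return {
        "set-property": {
            "updateHandler.autoCommit.maxTime": max_time_ms,
            "updateHandler.autoCommit.openSearcher": open_searcher,
        }
    }


def apply_command(base_url: str, collection: str, command: Dict[str, Any]) -> Dict[str, Any]:
    """POST the command to <base_url>/<collection>/config and return the reply.

    `base_url` is assumed to look like http://host:8983/solr (hypothetical).
    """
    request = urllib.request.Request(
        f"{base_url}/{collection}/config",
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Leaving autoSoftCommit alone here matches the advice above: visibility stays under the loader's control, while the hard commits keep the transaction logs in check.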

