On a system with about 1600 collections, each having one shard and a
replication factor of two, it took around an hour to recover completely
after an instance restart. The setup used HDFS for storage, and we
are using Solr 7.4 at the moment. The overseer queue management helped
us a lot! Before that, Solr could easily spiral to death if the queue
grew too fast. I haven't checked the logs to see what the recovery does. Is
there anything specific to look for?
During the recovery one can see how Solr is going over the replicas one
by one and never really working on more than about 5 replicas at a time,
often fewer. The replicas also seem to be processed in alphabetical order. I
believe that used to be different in older versions. I will give
the coreLoadThreads setting a try.
Hendrik
On 25.01.2019 16:51, Erick Erickson wrote:
That's just _loading_, recovery happens later so I'd
be surprised if this really made a difference, but you
never know.
I'm more interested in _why_ recovery takes so long,
and why recovery happens in the first place. It's normal
for replicas when starting up to go from down->recovering->active,
that's just part of the normal cycle. But the recovering state
should be relatively short absent having to replicate the
index from the leader.
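To put numbers on how many replicas sit in each state during a restart, the Collections API CLUSTERSTATUS response can be tallied. The following is only a rough sketch: the base URL and the node name are placeholders, and the JSON layout (cluster -> collections -> shards -> replicas -> state) is assumed to match what CLUSTERSTATUS returns on your version.

```python
import json
import urllib.request
from collections import Counter
from typing import Optional


def count_replica_states(cluster_status: dict, node_name: Optional[str] = None) -> Counter:
    """Tally replica states (active, recovering, down, ...) from a parsed
    CLUSTERSTATUS response, optionally restricted to one node."""
    states = Counter()
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll in collections.values():
        for shard in coll.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                if node_name is None or replica.get("node_name") == node_name:
                    states[replica.get("state", "unknown")] += 1
    return states


def fetch_cluster_status(base_url: str = "http://localhost:8983/solr") -> dict:
    """Fetch CLUSTERSTATUS as JSON. base_url is a placeholder for your node."""
    url = base_url + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


# Hand-written sample in the assumed response shape (not real cluster output).
sample = {
    "cluster": {
        "collections": {
            "coll_a": {"shards": {"shard1": {"replicas": {
                "core_node1": {"state": "active", "node_name": "host1:8983_solr"},
                "core_node2": {"state": "recovering", "node_name": "host2:8983_solr"},
            }}}},
        }
    }
}

print(count_replica_states(sample))
print(count_replica_states(sample, node_name="host2:8983_solr"))
```

Calling count_replica_states(fetch_cluster_status()) every few seconds while the instance comes back up would show whether the number of replicas in "recovering" really stays around 5.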
If active indexing is going on, then the replicas may have to
copy their index down from the leader. Does this happen
on a system that is not indexing?
What version of Solr? All the state changes go through
the Overseer, and there were some very significant improvements
in Solr 6.6+, see:
https://issues.apache.org/jira/browse/SOLR-10265
And can you put a number to "rather long"? There's a built-in
3 minute wait for leader election if there's no leader for
a slice. That's not relevant if the replica in recovery
belongs to a shard that already has a leader, but if you
restart your entire cluster it can come into play.
Best,
Erick
On Fri, Jan 25, 2019 at 3:32 AM Hendrik Haddorp <hendrik.hadd...@gmx.net> wrote:
Thanks, that sounds good. Didn't know that parameter.
On 25.01.2019 11:23, Vadim Ivanov wrote:
You can try to tweak solr.xml
coreLoadThreads
Specifies the number of threads that will be assigned to load cores in parallel.
https://lucene.apache.org/solr/guide/7_6/format-of-solr-xml.html
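For illustration, here is a small sketch of setting that value programmatically. It assumes coreLoadThreads lives as an <int name="coreLoadThreads"> element directly under the <solr> root, as the reference guide page above describes; the value 16 and the sample solr.xml text are made up. Note that ElementTree drops XML comments when parsing, so editing the file by hand is just as reasonable.

```python
import xml.etree.ElementTree as ET


def set_core_load_threads(solr_xml: str, threads: int) -> str:
    """Return solr.xml text with the top-level coreLoadThreads entry set."""
    root = ET.fromstring(solr_xml)
    if root.tag != "solr":
        raise ValueError("expected <solr> as the root element")
    # Reuse an existing top-level entry if present, otherwise append one.
    for child in root.findall("int"):
        if child.get("name") == "coreLoadThreads":
            node = child
            break
    else:
        node = ET.SubElement(root, "int", {"name": "coreLoadThreads"})
    node.text = str(threads)
    return ET.tostring(root, encoding="unicode")


# Made-up minimal solr.xml; a real one carries more settings.
example = '<solr><solrcloud><int name="hostPort">8983</int></solrcloud></solr>'
print(set_core_load_threads(example, 16))
```

The change only takes effect after the node is restarted, since solr.xml is read at startup.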
-----Original Message-----
From: Hendrik Haddorp [mailto:hendrik.hadd...@gmx.net]
Sent: Friday, January 25, 2019 11:39 AM
To: solr-user@lucene.apache.org
Subject: SolrCloud recovery
Hi,
I have a SolrCloud with many collections. When I restart an instance and
the replicas are recovering, I noticed that the number of replicas recovering
at any one time is usually around 5. This results in the recovery taking
rather long. Is there a configuration option that controls how many
replicas can recover in parallel?
thanks,
Hendrik