ZK isn't pushed all that heavily, although all things are possible. Still,
for ease of maintenance, putting ZK on separate machines is a good idea.
They don't have to be very beefy machines.
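
For example, a bare-bones zoo.cfg for a 3-node ensemble would look
roughly like this (hostnames and dataDir are placeholders, adjust to
your environment):

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888

Each node also needs a myid file in dataDir containing its server
number. Then you point every Solr node at the whole ensemble,
something like:

    bin/solr start -c -z zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181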

Look in your logs for LeaderInitiatedRecovery messages. If you find them,
then you _probably_ have issues with timeouts, often due to excessive
GC pauses; turning on GC logging can help you get a handle on that.
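
If you're running the stock solr.in.sh that ships with 5.x, there
should already be a GC_LOG_OPTS variable for exactly that; something
along these lines (the log path is just an example, point it at your
own logs directory):

    GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime \
      -Xloggc:/var/solr/logs/solr_gc.log"

Pauses in that log that approach or exceed your ZK session timeout
are a pretty good smoking gun.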

Another "popular" reason for nodes going into recovery is OutOfMemory
errors, which are easy to hit in a system that gets set up once and
then has more and more docs added to it. You either have to move some
collections to other Solr instances, or give more memory to the JVM
(but watch out for GC pauses and for starving the OS of memory), etc.
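
Kicking up the heap can be as simple as restarting with the -m option
(8g here is just an illustration, size it to what the box can spare):

    bin/solr start -c -z <your zk hosts> -m 8g

which sets -Xms and -Xmx to the same value; or set the equivalent in
solr.in.sh. Whatever you give the heap is taken away from the OS disk
cache that Lucene leans on, so don't overshoot.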

But the Solr logs are the place I'd look first for any help in understanding
the root cause of nodes going into recovery.
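
A quick way to get started is just grepping for the recovery-related
messages (log path here assumes the default service install, adjust
as needed):

    grep -E "LeaderInitiatedRecovery|trying to recover|OutOfMemoryError" \
      /var/solr/logs/solr.log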

Best,
Erick

On Mon, Dec 21, 2015 at 8:04 AM, John Smith <solr-u...@remailme.net> wrote:
> Thanks, I'll have a try. Can the load on the Solr servers impair the zk
> response time in the current situation, which would cause the desync? Is
> this the reason for the change?
>
> John.
>
>
> On 21/12/15 16:45, Erik Hatcher wrote:
>> John - the first recommendation that pops out is to run (only) 3 zookeepers, 
>> entirely separate from Solr servers, and then as many Solr servers from 
>> there that you need to scale indexing and querying to your needs.  Sounds 
>> like 3 ZKs + 2 Solr’s is a good start, given you have 5 servers at your 
>> disposal.
>>
>>
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com <http://www.lucidworks.com/>
>>
>>
>>
>>> On Dec 21, 2015, at 10:37 AM, John Smith <solr-u...@remailme.net> wrote:
>>>
>>> This is my first experience with SolrCloud, so please bear with me.
>>>
>>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>>> 3.4.7. There's around 80 GB of index; some collections are rather big
>>> (20 GB) and some very small. All of them have only one shard. The bigger
>>> ones are almost constantly being updated (and of course queried at the
>>> same time).
>>>
>>> I've had a huge number of errors, many different ones. At some point the
>>> system seemed rather stable, but I've tried to add a few new collections
>>> and things went wrong again. The usual symptom is that some cores stop
>>> synchronizing; sometimes an entire server is shown as "gone" (although
>>> it's still alive and well). When I add a core on a server, another (or
>>> several others) often goes down on that server. Even when the system is
>>> rather stable some cores are shown as recovering. When restarting a
>>> server it takes a very long time (30 min at least) to fully recover.
>>>
>>> Some of the many errors I've got (I've skipped the warnings):
>>> - org.apache.solr.common.SolrException: Error trying to proxy request
>>> for url
>>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>>> up to try to start recovery on replica
>>> - org.apache.solr.common.SolrException; Error while trying to recover.
>>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>>> was found after waiting
>>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>>> tlog=null}
>>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>>> after succesful recovery
>>> - org.apache.solr.common.SolrException; Could not find core to call recovery
>>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>>> Unable to create core
>>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>>> not closed!
>>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
>>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>>> for connection from pool
>>> - and so on...
>>>
>>> Any advice on where I should start? I've checked disk space, memory
>>> usage, max number of open files, everything seems fine there. My guess
>>> is that the configuration is rather unaltered from the defaults. I've
>>> extended timeouts in Zookeeper already.
>>>
>>> Thanks,
>>> John
>>>
>>
>
