OK, great. I've eliminated the OOM errors after increasing the heap
allocated to Solr to 12 GB out of the 20 GB available. That's probably
not an optimal setting, but it's all I can spare on the Solr machines
right now. I'll look into GC logging too.
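
For reference, here's roughly what I changed, a sketch of the relevant
bits of bin/solr.in.sh on our 5.x install (the particular GC flags are
just my own pick, not anything prescribed):

  SOLR_HEAP="12g"   # 12 GB heap out of the 20 GB of RAM, rest left to the OS

  # Verbose GC logging to spot long pauses (Java 7/8 HotSpot flags):
  GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
  GC_LOG_OPTS="$GC_LOG_OPTS -XX:+PrintGCApplicationStoppedTime"
  # plus -Xloggc:<path> if the start script doesn't already send GC
  # output to a file under the Solr logs directory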

Turning to the Solr logs, a quick sweep turned up a lot of "Caused by:
java.net.SocketException: Connection reset" lines, but that alone doesn't
say much. I suppose I'll have to cross-check against the logs on the
server(s) at the other end of those connections.
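
For the record, this is roughly how I'm sweeping the logs on each node;
it also picks up the LeaderInitiatedRecovery messages Erick mentioned.
The log directory below is a guess, adjust to wherever solr.log lives:

  # Count the connection resets per log file:
  grep -c "java.net.SocketException: Connection reset" /var/solr/logs/solr.log*

  # Any LeaderInitiatedRecovery activity?
  grep -n "LeaderInitiatedRecovery" /var/solr/logs/solr.log*

  # Context and timestamps, to line the resets up with GC pauses or
  # restarts on the node at the other end:
  grep -B 3 "Connection reset" /var/solr/logs/solr.log | tail -n 60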

Anyway, I'll give the updated settings a try and get back to the
list.
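
While the servers are under load I'll also keep an eye on ZooKeeper
response times, since that was my other question. Something along these
lines, where zk1-zk3 and the 2181 client port are placeholders for our
actual ensemble:

  # ZooKeeper four-letter-word checks (work on 3.4.x):
  for h in zk1 zk2 zk3; do
    printf '%s: ' "$h"
    echo ruok | nc "$h" 2181   # should answer "imok"
    echo
    echo srvr | nc "$h" 2181 | grep -E 'Mode|Latency|Outstanding'
  done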

Thanks,
John.


On 21/12/15 17:21, Erick Erickson wrote:
> ZK isn't pushed all that heavily, although all things are possible. Still,
> for maintenance, putting ZK on separate machines is a good idea. They
> don't have to be very beefy machines.
>
> Look in your logs for LeaderInitiatedRecovery messages. If you find them,
> then you _probably_ have some issues with timeouts, often due to
> excessive GC pauses; turning on GC logging can help you get
> a handle on that.
>
> Another "popular" reason for nodes going into recovery is Out Of Memory
> errors, which is easy to do in a system that gets set up and
> then more and more docs get added to it. You either have to move
> some collections to other Solr instances, get more memory to the JVM
> (but watch out for GC pauses and starving the OS's memory) etc.
>
> But the Solr logs are the place I'd look first for any help in understanding
> the root cause of nodes going into recovery.
>
> Best,
> Erick
>
> On Mon, Dec 21, 2015 at 8:04 AM, John Smith <solr-u...@remailme.net> wrote:
>> Thanks, I'll give it a try. Can the load on the Solr servers impair the ZK
>> response time in the current situation, which would cause the desync? Is
>> this the reason for the change?
>>
>> John.
>>
>>
>> On 21/12/15 16:45, Erik Hatcher wrote:
>>> John - the first recommendation that pops out is to run (only) 3
>>> ZooKeepers, entirely separate from the Solr servers, and then as many
>>> Solr servers as you need from there to scale indexing and querying to
>>> your needs.  Sounds like 3 ZKs + 2 Solrs is a good start, given you have
>>> 5 servers at your disposal.
>>>
>>>
>>> —
>>> Erik Hatcher, Senior Solutions Architect
>>> http://www.lucidworks.com
>>>
>>>
>>>
>>>> On Dec 21, 2015, at 10:37 AM, John Smith <solr-u...@remailme.net> wrote:
>>>>
>>>> This is my first experience with SolrCloud, so please bear with me.
>>>>
>>>> I've inherited a setup with 5 servers, 2 of which run ZooKeeper only
>>>> while the other 3 run SolrCloud + ZooKeeper. Versions are 5.4.0 and
>>>> 3.4.7 respectively. There's around 80 GB of index in total; some
>>>> collections are rather big (20 GB) and some very small. All of them
>>>> have only one shard. The bigger ones are almost constantly being
>>>> updated (and of course queried at the same time).
>>>>
>>>> I've had a huge number of errors, many different ones. At some point the
>>>> system seemed rather stable, but when I tried to add a few new
>>>> collections things went wrong again. The usual symptom is that some
>>>> cores stop synchronizing; sometimes an entire server is shown as "gone"
>>>> (although it's still alive and well). When I add a core on a server,
>>>> another core (or several others) on that server often goes down. Even
>>>> when the system is rather stable, some cores are shown as recovering.
>>>> When I restart a server, it takes a very long time (30 min at least) to
>>>> fully recover.
>>>>
>>>> Some of the many errors I've got (I've skipped the warnings):
>>>> - org.apache.solr.common.SolrException: Error trying to proxy request
>>>> for url
>>>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>>>> up to try to start recovery on replica
>>>> - org.apache.solr.common.SolrException; Error while trying to recover.
>>>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>>>> was found after waiting
>>>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>>>> tlog=null}
>>>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>>>> after succesful recovery
>>>> - org.apache.solr.common.SolrException; Could not find core to call 
>>>> recovery
>>>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>>>> Unable to create core
>>>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>>>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>>>> not closed!
>>>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>>>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>>>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>>>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
>>>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>>>> for connection from pool
>>>> - and so on...
>>>>
>>>> Any advice on where I should start? I've checked disk space, memory
>>>> usage and the max number of open files; everything seems fine there.
>>>> My guess is that the configuration is mostly unaltered from the
>>>> defaults. I've already extended the timeouts in ZooKeeper.
>>>>
>>>> Thanks,
>>>> John
>>>>
