Hi,

Thanks, Erick, for your input. I've added GC logging, but it looked normal
when the error occurred again this morning. I was adding a large collection
(27 GB): on the first server everything went well, but the moment I created
the core on a second server, that server was almost immediately disconnected
from the cloud. This time I could pin down what seems to be the root cause
in the logs:

ERROR - 2015-12-22 09:39:29.029; [   ] org.apache.solr.common.SolrException; OverseerAutoReplicaFailoverThread had an error in its thread work loop.:org.apache.solr.common.SolrException: Error reading cluster properties
        at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:738)
        at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:153)
        at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:132)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76)
        at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308)
        at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:731)
        ... 3 more

WARN  - 2015-12-22 09:39:29.890; [   ] org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter; Keeper Exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /live_nodes
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
        at org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter.printTree(ZookeeperInfoHandler.java:581)
        at org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter.print(ZookeeperInfoHandler.java:527)
        at org.apache.solr.handler.admin.ZookeeperInfoHandler.handleRequestBody(ZookeeperInfoHandler.java:406)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
        at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:664)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:438)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:222)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
...

After that the server was marked as "gone" in the cloud graph and it
took a long time to register itself again and recover.
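
For completeness, this is the GC logging I enabled (via GC_LOG_OPTS in
solr.in.sh -- I believe the stock file ships with a very similar line, so
nothing exotic here):

  GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime"

So I should at least be able to spot any long stop-the-world pauses around
the time of the disconnect, and this morning there were none.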

I haven't changed the ZK setup yet as you suggested below. Could that fix
the problem? Do you have any other suggestions?
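
Just to check I've understood the change correctly, this is roughly what
I'd set up (hostnames are made up and the timeout is only an example, so
please correct me if I'm off):

  # zoo.cfg on each of the three dedicated ZooKeeper machines
  # (each machine also gets its own id in dataDir/myid)
  tickTime=2000
  initLimit=10
  syncLimit=5
  dataDir=/var/lib/zookeeper
  clientPort=2181
  server.1=zk1.example.com:2888:3888
  server.2=zk2.example.com:2888:3888
  server.3=zk3.example.com:2888:3888

  # solr.in.sh on the Solr-only machines
  ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"
  ZK_CLIENT_TIMEOUT="30000"

Does that match what you had in mind?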

Thanks,
John


On 21/12/15 17:39, Erick Erickson wrote:
> Right, and do note that when you _do_ hit an OOM, you really
> should restart the JVM, as nothing is _really_ certain after
> that.
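>
> A common way to enforce that (and I believe the stock bin/solr start
> script already wires up something similar through oom_solr.sh, so check
> your install before adding it yourself) is to have the JVM run a kill
> script on OOM, so the node dies fast and can be restarted cleanly instead
> of limping along, e.g. in solr.in.sh:
>
>   SOLR_OPTS="$SOLR_OPTS -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh"
>
> (the path is just an example, and the real script expects the port and
> log directory as arguments -- see bin/solr for the exact invocation).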
>
> You're right, just bumping the memory is a band-aid, but
> whatever gets you by. Lucene makes heavy use of
> MMapDirectory which uses OS memory rather than JVM
> memory, so you're robbing Peter to pay Paul when you
> allocate high percentages of the physical memory to the JVM.
> See Uwe's excellent blog here:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
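>
> To make that concrete with the rough numbers from this thread: on a box
> with 20G of physical memory, a 12G heap leaves only about 8G for the OS
> to cache index files with, which isn't much against ~80G of index, so a
> lot of reads end up going to disk.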
>
> And yeah, your "connection reset" errors may well be GC-related
> if you're getting a lot of stop-the-world GC pauses.
>
> Sounds like you inherited a system that's getting more and more
> docs added to it over time and outgrew its host, but that's a guess.
>
> And you get to deal with it over the holidays too ;)
>
> Best,
> Erick
>
> On Mon, Dec 21, 2015 at 8:33 AM, John Smith <solr-u...@remailme.net> wrote:
>> OK, great. I've eliminated OOM errors after increasing the memory
>> allocated to Solr: 12Gb out of 20Gb. It's probably not an optimal
>> setting but this is all I can have right now on the Solr machines. I'll
>> look into GC logging too.
>>
>> Turning to the Solr logs, a quick sweep revealed a lot of "Caused by:
>> java.net.SocketException: Connection reset" lines, but that isn't very
>> informative on its own. I suppose I'll have to cross-check on the
>> server(s) concerned.
>>
>> Anyway, I'll have a try at the updated setting and I'll get back to the
>> list.
>>
>> Thanks,
>> John.
>>
>>
>> On 21/12/15 17:21, Erick Erickson wrote:
>>> ZK isn't pushed all that heavily, although all things are possible. Still,
>>> for maintenance, putting ZK on separate machines is a good idea. They
>>> don't have to be very beefy machines.
>>>
>>> Look in your logs for LeaderInitiatedRecovery messages. If you find them,
>>> then _probably_ you have some issues with timeouts, often due to
>>> excessive GC pauses; turning on GC logging can help you get
>>> a handle on that.
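>>>
>>> Something as quick as this will show whether they're there (adjust the
>>> path to wherever your Solr logs live):
>>>
>>>   grep -c LeaderInitiatedRecovery /var/solr/logs/solr.log*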
>>>
>>> Another "popular" reason for nodes going into recovery is Out Of Memory
>>> errors, which are easy to hit in a system that gets set up and
>>> then has more and more docs added to it over time. You either have to
>>> move some collections to other Solr instances or give more memory to the
>>> JVM (but watch out for GC pauses and starving the OS of memory), etc.
>>>
>>> But the Solr logs are the place I'd look first for any help in understanding
>>> the root cause of nodes going into recovery.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Dec 21, 2015 at 8:04 AM, John Smith <solr-u...@remailme.net> wrote:
>>>> Thanks, I'll give it a try. Can the load on the Solr servers impair the ZK
>>>> response time in the current situation, which would cause the desync? Is
>>>> that the reason for the change?
>>>>
>>>> John.
>>>>
>>>>
>>>> On 21/12/15 16:45, Erik Hatcher wrote:
>>>>> John - the first recommendation that pops out is to run (only) 3
>>>>> zookeepers, entirely separate from the Solr servers, and then as many
>>>>> Solr servers as you need from there to scale indexing and querying to
>>>>> your needs.  Sounds like 3 ZKs + 2 Solrs is a good start, given you have
>>>>> 5 servers at your disposal.
>>>>>
>>>>>
>>>>> —
>>>>> Erik Hatcher, Senior Solutions Architect
>>>>> http://www.lucidworks.com
>>>>>
>>>>>
>>>>>
>>>>>> On Dec 21, 2015, at 10:37 AM, John Smith <solr-u...@remailme.net> wrote:
>>>>>>
>>>>>> This is my first experience with SolrCloud, so please bear with me.
>>>>>>
>>>>>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>>>>>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>>>>>> 3.4.7. There's around 80 Gb of index, some collections are rather big
>>>>>> (20Gb) and some very small. All of them have only one shard. The bigger
>>>>>> ones are almost constantly being updated (and of course queried at the
>>>>>> same time).
>>>>>>
>>>>>> I've had a huge number of errors, many different ones. At some point the
>>>>>> system seemed rather stable, but I've tried to add a few new collections
>>>>>> and things went wrong again. The usual symptom is that some cores stop
>>>>>> synchronizing; sometimes an entire server is shown as "gone" (although
>>>>>> it's still alive and well). When I add a core on a server, another (or
>>>>>> several others) often goes down on that server. Even when the system is
>>>>>> rather stable some cores are shown as recovering. When restarting a
>>>>>> server it takes a very long time (30 min at least) to fully recover.
>>>>>>
>>>>>> Some of the many errors I've got (I've skipped the warnings):
>>>>>> - org.apache.solr.common.SolrException: Error trying to proxy request
>>>>>> for url
>>>>>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>>>>>> up to try to start recovery on replica
>>>>>> - org.apache.solr.common.SolrException; Error while trying to recover.
>>>>>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>>>>>> was found after waiting
>>>>>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>>>>>> tlog=null}
>>>>>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>>>>>> after succesful recovery
>>>>>> - org.apache.solr.common.SolrException; Could not find core to call 
>>>>>> recovery
>>>>>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>>>>>> Unable to create core
>>>>>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>>>>>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>>>>>> not closed!
>>>>>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>>>>>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>>>>>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>>>>>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from 
>>>>>> shard
>>>>>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>>>>>> for connection from pool
>>>>>> - and so on...
>>>>>>
>>>>>> Any advice on where I should start? I've checked disk space, memory
>>>>>> usage, max number of open files, everything seems fine there. My guess
>>>>>> is that the configuration is rather unaltered from the defaults. I've
>>>>>> extended timeouts in Zookeeper already.
>>>>>>
>>>>>> Thanks,
>>>>>> John
>>>>>>
