Amit:

The fact that "all instances are using no more than 30%...." isn't
really indicative of whether or not GC pauses are a problem. If you
have a large heap allocated to Java, then the to-be-collected objects
will build up and _eventually_ you'll have a stop-the-world GC pause
even though each time you happen to look you may be < 30%. Guessing
here...

So how much memory are you allocating to the JVM? You can have GC
stats printed to the log, see:
http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
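
For reference, a typical set of flags looks something like this (just a
sketch -- the exact options vary with the JVM version, and the log path
is a placeholder), added wherever you set your JVM options (JAVA_OPTS,
CATALINA_OPTS, or the like):

    -verbose:gc
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -Xloggc:/path/to/solr_gc.log

PrintGCApplicationStoppedTime is the interesting one; it records how long
the application threads were actually stopped for each pause.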

Does the log of the instance that can't be reached tell you anything?
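
For example, something along these lines (the log path is just a
placeholder, adjust for your install):

    grep -iE "recover|session expired|leader" /path/to/solr.log

Recovery messages or ZooKeeper session expirations around the time of the
error are usually the interesting bits.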

Best,
Erick

On Mon, Apr 27, 2015 at 1:55 PM, Amit L <amitlal...@gmail.com> wrote:
> Appreciate the response. To answer your questions:
>
> * Do you see this happen often? How often?
> It has happened twice in five days, on each of the first two days after
> deployment.
>
> * Are there any known network issues?
> There are no obvious network issues, but as these instances reside in AWS I
> cannot rule out network blips.
>
> * Do you have any idea about the GC on those replicas?
> I have been monitoring the memory usage and all instances are using no more
> than 30% of their JVM memory allocation.
>
>
>
>
> On 27 April 2015 at 21:36, Anshum Gupta <ans...@anshumgupta.net> wrote:
>
>> Looks like LeaderInitiatedRecovery, or LIR. When a leader receives a
>> document (update) but fails to successfully forward it to a replica, it
>> marks that replica as down and asks the replica to recover (hence the name,
>> Leader Initiated Recovery). It can happen for multiple reasons, e.g. a
>> network issue or a GC pause. The replica generally comes back up and syncs
>> with the leader transparently. As an end user you don't really have to worry
>> much about this, but if you want to dig deeper, here are a few questions
>> that might help us suggest what to do/look at.
>> * Do you see this happen often? How often?
>> * Are there any known network issues?
>> * Do you have any idea about the GC on those replicas?
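>>
>> For the GC question, one quick way to watch a replica's heap (assuming the
>> JDK tools are on the box; <solr-pid> is a placeholder for the Solr process
>> id):
>>
>>     jstat -gcutil <solr-pid> 5000
>>
>> That prints heap usage and cumulative GC time every 5 seconds, so time
>> spent in collections shows up as jumps in the GCT column.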
>>
>>
>> On Mon, Apr 27, 2015 at 1:25 PM, Amit L <amitlal...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > A few days ago I deployed a Solr 4.9.0 cluster, which consists of 2
>> > collections. Each collection has 1 shard with 3 replicas on 3 different
>> > machines.
>> >
>> > On the first day I noticed this error appear on the leader. Full Log -
>> > http://pastebin.com/wcPMZb0s
>> >
>> > 4/23/2015, 2:34:37 PM SEVERE SolrCmdDistributor
>> > org.apache.solr.client.solrj.SolrServerException: IOException occured
>> > when talking to server at:
>> > http://production-solrcloud-004:8080/solr/bookings_shard1_replica2
>> >
>> > 4/23/2015, 2:34:37 PM WARNING DistributedUpdateProcessor
>> > Error sending update
>> >
>> > 4/23/2015, 2:34:37 PM WARNING ZkController
>> > Leader is publishing core=bookings_shard1_replica2 state=down on behalf
>> > of un-reachable replica
>> > http://production-solrcloud-004:8080/solr/bookings_shard1_replica2/;
>> > forcePublishState? false
>> >
>> >
>> > The other 2 replicas had 0 errors.
>> >
>> > I thought it might be a one-off, but the same error occurred on day 2,
>> > which has got me slightly concerned. During these periods I didn't
>> > notice any issues with the cluster and everything looks healthy in the
>> > cloud summary. All of the instances are hosted on AWS.
>> >
>> > Any idea what may be causing this issue and what I can do to mitigate it?
>> >
>> > Thanks
>> > Amit
>> >
>>
>>
>>
>> --
>> Anshum Gupta
>>
