Amit: The fact that "all instances are using no more than 30%..." isn't really indicative of whether GC pauses are a problem. If you have a large heap allocated to Java, collectable objects build up and _eventually_ you'll hit a stop-the-world GC pause, even though each time you happen to look you may be under 30%. Just guessing here...
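For what it's worth, a quick way to watch for that build-up on each Solr node
(just a sketch -- substitute the actual Solr pid, and pick whatever sampling
interval suits you) is jstat:

    jstat -gcutil <solr-pid> 5000

If the O (old generation) column keeps climbing between samples and the
FGC/FGCT columns jump every so often, that's exactly the build-up-then-pause
pattern, even though the point-in-time heap usage looks low.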
So how much memory are you allocating to the JVM? You can have GC stats
printed to the log, see:
http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/

Does the log of the instance that can't be reached tell you anything?

Best,
Erick

On Mon, Apr 27, 2015 at 1:55 PM, Amit L <amitlal...@gmail.com> wrote:
> Appreciate the response; to answer your questions:
>
> * Do you see this happen often? How often?
> It has happened twice in five days, the first two days after deployment.
>
> * Are there any known network issues?
> There are no obvious network issues, but as these instances reside in AWS
> I cannot rule out network blips.
>
> * Do you have any idea about the GC on those replicas?
> I have been monitoring memory usage and all instances are using no more
> than 30% of their JVM memory allocation.
>
>
> On 27 April 2015 at 21:36, Anshum Gupta <ans...@anshumgupta.net> wrote:
>
>> Looks like Leader Initiated Recovery, or LIR. When a leader receives a
>> document (update) but fails to successfully forward it to a replica, it
>> marks that replica as down and asks the replica to recover (hence the
>> name, Leader Initiated Recovery). It can happen for multiple reasons,
>> e.g. a network issue or a GC pause. The replica generally comes back up
>> and syncs with the leader transparently. As an end user you don't really
>> have to worry much about this, but if you want to dig deeper, here are a
>> few questions that might help us suggest what to do / look at:
>> * Do you see this happen often? How often?
>> * Are there any known network issues?
>> * Do you have any idea about the GC on those replicas?
>>
>>
>> On Mon, Apr 27, 2015 at 1:25 PM, Amit L <amitlal...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > A few days ago I deployed a Solr 4.9.0 cluster, which consists of 2
>> > collections. Each collection has 1 shard with 3 replicas on 3
>> > different machines.
>> >
>> > On the first day I noticed this error appear on the leader. Full log -
>> > http://pastebin.com/wcPMZb0s
>> >
>> > 4/23/2015, 2:34:37 PM SEVERE SolrCmdDistributor
>> > org.apache.solr.client.solrj.SolrServerException: IOException occured
>> > when talking to server at:
>> > http://production-solrcloud-004:8080/solr/bookings_shard1_replica2
>> >
>> > 4/23/2015, 2:34:37 PM WARNING DistributedUpdateProcessor
>> > Error sending update
>> >
>> > 4/23/2015, 2:34:37 PM WARNING ZkController
>> > Leader is publishing core=bookings_shard1_replica2 state=down on
>> > behalf of un-reachable replica
>> > http://production-solrcloud-004:8080/solr/bookings_shard1_replica2/;
>> > forcePublishState? false
>> >
>> >
>> > The other 2 replicas had 0 errors.
>> >
>> > I thought it might be a one-off, but the same error occurred on day 2,
>> > which has got me slightly concerned. During these periods I didn't
>> > notice any issues with the cluster and everything looked healthy in
>> > the cloud summary. All of the instances are hosted on AWS.
>> >
>> > Any idea what may be causing this issue and what I can do to mitigate
>> > it?
>> >
>> > Thanks
>> > Amit
>> >
>>
>>
>>
>> --
>> Anshum Gupta
>>
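For reference, on the Java 7/8 JVMs that Solr 4.x typically runs on, the GC
logging Erick points to can be enabled with startup flags along these lines
(a sketch only -- the log path is just an example, and the flag set changed
in later JVM versions):

    -verbose:gc
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -Xloggc:/var/log/solr/gc.log

Long "Total time for which application threads were stopped" entries in that
log are the stop-the-world pauses that can make a replica miss an update from
the leader and get put into Leader Initiated Recovery.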