Do you see anything about session expiration in the logs? That is the likely culprit for something like this. You may need to raise the timeout: http://wiki.apache.org/solr/SolrCloud#FAQ

If you see no session timeouts, I don't have a guess yet.

- Mark
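(A minimal sketch of where that timeout lives in a legacy Solr 4.x solr.xml. The zkClientTimeout="30000" value mirrors the setting reported further down in this thread; the host and hostPort attributes are only needed if Solr is not auto-detecting the right address, and the hostname and port shown are placeholders, not values from the thread.)

    <?xml version="1.0" encoding="UTF-8" ?>
    <solr persistent="true">
      <!-- zkClientTimeout is the ZooKeeper session timeout in ms; default is 15000 -->
      <cores adminPath="/admin/cores"
             host="solr01.example.com" hostPort="8080"
             zkClientTimeout="30000">
        <core name="core1" instanceDir="core1" />
      </cores>
    </solr>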
On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote:

> I'm experiencing the same problem in Solr 4.1 during bulk loading. After 50
> minutes of indexing the following error starts to occur:
>
> INFO: [core] webapp=/solr path=/update params={} {} 0 4
> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the
> leader, but locally we don't think so
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
>         at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>         at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
>         at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
>         at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
>         at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
>         at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>         at org.eclipse.jetty.server.Server.handle(Server.java:365)
>         at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
>         at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>         at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
>         at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
>         at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
>         at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>         at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>         at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>         at java.lang.Thread.run(Unknown Source)
> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
> Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=50699
>
> Then the leader tries to sync with the replica, and after it finishes I can
> continue loading.
> None of the SolrCloud nodes was restarted during that time. I don't remember
> such behaviour in Solr 4.0. Could it be related to the number of fields
> indexed during loading? I have a collection with about 2400 fields. I can't
> reproduce the same issue for other collections with far fewer fields per
> record.
> Regards.
>
> On 11 December 2012 19:50, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>
>> Just an update on this issue:
>> We tried increasing the zookeeper client timeout setting to 30000 ms in
>> solr.xml (I think the default is 15000 ms), and haven't seen any issues
>> from our tests.
>> <cores ......... zkClientTimeout="30000" >
>>
>> Thanks, Sudhakar.
>>
>> On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>
>>> We saw this error again today during our load test - basically, whenever
>>> the session is expiring on the leader node, we are seeing the error.
>>> After this happens, the leader (001) goes into 'recovery' mode and all
>>> the index updates fail with a "503 - Service Unavailable" error message.
>>> After some time (once recovery is successful), the roles are swapped,
>>> i.e. 001 acting as the replica and 003 as the leader.
>>>
>>> Btw, do you know why the connection to zookeeper [solr->zk] is getting
>>> interrupted in the middle?
>>> Is it because of the load (no. of updates) we are putting on the cluster?
>>>
>>> Here is the exception stack trace:
>>>
>>> Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>>>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
>>>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
>>>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>>>         at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
>>>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
>>>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
>>>         at java.lang.Thread.run(Unknown Source)
>>>
>>> Thx, Sudhakar.
>>>
>>> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>
>>>> Erick:
>>>> Not seeing any page caching related issues...
>>>>
>>>> Mark:
>>>> 1. Would this "waiting" on 003 (the replica) cause any inconsistencies in the
>>>> zookeeper cluster state? I was also looking at the leader (001) logs at that
>>>> time and seeing errors related to "SEVERE: ClusterState says we are the
>>>> leader, but locally we don't think so".
>>>> 2. Also, all of the servers in our cluster went down while the index
>>>> updates were running in parallel with this issue. Do you think this is
>>>> related to the session expiry on 001?
>>>>
>>>> Here are the logs on 001
>>>> =========================
>>>>
>>>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>>> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
>>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>>>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>>> INFO: According to ZK I (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a leader.
>>>>
>>>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor run
>>>> WARNING: Overseer cannot talk to ZK
>>>>
>>>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>>> Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>>> Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem making a request to the leader:org.apache.solr.common.SolrException: I was asked to wait on state down for <001>:8080_solr but I still do not see the request state. I see state: active live:true
>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>> Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem making a request to the leader:org.apache.solr.common.SolrException: I was asked to wait on state down for <001>:8080_solr but I still do not see the request state. I see state: active live:true
>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>> ....
>>>> ....
>>>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>> ....
>>>> ....
>>>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
>>>> SEVERE: :org.apache.solr.common.SolrException: There was a problem finding the leader in zk
>>>>         at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
>>>>         at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
>>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
>>>> SEVERE: Error getting leader from zk
>>>> org.apache.solr.common.SolrException: There is conflicting information about the leader of shard: shard1 our state says:http://<001>:8080/solr/core1/ but zookeeper says:http://<003>:8080/solr/core1/
>>>>         at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)
>>>>         at org.apache.solr.cloud.ZkController.register(ZkController.java:577)
>>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>> INFO: Running the leader process.
>>>> ....
>>>> ....
>>>>
>>>> Thanks for your inputs.
>>>> Sudhakar.
>>>>
>>>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>
>>>>> Yes - it means that 001 went down (or more likely had its connection to
>>>>> ZooKeeper interrupted; that's what I mean about a session timeout - if the
>>>>> solr->zk link is broken for longer than the session timeout, that will
>>>>> trigger a leader election, and when the connection is reestablished, the
>>>>> node will have to recover). That waiting should stop as soon as 001 comes
>>>>> back up or reconnects to ZooKeeper.
>>>>>
>>>>> In fact, this waiting should not happen in this case - but only on
>>>>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming very
>>>>> soon!):
>>>>>
>>>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the code path
>>>>>   for a fresh cluster start rather than fail over. (Mark Miller)
>>>>>
>>>>> - Mark
>>>>>
>>>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>>>
>>>>>> Yep, after restarting, the cluster came back to a normal state. We will run
>>>>>> a couple more tests and see if we can reproduce this issue.
>>>>>>
>>>>>> Btw, I am attaching the server logs from before that 'INFO: Waiting until
>>>>>> we see more replicas' message. From the logs, we can see that the leader
>>>>>> election process started on 003, which was the replica for 001 initially.
>>>>>> Does that mean leader 001 went down at that time?
>>>>>>
>>>>>> logs on 003:
>>>>>> ========
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>> INFO: Running the leader process.
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>> INFO: Checking if I should try and be the leader.
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>> INFO: I may be the new leader - try and sync
>>>>>> 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>>>>>> WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core core=core1.
>>>>>> 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>>>>>> INFO: Sync replicas to http://<003>:8080/solr/core1/
>>>>>> 12:11:16 PM org.apache.solr.update.PeerSync sync
>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
>>>>>> 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>>>> INFO: Updating live nodes    -> this message is on 002
>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>>>>>> WARNING: PeerSync: core=core1 url=http://<003>:8080/solr exception talking to <001>:8080/solr/core1/, failed
>>>>>> org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: <001>:8080/solr/core1
>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>>>>>>         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>>>>>         at java.util.concurrent.FutureTask.run(Unknown Source)
>>>>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>>>>>         at java.util.concurrent.FutureTask.run(Unknown Source)
>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>>>         at java.lang.Thread.run(Unknown Source)
>>>>>> Caused by: java.net.SocketTimeoutException: Read timed out
>>>>>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>>>>>         at java.net.SocketInputStream.read(Unknown Source)
>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. sync failed
>>>>>> 12:11:46 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: Sync Failed
>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection
>>>>>> INFO: There is a better leader candidate than us - going back into recovery
>>>>>> 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>>>>>> INFO: Running recovery - first canceling any ongoing recovery
>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>>>>>> INFO: Starting recovery process. core=core1 recoveringAfterStartup=false
>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>>>>>> INFO: Attempting to PeerSync from <001>:8080/solr/core1/ core=core1 - recoveringAfterStartup=false
>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>> INFO: Running the leader process.
>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
>>>>>> 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179495
>>>>>> 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=178985
>>>>>> ....
>>>>>> ....
>>>>>>
>>>>>> Thanks for your help.
>>>>>> Sudhakar.
>>>>>>
>>>>>> On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>> The waiting logging had to happen on restart unless it's some kind of bug.
>>>>>>
>>>>>> Beyond that, something is off, but I have no clue why - it seems your
>>>>>> clusterstate.json is not up to date at all.
>>>>>>
>>>>>> Have you tried restarting the cluster then? Does that help at all?
>>>>>>
>>>>>> Do you see any exceptions around zookeeper session timeouts?
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Mark,
>>>>>>>
>>>>>>> Yes, I am able to access all of the nodes under each shard from the
>>>>>>> solrcloud admin UI.
>>>>>>>
>>>>>>> - *It kind of looks like the urls solrcloud is using are not accessible.
>>>>>>>   When you go to the admin page and the cloud tab, can you access the urls it
>>>>>>>   shows for each shard? That is, if you click one of the links or copy and
>>>>>>>   paste the address into a web browser, does it work?*
>>>>>>>
>>>>>>> Actually, I got these errors when my document upload task/job was running,
>>>>>>> not during a cluster restart. Also, the job ran fine for the first hour and
>>>>>>> then started throwing these errors after indexing some docs.
>>>>>>>
>>>>>>> Thx, Sudhakar.
>>>>>>>
>>>>>>> On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>
>>>>>>>> It kind of looks like the urls solrcloud is using are not accessible. When
>>>>>>>> you go to the admin page and the cloud tab, can you access the urls it
>>>>>>>> shows for each shard? That is, if you click one of the links or copy and
>>>>>>>> paste the address into a web browser, does it work?
>>>>>>>>
>>>>>>>> You may have to explicitly set the host= in solr.xml if it's not auto
>>>>>>>> detecting the right one. Make sure the ports look right too.
>>>>>>>>
>>>>>>>>> waitForReplicasToComeUp
>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1
>>>>>>>>> timeoutin=179999
>>>>>>>>
>>>>>>>> That happens when you stop the cluster and try to start it again - before
>>>>>>>> a leader is chosen, it will wait for all known replicas for a shard to come
>>>>>>>> up so that everyone can sync up and have a chance to be the best leader. So
>>>>>>>> at this point it was only finding one of 2 known replicas and waiting for
>>>>>>>> the second to come up. After a couple of minutes (configurable) it will just
>>>>>>>> continue anyway without the missing replica (if it doesn't show up).
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> We are uploading solr documents to the index in batches using 30 threads,
>>>>>>>>> with a ThreadPoolExecutor and a LinkedBlockingQueue whose max limit is set
>>>>>>>>> to 10000.
>>>>>>>>> In the code, we are using HttpSolrServer and the add(inputDoc) method to
>>>>>>>>> add docs.
>>>>>>>>> And, we have the following commit settings in solrconfig:
>>>>>>>>>
>>>>>>>>> <autoCommit>
>>>>>>>>>   <maxTime>300000</maxTime>
>>>>>>>>>   <maxDocs>10000</maxDocs>
>>>>>>>>>   <openSearcher>false</openSearcher>
>>>>>>>>> </autoCommit>
>>>>>>>>>
>>>>>>>>> <autoSoftCommit>
>>>>>>>>>   <maxTime>1000</maxTime>
>>>>>>>>> </autoSoftCommit>
>>>>>>>>>
>>>>>>>>> Cluster Details:
>>>>>>>>> ----------------------------
>>>>>>>>> solr version - 4.0
>>>>>>>>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
>>>>>>>>> numshards=2
>>>>>>>>> 001, 002, 003 are the solr nodes and these three are behind the
>>>>>>>>> loadbalancer <vip>
>>>>>>>>> 001, 003 assigned to shard1; 002 assigned to shard2
>>>>>>>>>
>>>>>>>>> Logs: Getting the errors in the below sequence after uploading some docs:
>>>>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>>>> 003
>>>>>>>>> Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
>>>>>>>>>
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.update.processor.DistributedUpdateProcessor doDefensiveChecks
>>>>>>>>> SEVERE: ClusterState says we are the leader, but locally we don't think so
>>>>>>>>>
>>>>>>>>> 003
>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying ...
>>>>>>>>>
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at <vip>/solr/core1. returned non ok status:503, message:Service Unavailable
>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: Error while trying to recover. core=core1:org.apache.solr.common.SolrException: We are not the leader
>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>>>>>>>
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: Error uploading: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at <vip>/solr/core1
>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>>>>>>>         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
>>>>>>>>>         ... 5 lines omitted ...
>>>>>>>>>         at java.lang.Thread.run(Unknown Source)
>>>>>>>>> Caused by: java.net.SocketException: Connection reset
>>>>>>>>>
>>>>>>>>> After some time, all three servers go down.
>>>>>>>>>
>>>>>>>>> Appreciate it if someone could let us know what we are missing.
>>>>>>>>>
>>>>>>>>> Thx, Sudhakar.
>>>>>>
>>>>>> <logs_error.txt>
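For anyone trying to reproduce the indexing setup described in the original report (30 worker threads, a ThreadPoolExecutor backed by a LinkedBlockingQueue capped at 10000, and HttpSolrServer.add(inputDoc)), here is a rough, untested sketch against the SolrJ 4.x API. The URL, core name, field names, and document count are placeholders, not values from this thread, and commits are left to the autoCommit/autoSoftCommit settings shown above:

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; in the thread the updates went through a load balancer <vip>.
            final HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/core1");

            // 30 threads and a bounded queue of 10000 pending tasks, as described above.
            // CallerRunsPolicy makes the submitting thread do the work when the queue is full,
            // instead of rejecting the update.
            ThreadPoolExecutor executor = new ThreadPoolExecutor(
                    30, 30, 0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>(10000),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            for (int i = 0; i < 100000; i++) {
                final int id = i;
                executor.submit(new Runnable() {
                    public void run() {
                        try {
                            SolrInputDocument inputDoc = new SolrInputDocument();
                            inputDoc.addField("id", Integer.toString(id));  // placeholder fields
                            inputDoc.addField("name", "doc-" + id);
                            server.add(inputDoc);  // no explicit commit; rely on auto(Soft)Commit
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }

            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.HOURS);
            server.shutdown();
        }
    }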