Yep, after restarting, cluster came back to normal state.We will run couple of more tests and see if we could reproduce this issue.
Btw, I am attaching the server logs before that 'INFO: *Waiting until we see more replicas*' message.From the logs, we can see that leader election process started on 003 which was the replica for 001 initially.That means leader 001 went down at that time? logs on 003: ======== 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess INFO: Running the leader process. 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader INFO: Checking if I should try and be the leader. 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader INFO: My last published State was Active, it's okay to be the leader. 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess INFO: I may be the new leader - try and sync 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core core=core1. 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync INFO: Sync replicas to http://<003>:8080/solr/core1/ 12:11:16 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process INFO: Updating live nodes -> this message is on 002 12:11:46 PM org.apache.solr.update.PeerSync handleResponse WARNING: PeerSync: core=core1 url=http://<003>:8080/solr exception talking to <001>:8080/solr/core1/, failed org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: <001>:8080/solr/core1 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(Unknown Source) 12:11:46 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. sync failed 12:11:46 PM org.apache.solr.common.SolrException log SEVERE: Sync Failed 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection INFO: There is a better leader candidate than us - going back into recovery 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery INFO: Running recovery - first canceling any ongoing recovery 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run INFO: Starting recovery process. core=core1 recoveringAfterStartup=false 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery INFO: Attempting to PeerSync from <001>:8080/solr/core1/ core=core1 - recoveringAfterStartup=false 12:11:46 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess INFO: Running the leader process. 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: *Waiting until we see more replicas up: total=2 found=1 timeoutin=179999* 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: *Waiting until we see more replicas up: total=2 found=1 timeoutin=179495* 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: *Waiting until we see more replicas up: total=2 found=1 timeoutin=178985* .... .... Thanks for your help. Sudhakar. On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <markrmil...@gmail.com> wrote: > The waiting logging had to happen on restart unless it's some kind of bug. > > Beyond that, something is off, but I have no clue why - it seems your > clusterstate.json is not up to date at all. > > Have you tried restarting the cluster then? Does that help at all? > > Do you see any exceptions around zookeeper session timeouts? > > - Mark > > On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <maddineni...@gmail.com> > wrote: > > > Hey Mark, > > > > Yes, I am able to access all of the nodes under each shard from solrcloud > > admin UI. > > > > > > - *It kind of looks like the urls solrcloud is using are not > accessible. > > When you go to the admin page and the cloud tab, can you access the > urls it > > shows for each shard? That is, if you click on of the links or copy and > > paste the address into a web browser, does it work?* > > > > Actually, I got these errors when my document upload task/job was > running, > > not during the cluster restart. Also,job ran fine initially for the first > > one hour and started throwing these errors after indexing some docx. > > > > Thx, Sudhakar. > > > > > > > > > > On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmil...@gmail.com> > wrote: > > > >> It kind of looks like the urls solrcloud is using are not accessible. > When > >> you go to the admin page and the cloud tab, can you access the urls it > >> shows for each shard? That is, if you click on of the links or copy and > >> paste the address into a web browser, does it work? > >> > >> You may have to explicitly set the host= in solr.xml if it's not auto > >> detecting the right one. Make sure the ports like right too. > >> > >>> waitForReplicasToComeUp > >>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>> timeoutin=179999 > >> > >> That happens when you stop the cluster and try to start it again - > before > >> a leader is chosen, it will wait for all known replicas fora shard to > come > >> up so that everyone can sync up and have a chance to be the best > leader. So > >> at this point it was only finding one of 2 known replicas and waiting > for > >> the second to come up. After a couple minutes (configurable) it will > just > >> continue anyway without the missing replica (if it doesn't show up). > >> > >> - Mark > >> > >> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <maddineni...@gmail.com> > >> wrote: > >> > >>> Hi, > >>> We are uploading solr documents to the index in batches using 30 > threads > >>> and using ThreadPoolExecutor, LinkedBlockingQueue with max limit set to > >>> 10000. > >>> In the code, we are using HttpSolrServer and add(inputDoc) method to > add > >>> docx. > >>> And, we have the following commit settings in solrconfig: > >>> > >>> <autoCommit> > >>> <maxTime>300000</maxTime> > >>> <maxDocs>10000</maxDocs> > >>> <openSearcher>false</openSearcher> > >>> </autoCommit> > >>> > >>> <autoSoftCommit> > >>> <maxTime>1000</maxTime> > >>> </autoSoftCommit> > >>> > >>> Cluster Details: > >>> ---------------------------- > >>> solr version - 4.0 > >>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes] > >>> numshards=2 , > >>> 001, 002, 003 are the solr nodes and these three are behind the > >>> loadbalancer <vip> > >>> 001, 003 assigned to shard1; 002 assigned to shard2 > >>> > >>> > >>> Logs:Getting the errors in the below sequence after uploading some > docx: > >>> > >> > ----------------------------------------------------------------------------------------------------------- > >>> 003 > >>> Dec 4, 2012 12:11:46 PM > org.apache.solr.cloud.ShardLeaderElectionContext > >>> waitForReplicasToComeUp > >>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>> timeoutin=179999 > >>> > >>> 001 > >>> Dec 4, 2012 12:12:59 PM > >>> org.apache.solr.update.processor.DistributedUpdateProcessor > >>> doDefensiveChecks > >>> SEVERE: ClusterState says we are the leader, but locally we don't think > >> so > >>> > >>> 003 > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log > >>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying > ... > >>> > >>> 001 > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log > >>> SEVERE: Error uploading: org.apache.solr.common.SolrException: Server > at > >>> <vip>/solr/core1. returned non ok status:503, message:Service > Unavailable > >>> at > >>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372) > >>> at > >>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) > >>> 001 > >>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log > >>> SEVERE: Error while trying to recover. > >>> core=core1:org.apache.solr.common.SolrException: We are not the leader > >>> at > >>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401) > >>> > >>> 001 > >>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log > >>> SEVERE: Error uploading: > >> org.apache.solr.client.solrj.SolrServerException: > >>> IOException occured when talking to server at <vip>/solr/core1 > >>> at > >>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413) > >>> at > >>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) > >>> at > >>> > >> > org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117) > >>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116) > >>> ... 5 lines omitted ... > >>> at java.lang.Thread.run(Unknown Source) > >>> Caused by: java.net.SocketException: Connection reset > >>> > >>> > >>> After sometime, all the three servers are going down. > >>> > >>> Appreciate, if someone could let us know what we are missing. > >>> > >>> Thx,Sudhakar. > >> > >> > >
1 12:11:59.000 PM Dec 4, 2012 12:11:59 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=167694 host=<003> 2 12:11:59.000 PM Dec 4, 2012 12:11:59 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=167174 host=<003> 3 12:11:58.000 PM Dec 4, 2012 12:11:58 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=168205 host=<003> 4 12:11:58.000 PM Dec 4, 2012 12:11:58 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=168725 host=<003> 5 12:11:57.000 PM Dec 4, 2012 12:11:57 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=169265 host=<003> 6 12:11:57.000 PM Dec 4, 2012 12:11:57 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=169775 host=<003> 7 12:11:56.000 PM Dec 4, 2012 12:11:56 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=170285 host=<003> 8 12:11:56.000 PM Dec 4, 2012 12:11:56 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=170795 host=<003> 9 12:11:55.000 PM Dec 4, 2012 12:11:55 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=171305 host=<003> 10 12:11:55.000 PM Dec 4, 2012 12:11:55 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=171813 host=<003> 11 12:11:54.000 PM Dec 4, 2012 12:11:54 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=172335 host=<003> 12 12:11:54.000 PM Dec 4, 2012 12:11:54 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=172845 host=<003> 13 12:11:53.000 PM Dec 4, 2012 12:11:53 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=173355 host=<003> 14 12:11:53.000 PM Dec 4, 2012 12:11:53 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=173865 host=<003> 15 12:11:52.000 PM Dec 4, 2012 12:11:52 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=174375 host=<003> 16 12:11:52.000 PM Dec 4, 2012 12:11:52 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=174885 host=<003> 17 12:11:51.000 PM Dec 4, 2012 12:11:51 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=175395 host=<003> 18 12:11:51.000 PM Dec 4, 2012 12:11:51 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=175905 host=<003> 19 12:11:50.000 PM Dec 4, 2012 12:11:50 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=176415 host=<003> 20 12:11:50.000 PM Dec 4, 2012 12:11:50 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=176925 host=<003> 21 12:11:49.000 PM Dec 4, 2012 12:11:49 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=177445 host=<003> 22 12:11:49.000 PM Dec 4, 2012 12:11:49 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=177955 host=<003> 23 12:11:48.000 PM Dec 4, 2012 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=178475 host=<003> 24 12:11:48.000 PM Dec 4, 2012 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=178985 host=<003> 25 12:11:47.000 PM Dec 4, 2012 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179495 host=<003> 26 12:11:46.000 PM Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999 host=<003> 27 12:11:46.000 PM Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess INFO: Running the leader process. host=<003> 28 12:11:46.000 PM Dec 4, 2012 12:11:46 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100 host=<003> 29 12:11:46.000 PM Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery INFO: Attempting to PeerSync from <001>:8080/solr/core1/ core=core1 - recoveringAfterStartup=false host=<003> 30 12:11:46.000 PM Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run INFO: Starting recovery process. core=core1 recoveringAfterStartup=false host=<003> 31 12:11:46.000 PM Dec 4, 2012 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery INFO: Running recovery - first canceling any ongoing recovery host=<003> 32 12:11:46.000 PM Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection INFO: There is a better leader candidate than us - going back into recovery host=<003> 33 12:11:46.000 PM Dec 4, 2012 12:11:46 PM org.apache.solr.common.SolrException log SEVERE: Sync Failed host=<003> 34 12:11:46.000 PM Dec 4, 2012 12:11:46 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. sync failed host=<003> 35 12:11:46.000 PM Dec 4, 2012 12:11:46 PM org.apache.solr.update.PeerSync handleResponse WARNING: PeerSync: core=core1 url=http://<003>:8080/solr exception talking to <001>:8080/solr/core1/, failed org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: <001>:8080/solr/core1 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(Unknown Source) at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:149) at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:111) at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:264) at org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:98) at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:252) at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:282) at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:247) at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:216) at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:298) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:647) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:464) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352) ... 11 more Collapse back to 10 lines host=<003> 36 12:11:16.000 PM Dec 4, 2012 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process INFO: Updating live nodes host=<002> 37 12:11:16.000 PM Dec 4, 2012 12:11:16 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100 host=<003> 38 12:11:16.000 PM Dec 4, 2012 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync INFO: Sync replicas to http://<003>:8080/solr/core1/ host=<003> 39 12:11:16.000 PM Dec 4, 2012 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core1core=core1 host=<003> 40 12:11:16.000 PM Dec 4, 2012 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess INFO: I may be the new leader - try and sync host=<003> 41 12:11:16.000 PM Dec 4, 2012 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader INFO: My last published State was Active, it's okay to be the leader. host=<003> 42 12:11:16.000 PM Dec 4, 2012 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader INFO: Checking if I should try and be the leader. host=<003> 43 12:11:16.000 PM Dec 4, 2012 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess INFO: Running the leader process. host=<003>