I'm experiencing the same problem in Solr 4.1 during bulk loading. After 50
minutes of indexing, the following error starts to occur:

INFO: [core] webapp=/solr path=/update params={} {} 0 4
Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
        at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:365)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
        at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=50699

Then the leader tries to sync with the replica, and after it finishes I can
continue loading.
None of the SolrCloud nodes was restarted during that time. I don't remember
such behaviour in Solr 4.0. Could it be related to the number of fields
indexed during loading? I have a collection with about 2400 fields. I
can't reproduce the same issue for other collections with far fewer fields
per record.
Regards.

On 11 December 2012 19:50, Sudhakar Maddineni <maddineni...@gmail.com> wrote:

> Just an update on this issue:
>    We tried increasing the ZooKeeper client timeout setting to 30000ms in
> solr.xml (I think the default is 15000ms), and haven't seen any issues in
> our tests.
> <cores .........           zkClientTimeout="30000" >
>
> Thanks, Sudhakar.
>
> On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni
> <maddineni...@gmail.com>wrote:
>
> > We saw this error again today during our load test - basically, whenever
> > the session gets expired on the leader node, we are seeing the error.
> > After this happens, the leader (001) goes into 'recovery' mode and all
> > the index updates fail with a "503 - Service Unavailable" error message.
> > After some time (once recovery is successful), the roles are swapped,
> > i.e. 001 acting as the replica and 003 as leader.
> >
> > Btw, do you know why the connection to zookeeper [solr->zk] is getting
> > interrupted in the middle?
> > Is it because of the load (no. of updates) we are putting on the cluster?
> >
> > Here is the exception stack trace:
> >
> > Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> > WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
> >       at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
> >       at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> >       at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
> >       at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
> >       at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
> >       at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
> >       at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
> >       at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
> >       at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
> >       at java.lang.Thread.run(Unknown Source)
> >
> > Thx,Sudhakar.
> >
> >
> >
> > On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <
> maddineni...@gmail.com
> > > wrote:
> >
> >> Erick:
> >>   Not seeing any page caching related issues...
> >>
> >> Mark:
> >>   1. Would this "waiting" on 003 (replica) cause any inconsistencies in
> >> the zookeeper cluster state? I was also looking at the leader (001) logs
> >> at that time and seeing errors related to "SEVERE: ClusterState says we
> >> are the leader, but locally we don't think so".
> >>   2. Also, all of our servers in the cluster went down while the index
> >> updates were running in parallel along with this issue. Do you see this
> >> as related to the session expiry on 001?
> >>
> >> Here are the logs on 001
> >> =========================
> >>
> >> Dec 4, 2012 12:12:29 PM
> >> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> >> WARNING:
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> KeeperErrorCode = Session expired for /overseer_elect/leader
> >>  at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
> >> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> >>  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
> >> Dec 4, 2012 12:12:29 PM
> >> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> >> INFO: According to ZK I
> >> (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a
> leader.
> >>
> >> Dec 4, 2012 12:12:29 PM
> org.apache.solr.cloud.OverseerCollectionProcessor
> >> run
> >> WARNING: Overseer cannot talk to ZK
> >>
> >> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
> >> SEVERE: There was a problem finding the leader in
> >> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>  at
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >> at
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
> >>  Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
> >> SEVERE: There was a problem finding the leader in
> >> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>  at
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >> at
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
> >>  Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
> >> SEVERE: There was a problem making a request to the
> >> leader:org.apache.solr.common.SolrException: I was asked to wait on
> state
> >> down for <001>:8080_solr but I still do not see the request state. I see
> >> state: active live:true
> >>  at
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >>  Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
> >> SEVERE: There was a problem making a request to the
> >> leader:org.apache.solr.common.SolrException: I was asked to wait on
> state
> >> down for <001>:8080_solr but I still do not see the request state. I see
> >> state: active live:true
> >>  at
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >> ....
> >>  ....
> >> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
> >> SEVERE: There was a problem finding the leader in
> >> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>  at
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >> ....
> >>  ....
> >> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
> >> SEVERE: :org.apache.solr.common.SolrException: There was a problem
> >> finding the leader in zk
> >>  at
> >>
> org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
> >> at
> >>
> org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
> >>  Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
> >> SEVERE: Error getting leader from zk
> >> org.apache.solr.common.SolrException: There is conflicting information
> >> about the leader of shard: shard1 our state says:http://<001>:8080/solr/core1/
> >> but zookeeper says:http://<003>:8080/solr/core1/
> >>  at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)
> >>  at org.apache.solr.cloud.ZkController.register(ZkController.java:577)
> >>  Dec 4, 2012 12:22:30 PM
> >> org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >> INFO: Running the leader process.
> >>  ....
> >> ....
> >>
> >> Thanks for your inputs.
> >> Sudhakar.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmil...@gmail.com
> >wrote:
> >>
> >>> Yes - it means that 001 went down (or more likely had its connection to
> >>> ZooKeeper interrupted? that's what I mean about a session timeout - if
> >>> the solr->zk link is broken for longer than the session timeout, that
> >>> will trigger a leader election, and when the connection is reestablished,
> >>> the node will have to recover). That waiting should stop as soon as 001
> >>> came back up or reconnected to ZooKeeper.
> >>>
> >>> In fact, this waiting should not happen in this case - but only on
> >>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming
> >>> very soon!):
> >>>
> >>> * SOLR-3940: Rejoining the leader election incorrectly triggers the code
> >>>   path for a fresh cluster start rather than fail over. (Mark Miller)
> >>>
> >>> - Mark
> >>>
> >>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <maddineni...@gmail.com
> >
> >>> wrote:
> >>>
> >>> > Yep, after restarting, the cluster came back to a normal state. We
> >>> > will run a couple more tests and see if we can reproduce this issue.
> >>> >
> >>> > Btw, I am attaching the server logs before that 'INFO: Waiting until
> >>> > we see more replicas' message. From the logs, we can see that the
> >>> > leader election process started on 003, which was the replica for 001
> >>> > initially. Does that mean leader 001 went down at that time?
> >>> >
> >>> > logs on 003:
> >>> > ========
> >>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> runLeaderProcess
> >>> >         INFO: Running the leader process.
> >>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> shouldIBeLeader
> >>> >         INFO: Checking if I should try and be the leader.
> >>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> shouldIBeLeader
> >>> >         INFO: My last published State was Active, it's okay to be the
> >>> leader.
> >>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> runLeaderProcess
> >>> >         INFO: I may be the new leader - try and sync
> >>> > 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
> >>> >         WARNING: Stopping recovery for
> zkNodeName=<003>:8080_solr_core
> >>> core=core1.
> >>> > 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
> >>> >         INFO: Sync replicas to http://<003>:8080/solr/core1/
> >>> > 12:11:16 PM org.apache.solr.update.PeerSync sync
> >>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> >>> replicas=[<001>:8080/solr/core1/] nUpdates=100
> >>> > 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >>> >         INFO: Updating live nodes -> this message is on 002
> >>> > 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
> >>> >         WARNING: PeerSync: core=core1 url=http://<003>:8080/solr
> >>>  exception talking to <001>:8080/solr/core1/, failed
> >>> >         org.apache.solr.client.solrj.SolrServerException: Timeout
> >>> occured while waiting response from server at: <001>:8080/solr/core1
> >>> >               at
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
> >>> >               at
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>> >               at
> >>>
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
> >>> >               at
> >>>
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
> >>> >               at
> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
> >>> Source)
> >>> >               at java.util.concurrent.FutureTask.run(Unknown Source)
> >>> >               at
> >>> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> >>> >               at
> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
> >>> Source)
> >>> >               at java.util.concurrent.FutureTask.run(Unknown Source)
> >>> >               at
> >>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
> >>> >               at
> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> >>> >               at java.lang.Thread.run(Unknown Source)
> >>> >         Caused by: java.net.SocketTimeoutException: Read timed out
> >>> >               at java.net.SocketInputStream.socketRead0(Native
> Method)
> >>> >               at java.net.SocketInputStream.read(Unknown Source)
> >>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
> >>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE.
> >>> sync failed
> >>> > 12:11:46 PM org.apache.solr.common.SolrException log
> >>> >         SEVERE: Sync Failed
> >>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> rejoinLeaderElection
> >>> >         INFO: There is a better leader candidate than us - going back
> >>> into recovery
> >>> > 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
> >>> >         INFO: Running recovery - first canceling any ongoing recovery
> >>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
> >>> >         INFO: Starting recovery process.  core=core1
> >>> recoveringAfterStartup=false
> >>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> >>> >         INFO: Attempting to PeerSync from <001>:8080/solr/core1/
> >>> core=core1 - recoveringAfterStartup=false
> >>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
> >>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> >>> replicas=[<001>:8080/solr/core1/] nUpdates=100
> >>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> runLeaderProcess
> >>> >         INFO: Running the leader process.
> >>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> waitForReplicasToComeUp
> >>> >         INFO: Waiting until we see more replicas up: total=2 found=1
> >>> timeoutin=179999
> >>> > 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> waitForReplicasToComeUp
> >>> >         INFO: Waiting until we see more replicas up: total=2 found=1
> >>> timeoutin=179495
> >>> > 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> waitForReplicasToComeUp
> >>> >         INFO: Waiting until we see more replicas up: total=2 found=1
> >>> timeoutin=178985
> >>> > ....
> >>> > ....
> >>> >
> >>> > Thanks for your help.
> >>> > Sudhakar.
> >>> >
> >>> > On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <markrmil...@gmail.com>
> >>> wrote:
> >>> > The waiting logging had to happen on restart unless it's some kind of
> >>> bug.
> >>> >
> >>> > Beyond that, something is off, but I have no clue why - it seems your
> >>> clusterstate.json is not up to date at all.
> >>> >
> >>> > Have you tried restarting the cluster then? Does that help at all?
> >>> >
> >>> > Do you see any exceptions around zookeeper session timeouts?
> >>> >
> >>> > - Mark
> >>> >
> >>> > On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <
> maddineni...@gmail.com>
> >>> wrote:
> >>> >
> >>> > > Hey Mark,
> >>> > >
> >>> > > Yes, I am able to access all of the nodes under each shard from
> >>> solrcloud
> >>> > > admin UI.
> >>> > >
> >>> > >
> >>> > >   - It kind of looks like the urls solrcloud is using are not
> >>> > >   accessible. When you go to the admin page and the cloud tab, can
> >>> > >   you access the urls it shows for each shard? That is, if you click
> >>> > >   one of the links or copy and paste the address into a web browser,
> >>> > >   does it work?
> >>> > >
> >>> > > Actually, I got these errors when my document upload task/job was
> >>> > > running, not during the cluster restart. Also, the job ran fine
> >>> > > initially for the first hour and started throwing these errors after
> >>> > > indexing some docs.
> >>> > >
> >>> > > Thx, Sudhakar.
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmil...@gmail.com
> >
> >>> wrote:
> >>> > >
> >>> > >> It kind of looks like the urls solrcloud is using are not
> >>> > >> accessible. When you go to the admin page and the cloud tab, can you
> >>> > >> access the urls it shows for each shard? That is, if you click one
> >>> > >> of the links or copy and paste the address into a web browser, does
> >>> > >> it work?
> >>> > >>
> >>> > >> You may have to explicitly set the host= in solr.xml if it's not
> >>> > >> auto-detecting the right one. Make sure the ports look right too.
> >>> > >>
> >>> > >>> waitForReplicasToComeUp
> >>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
> >>> > >>> timeoutin=179999
> >>> > >>
> >>> > >> That happens when you stop the cluster and try to start it again -
> >>> > >> before a leader is chosen, it will wait for all known replicas for a
> >>> > >> shard to come up so that everyone can sync up and have a chance to
> >>> > >> be the best leader. So at this point it was only finding one of 2
> >>> > >> known replicas and waiting for the second to come up. After a couple
> >>> > >> of minutes (configurable) it will just continue anyway without the
> >>> > >> missing replica (if it doesn't show up).
> >>> > >>
> >>> > >> - Mark
> >>> > >>
> >>> > >> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <
> >>> maddineni...@gmail.com>
> >>> > >> wrote:
> >>> > >>
> >>> > >>> Hi,
> >>> > >>> We are uploading solr documents to the index in batches using 30
> >>> > >>> threads, with a ThreadPoolExecutor and a LinkedBlockingQueue whose
> >>> > >>> max limit is set to 10000.
> >>> > >>> In the code, we are using HttpSolrServer and the add(inputDoc)
> >>> > >>> method to add docs.
> >>> > >>> And, we have the following commit settings in solrconfig:
> >>> > >>>
> >>> > >>>    <autoCommit>
> >>> > >>>      <maxTime>300000</maxTime>
> >>> > >>>      <maxDocs>10000</maxDocs>
> >>> > >>>      <openSearcher>false</openSearcher>
> >>> > >>>    </autoCommit>
> >>> > >>>
> >>> > >>>      <autoSoftCommit>
> >>> > >>>        <maxTime>1000</maxTime>
> >>> > >>>      </autoSoftCommit>
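
A rough, hypothetical sketch of the upload pattern described above, in case it
helps anyone reproduce this - the class name, document field, driver loop and
URL are my assumptions, not the actual job code:

    import java.util.concurrent.*;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkUploader {
        public static void main(String[] args) throws Exception {
            // single shared, thread-safe client pointing at the load balancer
            final HttpSolrServer solr = new HttpSolrServer("http://<vip>/solr/core1");

            // 30 workers over a bounded queue of 10000, as described above;
            // CallerRunsPolicy applies back-pressure once the queue is full
            ExecutorService pool = new ThreadPoolExecutor(
                    30, 30, 0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>(10000),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            for (int i = 0; i < 1000000; i++) {          // illustrative driver loop
                final int id = i;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            SolrInputDocument inputDoc = new SolrInputDocument();
                            inputDoc.addField("id", Integer.toString(id));
                            solr.add(inputDoc);          // commits left to autoCommit/autoSoftCommit
                        } catch (Exception e) {
                            e.printStackTrace();         // real code should retry on 503s
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            solr.shutdown();
        }
    }

CallerRunsPolicy is just one way to get the bounded-queue behaviour described
above; the original job may handle a full queue differently.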
> >>> > >>>
> >>> > >>> Cluster Details:
> >>> > >>> ----------------------------
> >>> > >>> solr version - 4.0
> >>> > >>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
> >>> > >>> numshards=2 ,
> >>> > >>> 001, 002, 003 are the solr nodes and these three are behind the
> >>> > >>> loadbalancer  <vip>
> >>> > >>> 001, 003 assigned to shard1; 002 assigned to shard2
> >>> > >>>
> >>> > >>>
> >>> > >>> Logs: Getting the errors in the below sequence after uploading
> >>> > >>> some docs:
> >>> > >>>
> >>> > >>
> >>>
> -----------------------------------------------------------------------------------------------------------
> >>> > >>> 003
> >>> > >>> Dec 4, 2012 12:11:46 PM
> >>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>> > >>> waitForReplicasToComeUp
> >>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
> >>> > >>> timeoutin=179999
> >>> > >>>
> >>> > >>> 001
> >>> > >>> Dec 4, 2012 12:12:59 PM
> >>> > >>> org.apache.solr.update.processor.DistributedUpdateProcessor
> >>> > >>> doDefensiveChecks
> >>> > >>> SEVERE: ClusterState says we are the leader, but locally we don't
> >>> think
> >>> > >> so
> >>> > >>>
> >>> > >>> 003
> >>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> >>> > >>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed -
> >>> retrying ...
> >>> > >>>
> >>> > >>> 001
> >>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> >>> > >>> SEVERE: Error uploading: org.apache.solr.common.SolrException:
> >>> Server at
> >>> > >>> <vip>/solr/core1. returned non ok status:503, message:Service
> >>> Unavailable
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>> > >>> 001
> >>> > >>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
> >>> > >>> SEVERE: Error while trying to recover.
> >>> > >>> core=core1:org.apache.solr.common.SolrException: We are not the
> >>> leader
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >>> > >>>
> >>> > >>> 001
> >>> > >>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
> >>> > >>> SEVERE: Error uploading:
> >>> > >> org.apache.solr.client.solrj.SolrServerException:
> >>> > >>> IOException occured when talking to server at <vip>/solr/core1
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >>> > >>> at
> org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
> >>> > >>> ... 5 lines omitted ...
> >>> > >>> at java.lang.Thread.run(Unknown Source)
> >>> > >>> Caused by: java.net.SocketException: Connection reset
> >>> > >>>
> >>> > >>>
> >>> > >>> After some time, all three servers go down.
> >>> > >>>
> >>> > >>> I would appreciate it if someone could let us know what we are
> >>> > >>> missing.
> >>> > >>>
> >>> > >>> Thx,Sudhakar.
> >>> > >>
> >>> > >>
> >>> >
> >>> >
> >>> > <logs_error.txt>
> >>>
> >>>
> >>
> >
>
