Do you see anything about session expiration in the logs? That is the likely culprit for something like this. You may need to raise the timeout: http://wiki.apache.org/solr/SolrCloud#FAQ

If you see no session timeouts, I don't have a guess yet.

- Mark
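(A minimal sketch of where that timeout lives in a legacy Solr 4.x solr.xml. The zkClientTimeout="30000" value mirrors the setting reported further down in this thread; the host and hostPort attributes are only needed if Solr is not auto-detecting the right address, and the hostname and port shown are placeholders, not values from the thread.)

    <?xml version="1.0" encoding="UTF-8" ?>
    <solr persistent="true">
      <!-- zkClientTimeout is the ZooKeeper session timeout in ms; default is 15000 -->
      <cores adminPath="/admin/cores"
             host="solr01.example.com" hostPort="8080"
             zkClientTimeout="30000">
        <core name="core1" instanceDir="core1" />
      </cores>
    </solr>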
On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote:

> I'm experiencing the same problem in Solr 4.1 during bulk loading. After 50
> minutes of indexing the following error starts to occur:
>
> INFO: [core] webapp=/solr path=/update params={} {} 0 4
> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the
> leader, but locally we don't think so
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
>         at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>         at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
>         at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
>         at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
>         at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
>         at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>         at org.eclipse.jetty.server.Server.handle(Server.java:365)
>         at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
>         at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>         at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
>         at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
>         at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
>         at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>         at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>         at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>         at java.lang.Thread.run(Unknown Source)
> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
> Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=50699
>
> Then the leader tries to sync with the replica, and after it finishes I can
> continue loading.
> None of the SolrCloud nodes was restarted during that time. I don't remember
> such behaviour in Solr 4.0. Could it be related to the number of fields
> indexed during loading? I have a collection with about 2400 fields. I can't
> reproduce the same issue for other collections with far fewer fields per
> record.
> Regards.
>
> On 11 December 2012 19:50, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>
>> Just an update on this issue:
>> We tried increasing the zookeeper client timeout setting to 30000 ms in
>> solr.xml (I think the default is 15000 ms), and haven't seen any issues
>> from our tests.
>> <cores ......... zkClientTimeout="30000" >
>>
>> Thanks, Sudhakar.
>>
>> On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>
>>> We saw this error again today during our load test - basically, whenever
>>> the session is expiring on the leader node, we are seeing the error.
>>> After this happens, the leader (001) goes into 'recovery' mode and all
>>> the index updates fail with a "503 - Service Unavailable" error message.
>>> After some time (once recovery is successful), the roles are swapped,
>>> i.e. 001 acting as the replica and 003 as the leader.
>>>
>>> Btw, do you know why the connection to zookeeper [solr->zk] is getting
>>> interrupted in the middle?
>>> Is it because of the load (no. of updates) we are putting on the cluster?
>>>
>>> Here is the exception stack trace:
>>>
>>> Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>>>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
>>>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
>>>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>>>         at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
>>>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
>>>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
>>>         at java.lang.Thread.run(Unknown Source)
>>>
>>> Thx, Sudhakar.
>>>
>>> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>
>>>> Erick:
>>>> Not seeing any page caching related issues...
>>>>
>>>> Mark:
>>>> 1. Would this "waiting" on 003 (the replica) cause any inconsistencies in the
>>>> zookeeper cluster state? I was also looking at the leader (001) logs at that
>>>> time and seeing errors related to "SEVERE: ClusterState says we are the
>>>> leader, but locally we don't think so".
>>>> 2. Also, all of the servers in our cluster went down while the index
>>>> updates were running in parallel with this issue. Do you think this is
>>>> related to the session expiry on 001?
>>>>
>>>> Here are the logs on 001
>>>> =========================
>>>>
>>>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>>> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
>>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>>>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>>> INFO: According to ZK I (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a leader.
>>>>
>>>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor run
>>>> WARNING: Overseer cannot talk to ZK
>>>>
>>>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>>> Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>>> Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem making a request to the leader:org.apache.solr.common.SolrException: I was asked to wait on state down for <001>:8080_solr but I still do not see the request state. I see state: active live:true
>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>> Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem making a request to the leader:org.apache.solr.common.SolrException: I was asked to wait on state down for <001>:8080_solr but I still do not see the request state. I see state: active live:true
>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>> ....
>>>> ....
>>>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>> ....
>>>> ....
>>>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
>>>> SEVERE: :org.apache.solr.common.SolrException: There was a problem finding the leader in zk
>>>>         at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
>>>>         at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
>>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
>>>> SEVERE: Error getting leader from zk
>>>> org.apache.solr.common.SolrException: There is conflicting information about the leader of shard: shard1 our state says:http://<001>:8080/solr/core1/ but zookeeper says:http://<003>:8080/solr/core1/
>>>>         at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)
>>>>         at org.apache.solr.cloud.ZkController.register(ZkController.java:577)
>>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>> INFO: Running the leader process.
>>>> ....
>>>> ....
>>>>
>>>> Thanks for your inputs.
>>>> Sudhakar.
>>>>
>>>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>
>>>>> Yes - it means that 001 went down (or more likely had its connection to
>>>>> ZooKeeper interrupted; that's what I mean about a session timeout - if the
>>>>> solr->zk link is broken for longer than the session timeout, that will
>>>>> trigger a leader election, and when the connection is reestablished, the
>>>>> node will have to recover). That waiting should stop as soon as 001 comes
>>>>> back up or reconnects to ZooKeeper.
>>>>>
>>>>> In fact, this waiting should not happen in this case - but only on
>>>>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming very
>>>>> soon!):
>>>>>
>>>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the code path
>>>>>   for a fresh cluster start rather than fail over. (Mark Miller)
>>>>>
>>>>> - Mark
>>>>>
>>>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>>>
>>>>>> Yep, after restarting, the cluster came back to a normal state. We will run
>>>>>> a couple more tests and see if we can reproduce this issue.
>>>>>>
>>>>>> Btw, I am attaching the server logs from before that 'INFO: Waiting until
>>>>>> we see more replicas' message. From the logs, we can see that the leader
>>>>>> election process started on 003, which was the replica for 001 initially.
>>>>>> Does that mean leader 001 went down at that time?
>>>>>>
>>>>>> logs on 003:
>>>>>> ========
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>> INFO: Running the leader process.
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>> INFO: Checking if I should try and be the leader.
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>> INFO: I may be the new leader - try and sync
>>>>>> 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>>>>>> WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core core=core1.
>>>>>> 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>>>>>> INFO: Sync replicas to http://<003>:8080/solr/core1/
>>>>>> 12:11:16 PM org.apache.solr.update.PeerSync sync
>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
>>>>>> 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>>>> INFO: Updating live nodes    -> this message is on 002
>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>>>>>> WARNING: PeerSync: core=core1 url=http://<003>:8080/solr exception talking to <001>:8080/solr/core1/, failed
>>>>>> org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: <001>:8080/solr/core1
>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>>>>>>         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>>>>>         at java.util.concurrent.FutureTask.run(Unknown Source)
>>>>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>>>>>         at java.util.concurrent.FutureTask.run(Unknown Source)
>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>>>         at java.lang.Thread.run(Unknown Source)
>>>>>> Caused by: java.net.SocketTimeoutException: Read timed out
>>>>>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>>>>>         at java.net.SocketInputStream.read(Unknown Source)
>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. sync failed
>>>>>> 12:11:46 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: Sync Failed
>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection
>>>>>> INFO: There is a better leader candidate than us - going back into recovery
>>>>>> 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>>>>>> INFO: Running recovery - first canceling any ongoing recovery
>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>>>>>> INFO: Starting recovery process. core=core1 recoveringAfterStartup=false
>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>>>>>> INFO: Attempting to PeerSync from <001>:8080/solr/core1/ core=core1 - recoveringAfterStartup=false
>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>> INFO: Running the leader process.
>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
>>>>>> 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179495
>>>>>> 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=178985
>>>>>> ....
>>>>>> ....
>>>>>>
>>>>>> Thanks for your help.
>>>>>> Sudhakar.
>>>>>>
>>>>>> On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>> The waiting logging had to happen on restart unless it's some kind of bug.
>>>>>>
>>>>>> Beyond that, something is off, but I have no clue why - it seems your
>>>>>> clusterstate.json is not up to date at all.
>>>>>>
>>>>>> Have you tried restarting the cluster then? Does that help at all?
>>>>>>
>>>>>> Do you see any exceptions around zookeeper session timeouts?
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Mark,
>>>>>>>
>>>>>>> Yes, I am able to access all of the nodes under each shard from the
>>>>>>> solrcloud admin UI.
>>>>>>>
>>>>>>> - *It kind of looks like the urls solrcloud is using are not accessible.
>>>>>>>   When you go to the admin page and the cloud tab, can you access the urls it
>>>>>>>   shows for each shard? That is, if you click one of the links or copy and
>>>>>>>   paste the address into a web browser, does it work?*
>>>>>>>
>>>>>>> Actually, I got these errors when my document upload task/job was running,
>>>>>>> not during a cluster restart. Also, the job ran fine for the first hour and
>>>>>>> then started throwing these errors after indexing some docs.
>>>>>>>
>>>>>>> Thx, Sudhakar.
>>>>>>>
>>>>>>> On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>
>>>>>>>> It kind of looks like the urls solrcloud is using are not accessible. When
>>>>>>>> you go to the admin page and the cloud tab, can you access the urls it
>>>>>>>> shows for each shard? That is, if you click one of the links or copy and
>>>>>>>> paste the address into a web browser, does it work?
>>>>>>>>
>>>>>>>> You may have to explicitly set the host= in solr.xml if it's not auto
>>>>>>>> detecting the right one. Make sure the ports look right too.
>>>>>>>>
>>>>>>>>> waitForReplicasToComeUp
>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1
>>>>>>>>> timeoutin=179999
>>>>>>>>
>>>>>>>> That happens when you stop the cluster and try to start it again - before
>>>>>>>> a leader is chosen, it will wait for all known replicas for a shard to come
>>>>>>>> up so that everyone can sync up and have a chance to be the best leader. So
>>>>>>>> at this point it was only finding one of 2 known replicas and waiting for
>>>>>>>> the second to come up. After a couple of minutes (configurable) it will just
>>>>>>>> continue anyway without the missing replica (if it doesn't show up).
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> We are uploading solr documents to the index in batches using 30 threads,
>>>>>>>>> with a ThreadPoolExecutor and a LinkedBlockingQueue whose max limit is set
>>>>>>>>> to 10000.
>>>>>>>>> In the code, we are using HttpSolrServer and the add(inputDoc) method to
>>>>>>>>> add docs.
>>>>>>>>> And, we have the following commit settings in solrconfig:
>>>>>>>>>
>>>>>>>>> <autoCommit>
>>>>>>>>>   <maxTime>300000</maxTime>
>>>>>>>>>   <maxDocs>10000</maxDocs>
>>>>>>>>>   <openSearcher>false</openSearcher>
>>>>>>>>> </autoCommit>
>>>>>>>>>
>>>>>>>>> <autoSoftCommit>
>>>>>>>>>   <maxTime>1000</maxTime>
>>>>>>>>> </autoSoftCommit>
>>>>>>>>>
>>>>>>>>> Cluster Details:
>>>>>>>>> ----------------------------
>>>>>>>>> solr version - 4.0
>>>>>>>>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
>>>>>>>>> numshards=2
>>>>>>>>> 001, 002, 003 are the solr nodes and these three are behind the
>>>>>>>>> loadbalancer <vip>
>>>>>>>>> 001, 003 assigned to shard1; 002 assigned to shard2
>>>>>>>>>
>>>>>>>>> Logs: Getting the errors in the below sequence after uploading some docs:
>>>>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>>>> 003
>>>>>>>>> Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
>>>>>>>>>
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.update.processor.DistributedUpdateProcessor doDefensiveChecks
>>>>>>>>> SEVERE: ClusterState says we are the leader, but locally we don't think so
>>>>>>>>>
>>>>>>>>> 003
>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying ...
>>>>>>>>>
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at <vip>/solr/core1. returned non ok status:503, message:Service Unavailable
>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: Error while trying to recover. core=core1:org.apache.solr.common.SolrException: We are not the leader
>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>>>>>>>
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: Error uploading: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at <vip>/solr/core1
>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>>>>>>>         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
>>>>>>>>>         ... 5 lines omitted ...
>>>>>>>>>         at java.lang.Thread.run(Unknown Source)
>>>>>>>>> Caused by: java.net.SocketException: Connection reset
>>>>>>>>>
>>>>>>>>> After some time, all three servers go down.
>>>>>>>>>
>>>>>>>>> Appreciate it if someone could let us know what we are missing.
>>>>>>>>>
>>>>>>>>> Thx, Sudhakar.
>>>>>>
>>>>>> <logs_error.txt>
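For anyone trying to reproduce the indexing setup described in the original report (30 worker threads, a ThreadPoolExecutor backed by a LinkedBlockingQueue capped at 10000, and HttpSolrServer.add(inputDoc)), here is a rough, untested sketch against the SolrJ 4.x API. The URL, core name, field names, and document count are placeholders, not values from this thread, and commits are left to the autoCommit/autoSoftCommit settings shown above:

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; in the thread the updates went through a load balancer <vip>.
            final HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/core1");

            // 30 threads and a bounded queue of 10000 pending tasks, as described above.
            // CallerRunsPolicy makes the submitting thread do the work when the queue is full,
            // instead of rejecting the update.
            ThreadPoolExecutor executor = new ThreadPoolExecutor(
                    30, 30, 0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>(10000),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            for (int i = 0; i < 100000; i++) {
                final int id = i;
                executor.submit(new Runnable() {
                    public void run() {
                        try {
                            SolrInputDocument inputDoc = new SolrInputDocument();
                            inputDoc.addField("id", Integer.toString(id));  // placeholder fields
                            inputDoc.addField("name", "doc-" + id);
                            server.add(inputDoc);  // no explicit commit; rely on auto(Soft)Commit
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }

            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.HOURS);
            server.shutdown();
        }
    }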