I'm loading in batches: 10 threads read JSON files and load them into Solr by sending POST requests (from a couple of dozen to a couple of hundred docs per request). The POST request size limit was 1MB; when I raised it to 10MB the errors disappeared, so I suspect that was the cause. Regards.
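For anyone hitting the same thing, this is roughly the change I mean - a minimal sketch of the jetty.xml tweak, assuming the stock Jetty bundled with Solr 4.x. The attribute name (org.eclipse.jetty.server.Request.maxFormContentSize) is the one mentioned below in the thread; the value 10485760 is simply 10MB in bytes and the surrounding element layout may differ in your jetty.xml:

<Configure id="Server" class="org.eclipse.jetty.server.Server">
  <!-- ... existing server configuration ... -->
  <!-- raise the limit on POSTed form content to 10MB -->
  <Call name="setAttribute">
    <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
    <Arg type="int">10485760</Arg>
  </Call>
</Configure>

jetty.xml is only read at startup, so the Solr nodes need a restart for the new limit to take effect.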
On 3 February 2013 20:55, Mark Miller <markrmil...@gmail.com> wrote: > What led you to trying that? I'm not connecting the dots in my head - the > exception and the solution. > > - Mark > > On Feb 3, 2013, at 2:48 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote: > > > Hi, > > > > I think the issue was not in zk client timeout, but POST request size. > When > > I increased the value for Request.maxFormContentSize in jetty.xml I don't > > see this issue any more. > > > > Regards. > > > > On 3 February 2013 01:56, Mark Miller <markrmil...@gmail.com> wrote: > > > >> Do you see anything about session expiration in the logs? That is the > >> likely culprit for something like this. You may need to raise the > timeout: > >> http://wiki.apache.org/solr/SolrCloud#FAQ > >> > >> If you see no session timeouts, I don't have a guess yet. > >> > >> - Mark > >> > >> On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki <mrzewu...@gmail.com> > wrote: > >> > >>> I'm experiencing same problem in Solr4.1 during bulk loading. After 50 > >>> minutes of indexing the following error starts to occur: > >>> > >>> INFO: [core] webapp=/solr path=/update params={} {} 0 4 > >>> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log > >>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are > >> the > >>> leader, but locally we don't think so > >>> at > >>> > >> > org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295) > >>> at > >>> > >> > org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230) > >>> at > >>> > >> > org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343) > >>> at > >>> > >> > org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100) > >>> at > >>> > >> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387) > >>> at > >>> > >> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112) > >>> at > >>> > >> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96) > >>> at > >>> org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60) > >>> at > >>> > >> > org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) > >>> at > >>> > >> > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) > >>> at > >>> > >> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) > >>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816) > >>> at > >>> > >> > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448) > >>> at > >>> > >> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269) > >>> at > >>> > >> > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307) > >>> at > >>> > >> > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) > >>> at > >>> > >> > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560) > >>> at > >>> > >> > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072) > >>> at > >>> > 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382) > >>> at > >>> > >> > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) > >>> at org.eclipse.jetty.server.Server.handle(Server.java:365) > >>> at > >>> > >> > org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485) > >>> at > >>> > >> > org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) > >>> at > >>> > >> > org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937) > >>> at > >>> > >> > org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998) > >>> at > >> org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856) > >>> at > >>> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) > >>> at > >>> > >> > org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) > >>> at > >>> > >> > org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) > >>> at > >>> > >> > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) > >>> at > >>> > >> > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) > >>> at java.lang.Thread.run(Unknown Source) > >>> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log > >>> Feb 02, 2013 11:36:31 PM > org.apache.solr.cloud.ShardLeaderElectionContext > >>> waitForReplicasToComeUp > >>> INFO: Waiting until we see more replicas up: total=2 found=1 > >> timeoutin=50699 > >>> > >>> Then leader tries to sync with replica and after it finishes I can > >> continue > >>> loading. > >>> None of SolrCloud nodes was restarted during that time. I don't > remember > >>> such behaviour in Solr4.0. Could it be related with the number of > fields > >>> indexed during loading ? I have a collection with about 2400 fields. I > >>> can't reproduce same issue for other collections with much less fields > >> per > >>> record. > >>> Regards. > >>> > >>> On 11 December 2012 19:50, Sudhakar Maddineni <maddineni...@gmail.com > >>> wrote: > >>> > >>>> Just an update on this issue: > >>>> We tried by increasing zookeeper client timeout settings to 30000ms > in > >>>> solr.xml (i think default is 15000ms), and haven't seen any issues > from > >> our > >>>> tests. > >>>> <cores ......... zkClientTimeout="30000" > > >>>> > >>>> Thanks, Sudhakar. 
> >>>> > >>>> On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni > >>>> <maddineni...@gmail.com>wrote: > >>>> > >>>>> We saw this error again today during our load test - basically, > >> whenever > >>>>> session is getting expired on the leader node, we are seeing the > >>>>> error.After this happens, leader(001) is going into 'recovery' mode > and > >>>> all > >>>>> the index updates are failing with "503- service unavailable" error > >>>>> message.After some time(once recovery is successful), roles are > swapped > >>>>> i.e. 001 acting as the replica and 003 as leader. > >>>>> > >>>>> Btw, do you know why the connection to zookeeper[solr->zk] getting > >>>>> interrupted in the middle? > >>>>> is it because of the load(no of updates) we are putting on the > cluster? > >>>>> > >>>>> Here is the exception stack trace: > >>>>> > >>>>> *Dec* *7*, *2012* *2:28:03* *PM* > >>>> *org.apache.solr.cloud.Overseer$ClusterStateUpdater* *amILeader* > >>>>> *WARNING:* > >>>> *org.apache.zookeeper.KeeperException$SessionExpiredException:* > >>>> *KeeperErrorCode* *=* *Session* *expired* *for* > */overseer_elect/leader* > >>>>> *at* > >>>> > >> > *org.apache.zookeeper.KeeperException.create*(*KeeperException.java:118*) > >>>>> *at* > >>>> > *org.apache.zookeeper.KeeperException.create*(*KeeperException.java:42*) > >>>>> *at* > *org.apache.zookeeper.ZooKeeper.getData*(*ZooKeeper.java:927* > >>>>> ) > >>>>> *at* > >>>> > >> > *org.apache.solr.common.cloud.SolrZkClient$7.execute*(*SolrZkClient.java:244*) > >>>>> *at* > >>>> > >> > *org.apache.solr.common.cloud.SolrZkClient$7.execute*(*SolrZkClient.java:241*) > >>>>> *at* > >>>> > >> > *org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation*(*ZkCmdExecutor.java:63*) > >>>>> *at* > >>>> > >> > *org.apache.solr.common.cloud.SolrZkClient.getData*(*SolrZkClient.java:241*) > >>>>> *at* > >>>> > >> > *org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader*(*Overseer.java:195*) > >>>>> *at* > >>>> > >> > *org.apache.solr.cloud.Overseer$ClusterStateUpdater.run*(*Overseer.java:119*) > >>>>> *at* *java.lang.Thread.run*(*Unknown* *Source*) > >>>>> > >>>>> Thx,Sudhakar. > >>>>> > >>>>> > >>>>> > >>>>> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni < > >>>> maddineni...@gmail.com > >>>>>> wrote: > >>>>> > >>>>>> Erick: > >>>>>> Not seeing any page caching related issues... > >>>>>> > >>>>>> Mark: > >>>>>> 1.Would this "waiting" on 003(replica) cause any inconsistencies in > >>>> the > >>>>>> zookeeper cluster state? I was also looking at the leader(001) logs > at > >>>> that > >>>>>> time and seeing errors related to "*SEVERE: ClusterState says we are > >> the > >>>>>> leader, but locally we don't think so*". > >>>>>> 2.Also, all of our servers in cluster were gone down when the index > >>>>>> updates were running in parallel along with this issue.Do you see > this > >>>>>> related to the session expiry on 001? 
> >>>>>> > >>>>>> Here are the logs on 001 > >>>>>> ========================= > >>>>>> > >>>>>> Dec 4, 2012 12:12:29 PM > >>>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader > >>>>>> WARNING: > >>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: > >>>>>> KeeperErrorCode = Session expired for /overseer_elect/leader > >>>>>> at > >>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:118) > >>>>>> at > >> org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > >>>>>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927) > >>>>>> Dec 4, 2012 12:12:29 PM > >>>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader > >>>>>> INFO: According to ZK I > >>>>>> (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a > >>>> leader. > >>>>>> > >>>>>> Dec 4, 2012 12:12:29 PM > >>>> org.apache.solr.cloud.OverseerCollectionProcessor > >>>>>> run > >>>>>> WARNING: Overseer cannot talk to ZK > >>>>>> > >>>>>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: There was a problem finding the leader in > >>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props > >>>>>> at > >>>>>> > >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709) > >>>>>> at > >>>>>> > >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673) > >>>>>> Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: There was a problem finding the leader in > >>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props > >>>>>> at > >>>>>> > >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709) > >>>>>> at > >>>>>> > >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673) > >>>>>> Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: There was a problem making a request to the > >>>>>> leader:org.apache.solr.common.SolrException: I was asked to wait on > >>>> state > >>>>>> down for <001>:8080_solr but I still do not see the request state. I > >> see > >>>>>> state: active live:true > >>>>>> at > >>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401) > >>>>>> Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: There was a problem making a request to the > >>>>>> leader:org.apache.solr.common.SolrException: I was asked to wait on > >>>> state > >>>>>> down for <001>:8080_solr but I still do not see the request state. I > >> see > >>>>>> state: active live:true > >>>>>> at > >>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401) > >>>>>> .... > >>>>>> .... > >>>>>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: There was a problem finding the leader in > >>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props > >>>>>> at > >>>>>> > >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709) > >>>>>> .... > >>>>>> .... 
> >>>>>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: :org.apache.solr.common.SolrException: There was a problem > >>>>>> finding the leader in zk > >>>>>> at > >>>>>> > >>>> > >> > org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080) > >>>>>> at > >>>>>> > >>>> > >> > org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273) > >>>>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader > >>>>>> SEVERE: Error getting leader from zk > >>>>>> org.apache.solr.common.SolrException: *There is conflicting > >> information > >>>>>> about the leader of shard: shard1 our state says:http:// > >>>> <001>:8080/solr/core1/ > >>>>>> but zookeeper says:http://<003>:8080/solr/core1/* > >>>>>> * at > >>>> org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)* > >>>>>> * at > >> org.apache.solr.cloud.ZkController.register(ZkController.java:577)* > >>>>>> Dec 4, 2012 12:22:30 PM > >>>>>> org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess > >>>>>> INFO: Running the leader process. > >>>>>> .... > >>>>>> .... > >>>>>> > >>>>>> Thanks for your inputs. > >>>>>> Sudhakar. > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmil...@gmail.com > >>>>> wrote: > >>>>>> > >>>>>>> Yes - it means that 001 went down (or more likely had it's > connection > >>>> to > >>>>>>> ZooKeeper interrupted? that's what I mean about a session timeout - > >> if > >>>> the > >>>>>>> solr->zk link is broken for longer than the session timeout that > will > >>>>>>> trigger a leader election and when the connection is reestablished, > >> the > >>>>>>> node will have to recover). That waiting should stop as soon as 001 > >>>> came > >>>>>>> back up or reconnected to ZooKeeper. > >>>>>>> > >>>>>>> In fact, this waiting should not happen in this case - but only on > >>>>>>> cluster restart. This is a bug that is fixed in 4.1 (hopefully > coming > >>>> very > >>>>>>> soon!): > >>>>>>> > >>>>>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the > >>>> code > >>>>>>> path > >>>>>>> for a fresh cluster start rather than fail over. (Mark Miller) > >>>>>>> > >>>>>>> - Mark > >>>>>>> > >>>>>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni < > >> maddineni...@gmail.com > >>>>> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Yep, after restarting, cluster came back to normal state.We will > run > >>>>>>> couple of more tests and see if we could reproduce this issue. > >>>>>>>> > >>>>>>>> Btw, I am attaching the server logs before that 'INFO: Waiting > until > >>>>>>> we see more replicas' message.From the logs, we can see that > leader > >>>>>>> election process started on 003 which was the replica for 001 > >>>>>>> initially.That means leader 001 went down at that time? > >>>>>>>> > >>>>>>>> logs on 003: > >>>>>>>> ======== > >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> runLeaderProcess > >>>>>>>> INFO: Running the leader process. > >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> shouldIBeLeader > >>>>>>>> INFO: Checking if I should try and be the leader. > >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> shouldIBeLeader > >>>>>>>> INFO: My last published State was Active, it's okay to be > the > >>>>>>> leader. 
> >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> runLeaderProcess > >>>>>>>> INFO: I may be the new leader - try and sync > >>>>>>>> 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close > >>>>>>>> WARNING: Stopping recovery for > >>>> zkNodeName=<003>:8080_solr_core > >>>>>>> core=core1. > >>>>>>>> 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync > >>>>>>>> INFO: Sync replicas to http://<003>:8080/solr/core1/ > >>>>>>>> 12:11:16 PM org.apache.solr.update.PeerSync sync > >>>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr START > >>>>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100 > >>>>>>>> 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process > >>>>>>>> INFO: Updating live nodes -> this message is on 002 > >>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync handleResponse > >>>>>>>> WARNING: PeerSync: core=core1 url=http://<003>:8080/solr > >>>>>>> exception talking to <001>:8080/solr/core1/, failed > >>>>>>>> org.apache.solr.client.solrj.SolrServerException: Timeout > >>>>>>> occured while waiting response from server at: > <001>:8080/solr/core1 > >>>>>>>> at > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409) > >>>>>>>> at > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) > >>>>>>>> at > >>>>>>> > >>>> > >> > org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166) > >>>>>>>> at > >>>>>>> > >>>> > >> > org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133) > >>>>>>>> at > >>>> java.util.concurrent.FutureTask$Sync.innerRun(Unknown > >>>>>>> Source) > >>>>>>>> at java.util.concurrent.FutureTask.run(Unknown Source) > >>>>>>>> at > >>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) > >>>>>>>> at > >>>> java.util.concurrent.FutureTask$Sync.innerRun(Unknown > >>>>>>> Source) > >>>>>>>> at java.util.concurrent.FutureTask.run(Unknown Source) > >>>>>>>> at > >>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown > >> Source) > >>>>>>>> at > >>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) > >>>>>>>> at java.lang.Thread.run(Unknown Source) > >>>>>>>> Caused by: java.net.SocketTimeoutException: Read timed out > >>>>>>>> at java.net.SocketInputStream.socketRead0(Native > >>>> Method) > >>>>>>>> at java.net.SocketInputStream.read(Unknown Source) > >>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync > >>>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. > >>>>>>> sync failed > >>>>>>>> 12:11:46 PM org.apache.solr.common.SolrException log > >>>>>>>> SEVERE: Sync Failed > >>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> rejoinLeaderElection > >>>>>>>> INFO: There is a better leader candidate than us - going > back > >>>>>>> into recovery > >>>>>>>> 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery > >>>>>>>> INFO: Running recovery - first canceling any ongoing > recovery > >>>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run > >>>>>>>> INFO: Starting recovery process. 
core=core1 > >>>>>>> recoveringAfterStartup=false > >>>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery > >>>>>>>> INFO: Attempting to PeerSync from <001>:8080/solr/core1/ > >>>>>>> core=core1 - recoveringAfterStartup=false > >>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync > >>>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr START > >>>>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100 > >>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> runLeaderProcess > >>>>>>>> INFO: Running the leader process. > >>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> waitForReplicasToComeUp > >>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>>>>>> timeoutin=179999 > >>>>>>>> 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> waitForReplicasToComeUp > >>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>>>>>> timeoutin=179495 > >>>>>>>> 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> waitForReplicasToComeUp > >>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>>>>>> timeoutin=178985 > >>>>>>>> .... > >>>>>>>> .... > >>>>>>>> > >>>>>>>> Thanks for your help. > >>>>>>>> Sudhakar. > >>>>>>>> > >>>>>>>> On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller < > markrmil...@gmail.com> > >>>>>>> wrote: > >>>>>>>> The waiting logging had to happen on restart unless it's some kind > >> of > >>>>>>> bug. > >>>>>>>> > >>>>>>>> Beyond that, something is off, but I have no clue why - it seems > >> your > >>>>>>> clusterstate.json is not up to date at all. > >>>>>>>> > >>>>>>>> Have you tried restarting the cluster then? Does that help at all? > >>>>>>>> > >>>>>>>> Do you see any exceptions around zookeeper session timeouts? > >>>>>>>> > >>>>>>>> - Mark > >>>>>>>> > >>>>>>>> On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni < > >>>> maddineni...@gmail.com> > >>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hey Mark, > >>>>>>>>> > >>>>>>>>> Yes, I am able to access all of the nodes under each shard from > >>>>>>> solrcloud > >>>>>>>>> admin UI. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> - *It kind of looks like the urls solrcloud is using are not > >>>>>>> accessible. > >>>>>>>>> When you go to the admin page and the cloud tab, can you access > >>>>>>> the urls it > >>>>>>>>> shows for each shard? That is, if you click on of the links or > >>>>>>> copy and > >>>>>>>>> paste the address into a web browser, does it work?* > >>>>>>>>> > >>>>>>>>> Actually, I got these errors when my document upload task/job was > >>>>>>> running, > >>>>>>>>> not during the cluster restart. Also,job ran fine initially for > the > >>>>>>> first > >>>>>>>>> one hour and started throwing these errors after indexing some > >>>> docx. > >>>>>>>>> > >>>>>>>>> Thx, Sudhakar. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller < > markrmil...@gmail.com > >>>>> > >>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> It kind of looks like the urls solrcloud is using are not > >>>>>>> accessible. When > >>>>>>>>>> you go to the admin page and the cloud tab, can you access the > >>>> urls > >>>>>>> it > >>>>>>>>>> shows for each shard? That is, if you click on of the links or > >>>> copy > >>>>>>> and > >>>>>>>>>> paste the address into a web browser, does it work? > >>>>>>>>>> > >>>>>>>>>> You may have to explicitly set the host= in solr.xml if it's not > >>>>>>> auto > >>>>>>>>>> detecting the right one. Make sure the ports like right too. 
> >>>>>>>>>> > >>>>>>>>>>> waitForReplicasToComeUp > >>>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>>>>>>>>>> timeoutin=179999 > >>>>>>>>>> > >>>>>>>>>> That happens when you stop the cluster and try to start it > again - > >>>>>>> before > >>>>>>>>>> a leader is chosen, it will wait for all known replicas fora > shard > >>>>>>> to come > >>>>>>>>>> up so that everyone can sync up and have a chance to be the best > >>>>>>> leader. So > >>>>>>>>>> at this point it was only finding one of 2 known replicas and > >>>>>>> waiting for > >>>>>>>>>> the second to come up. After a couple minutes (configurable) it > >>>>>>> will just > >>>>>>>>>> continue anyway without the missing replica (if it doesn't show > >>>> up). > >>>>>>>>>> > >>>>>>>>>> - Mark > >>>>>>>>>> > >>>>>>>>>> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni < > >>>>>>> maddineni...@gmail.com> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi, > >>>>>>>>>>> We are uploading solr documents to the index in batches using > 30 > >>>>>>> threads > >>>>>>>>>>> and using ThreadPoolExecutor, LinkedBlockingQueue with max > limit > >>>>>>> set to > >>>>>>>>>>> 10000. > >>>>>>>>>>> In the code, we are using HttpSolrServer and add(inputDoc) > method > >>>>>>> to add > >>>>>>>>>>> docx. > >>>>>>>>>>> And, we have the following commit settings in solrconfig: > >>>>>>>>>>> > >>>>>>>>>>> <autoCommit> > >>>>>>>>>>> <maxTime>300000</maxTime> > >>>>>>>>>>> <maxDocs>10000</maxDocs> > >>>>>>>>>>> <openSearcher>false</openSearcher> > >>>>>>>>>>> </autoCommit> > >>>>>>>>>>> > >>>>>>>>>>> <autoSoftCommit> > >>>>>>>>>>> <maxTime>1000</maxTime> > >>>>>>>>>>> </autoSoftCommit> > >>>>>>>>>>> > >>>>>>>>>>> Cluster Details: > >>>>>>>>>>> ---------------------------- > >>>>>>>>>>> solr version - 4.0 > >>>>>>>>>>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes] > >>>>>>>>>>> numshards=2 , > >>>>>>>>>>> 001, 002, 003 are the solr nodes and these three are behind the > >>>>>>>>>>> loadbalancer <vip> > >>>>>>>>>>> 001, 003 assigned to shard1; 002 assigned to shard2 > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Logs:Getting the errors in the below sequence after uploading > >>>> some > >>>>>>> docx: > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > ----------------------------------------------------------------------------------------------------------- > >>>>>>>>>>> 003 > >>>>>>>>>>> Dec 4, 2012 12:11:46 PM > >>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>>>>>> waitForReplicasToComeUp > >>>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>>>>>>>>>> timeoutin=179999 > >>>>>>>>>>> > >>>>>>>>>>> 001 > >>>>>>>>>>> Dec 4, 2012 12:12:59 PM > >>>>>>>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor > >>>>>>>>>>> doDefensiveChecks > >>>>>>>>>>> SEVERE: ClusterState says we are the leader, but locally we > don't > >>>>>>> think > >>>>>>>>>> so > >>>>>>>>>>> > >>>>>>>>>>> 003 > >>>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException > log > >>>>>>>>>>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - > >>>>>>> retrying ... > >>>>>>>>>>> > >>>>>>>>>>> 001 > >>>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException > log > >>>>>>>>>>> SEVERE: Error uploading: org.apache.solr.common.SolrException: > >>>>>>> Server at > >>>>>>>>>>> <vip>/solr/core1. 
returned non ok status:503, message:Service > >>>>>>> Unavailable > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372) > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) > >>>>>>>>>>> 001 > >>>>>>>>>>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException > log > >>>>>>>>>>> SEVERE: Error while trying to recover. > >>>>>>>>>>> core=core1:org.apache.solr.common.SolrException: We are not the > >>>>>>> leader > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401) > >>>>>>>>>>> > >>>>>>>>>>> 001 > >>>>>>>>>>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException > log > >>>>>>>>>>> SEVERE: Error uploading: > >>>>>>>>>> org.apache.solr.client.solrj.SolrServerException: > >>>>>>>>>>> IOException occured when talking to server at <vip>/solr/core1 > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413) > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117) > >>>>>>>>>>> at > >>>> org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116) > >>>>>>>>>>> ... 5 lines omitted ... > >>>>>>>>>>> at java.lang.Thread.run(Unknown Source) > >>>>>>>>>>> Caused by: java.net.SocketException: Connection reset > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> After sometime, all the three servers are going down. > >>>>>>>>>>> > >>>>>>>>>>> Appreciate, if someone could let us know what we are missing. > >>>>>>>>>>> > >>>>>>>>>>> Thx,Sudhakar. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> <logs_error.txt> > >>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >> > >> > >
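(For completeness, a rough sketch of the kind of multi-threaded batch loader described at the top of this mail and in Sudhakar's original setup: HttpSolrServer plus a ThreadPoolExecutor over a bounded LinkedBlockingQueue. The URL, field names, thread count and batch size are placeholders rather than values from either setup, and commits are left to the autoCommit/autoSoftCommit settings quoted earlier in the thread.)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder URL - point this at the core/collection being loaded.
        final HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // 10 worker threads; the bounded queue plus CallerRunsPolicy makes the
        // producer slow down instead of queueing batches without limit.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                10, 10, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>(10000),
                new ThreadPoolExecutor.CallerRunsPolicy());

        for (int batch = 0; batch < 100; batch++) {
            final List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 200; i++) {            // a couple of hundred docs per request
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", batch + "-" + i);   // placeholder fields
                doc.addField("name", "doc " + i);
                docs.add(doc);
            }
            executor.submit(new Runnable() {
                public void run() {
                    try {
                        server.add(docs);              // one POST per batch; commits handled by autoCommit
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }

        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);
        server.shutdown();
    }
}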