I'm loading in batches: 10 threads read JSON files and load them into Solr by sending POST requests (from a couple of dozen to a couple of hundred docs per request). The POST request size limit was 1MB; when I raised it to 10MB the errors disappeared, so I suspect that was the cause. Regards.
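For anyone hitting the same thing, this is roughly the change I mean - a minimal sketch of the jetty.xml tweak, assuming the stock Jetty bundled with Solr 4.x. The attribute name (org.eclipse.jetty.server.Request.maxFormContentSize) is the one mentioned below in the thread; the value 10485760 is simply 10MB in bytes and the surrounding element layout may differ in your jetty.xml:

<Configure id="Server" class="org.eclipse.jetty.server.Server">
  <!-- ... existing server configuration ... -->
  <!-- raise the limit on POSTed form content to 10MB -->
  <Call name="setAttribute">
    <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
    <Arg type="int">10485760</Arg>
  </Call>
</Configure>

jetty.xml is only read at startup, so the Solr nodes need a restart for the new limit to take effect.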
On 3 February 2013 20:55, Mark Miller <markrmil...@gmail.com> wrote: > What led you to trying that? I'm not connecting the dots in my head - the > exception and the solution. > > - Mark > > On Feb 3, 2013, at 2:48 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote: > > > Hi, > > > > I think the issue was not in zk client timeout, but POST request size. > When > > I increased the value for Request.maxFormContentSize in jetty.xml I don't > > see this issue any more. > > > > Regards. > > > > On 3 February 2013 01:56, Mark Miller <markrmil...@gmail.com> wrote: > > > >> Do you see anything about session expiration in the logs? That is the > >> likely culprit for something like this. You may need to raise the > timeout: > >> http://wiki.apache.org/solr/SolrCloud#FAQ > >> > >> If you see no session timeouts, I don't have a guess yet. > >> > >> - Mark > >> > >> On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki <mrzewu...@gmail.com> > wrote: > >> > >>> I'm experiencing same problem in Solr4.1 during bulk loading. After 50 > >>> minutes of indexing the following error starts to occur: > >>> > >>> INFO: [core] webapp=/solr path=/update params={} {} 0 4 > >>> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log > >>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are > >> the > >>> leader, but locally we don't think so > >>> at > >>> > >> > org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295) > >>> at > >>> > >> > org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230) > >>> at > >>> > >> > org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343) > >>> at > >>> > >> > org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100) > >>> at > >>> > >> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387) > >>> at > >>> > >> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112) > >>> at > >>> > >> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96) > >>> at > >>> org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60) > >>> at > >>> > >> > org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) > >>> at > >>> > >> > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) > >>> at > >>> > >> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) > >>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816) > >>> at > >>> > >> > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448) > >>> at > >>> > >> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269) > >>> at > >>> > >> > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307) > >>> at > >>> > >> > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) > >>> at > >>> > >> > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560) > >>> at > >>> > >> > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072) > >>> at > >>> > 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382) > >>> at > >>> > >> > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) > >>> at > >>> > >> > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) > >>> at org.eclipse.jetty.server.Server.handle(Server.java:365) > >>> at > >>> > >> > org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485) > >>> at > >>> > >> > org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) > >>> at > >>> > >> > org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937) > >>> at > >>> > >> > org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998) > >>> at > >> org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856) > >>> at > >>> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) > >>> at > >>> > >> > org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) > >>> at > >>> > >> > org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) > >>> at > >>> > >> > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) > >>> at > >>> > >> > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) > >>> at java.lang.Thread.run(Unknown Source) > >>> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log > >>> Feb 02, 2013 11:36:31 PM > org.apache.solr.cloud.ShardLeaderElectionContext > >>> waitForReplicasToComeUp > >>> INFO: Waiting until we see more replicas up: total=2 found=1 > >> timeoutin=50699 > >>> > >>> Then leader tries to sync with replica and after it finishes I can > >> continue > >>> loading. > >>> None of SolrCloud nodes was restarted during that time. I don't > remember > >>> such behaviour in Solr4.0. Could it be related with the number of > fields > >>> indexed during loading ? I have a collection with about 2400 fields. I > >>> can't reproduce same issue for other collections with much less fields > >> per > >>> record. > >>> Regards. > >>> > >>> On 11 December 2012 19:50, Sudhakar Maddineni <maddineni...@gmail.com > >>> wrote: > >>> > >>>> Just an update on this issue: > >>>> We tried by increasing zookeeper client timeout settings to 30000ms > in > >>>> solr.xml (i think default is 15000ms), and haven't seen any issues > from > >> our > >>>> tests. > >>>> <cores ......... zkClientTimeout="30000" > > >>>> > >>>> Thanks, Sudhakar. 
> >>>> > >>>> On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni > >>>> <maddineni...@gmail.com>wrote: > >>>> > >>>>> We saw this error again today during our load test - basically, > >> whenever > >>>>> session is getting expired on the leader node, we are seeing the > >>>>> error.After this happens, leader(001) is going into 'recovery' mode > and > >>>> all > >>>>> the index updates are failing with "503- service unavailable" error > >>>>> message.After some time(once recovery is successful), roles are > swapped > >>>>> i.e. 001 acting as the replica and 003 as leader. > >>>>> > >>>>> Btw, do you know why the connection to zookeeper[solr->zk] getting > >>>>> interrupted in the middle? > >>>>> is it because of the load(no of updates) we are putting on the > cluster? > >>>>> > >>>>> Here is the exception stack trace: > >>>>> > >>>>> *Dec* *7*, *2012* *2:28:03* *PM* > >>>> *org.apache.solr.cloud.Overseer$ClusterStateUpdater* *amILeader* > >>>>> *WARNING:* > >>>> *org.apache.zookeeper.KeeperException$SessionExpiredException:* > >>>> *KeeperErrorCode* *=* *Session* *expired* *for* > */overseer_elect/leader* > >>>>> *at* > >>>> > >> > *org.apache.zookeeper.KeeperException.create*(*KeeperException.java:118*) > >>>>> *at* > >>>> > *org.apache.zookeeper.KeeperException.create*(*KeeperException.java:42*) > >>>>> *at* > *org.apache.zookeeper.ZooKeeper.getData*(*ZooKeeper.java:927* > >>>>> ) > >>>>> *at* > >>>> > >> > *org.apache.solr.common.cloud.SolrZkClient$7.execute*(*SolrZkClient.java:244*) > >>>>> *at* > >>>> > >> > *org.apache.solr.common.cloud.SolrZkClient$7.execute*(*SolrZkClient.java:241*) > >>>>> *at* > >>>> > >> > *org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation*(*ZkCmdExecutor.java:63*) > >>>>> *at* > >>>> > >> > *org.apache.solr.common.cloud.SolrZkClient.getData*(*SolrZkClient.java:241*) > >>>>> *at* > >>>> > >> > *org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader*(*Overseer.java:195*) > >>>>> *at* > >>>> > >> > *org.apache.solr.cloud.Overseer$ClusterStateUpdater.run*(*Overseer.java:119*) > >>>>> *at* *java.lang.Thread.run*(*Unknown* *Source*) > >>>>> > >>>>> Thx,Sudhakar. > >>>>> > >>>>> > >>>>> > >>>>> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni < > >>>> maddineni...@gmail.com > >>>>>> wrote: > >>>>> > >>>>>> Erick: > >>>>>> Not seeing any page caching related issues... > >>>>>> > >>>>>> Mark: > >>>>>> 1.Would this "waiting" on 003(replica) cause any inconsistencies in > >>>> the > >>>>>> zookeeper cluster state? I was also looking at the leader(001) logs > at > >>>> that > >>>>>> time and seeing errors related to "*SEVERE: ClusterState says we are > >> the > >>>>>> leader, but locally we don't think so*". > >>>>>> 2.Also, all of our servers in cluster were gone down when the index > >>>>>> updates were running in parallel along with this issue.Do you see > this > >>>>>> related to the session expiry on 001? 
> >>>>>> > >>>>>> Here are the logs on 001 > >>>>>> ========================= > >>>>>> > >>>>>> Dec 4, 2012 12:12:29 PM > >>>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader > >>>>>> WARNING: > >>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: > >>>>>> KeeperErrorCode = Session expired for /overseer_elect/leader > >>>>>> at > >>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:118) > >>>>>> at > >> org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > >>>>>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927) > >>>>>> Dec 4, 2012 12:12:29 PM > >>>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader > >>>>>> INFO: According to ZK I > >>>>>> (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a > >>>> leader. > >>>>>> > >>>>>> Dec 4, 2012 12:12:29 PM > >>>> org.apache.solr.cloud.OverseerCollectionProcessor > >>>>>> run > >>>>>> WARNING: Overseer cannot talk to ZK > >>>>>> > >>>>>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: There was a problem finding the leader in > >>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props > >>>>>> at > >>>>>> > >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709) > >>>>>> at > >>>>>> > >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673) > >>>>>> Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: There was a problem finding the leader in > >>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props > >>>>>> at > >>>>>> > >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709) > >>>>>> at > >>>>>> > >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673) > >>>>>> Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: There was a problem making a request to the > >>>>>> leader:org.apache.solr.common.SolrException: I was asked to wait on > >>>> state > >>>>>> down for <001>:8080_solr but I still do not see the request state. I > >> see > >>>>>> state: active live:true > >>>>>> at > >>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401) > >>>>>> Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: There was a problem making a request to the > >>>>>> leader:org.apache.solr.common.SolrException: I was asked to wait on > >>>> state > >>>>>> down for <001>:8080_solr but I still do not see the request state. I > >> see > >>>>>> state: active live:true > >>>>>> at > >>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401) > >>>>>> .... > >>>>>> .... > >>>>>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: There was a problem finding the leader in > >>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props > >>>>>> at > >>>>>> > >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709) > >>>>>> .... > >>>>>> .... 
> >>>>>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log > >>>>>> SEVERE: :org.apache.solr.common.SolrException: There was a problem > >>>>>> finding the leader in zk > >>>>>> at > >>>>>> > >>>> > >> > org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080) > >>>>>> at > >>>>>> > >>>> > >> > org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273) > >>>>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader > >>>>>> SEVERE: Error getting leader from zk > >>>>>> org.apache.solr.common.SolrException: *There is conflicting > >> information > >>>>>> about the leader of shard: shard1 our state says:http:// > >>>> <001>:8080/solr/core1/ > >>>>>> but zookeeper says:http://<003>:8080/solr/core1/* > >>>>>> * at > >>>> org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)* > >>>>>> * at > >> org.apache.solr.cloud.ZkController.register(ZkController.java:577)* > >>>>>> Dec 4, 2012 12:22:30 PM > >>>>>> org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess > >>>>>> INFO: Running the leader process. > >>>>>> .... > >>>>>> .... > >>>>>> > >>>>>> Thanks for your inputs. > >>>>>> Sudhakar. > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmil...@gmail.com > >>>>> wrote: > >>>>>> > >>>>>>> Yes - it means that 001 went down (or more likely had it's > connection > >>>> to > >>>>>>> ZooKeeper interrupted? that's what I mean about a session timeout - > >> if > >>>> the > >>>>>>> solr->zk link is broken for longer than the session timeout that > will > >>>>>>> trigger a leader election and when the connection is reestablished, > >> the > >>>>>>> node will have to recover). That waiting should stop as soon as 001 > >>>> came > >>>>>>> back up or reconnected to ZooKeeper. > >>>>>>> > >>>>>>> In fact, this waiting should not happen in this case - but only on > >>>>>>> cluster restart. This is a bug that is fixed in 4.1 (hopefully > coming > >>>> very > >>>>>>> soon!): > >>>>>>> > >>>>>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the > >>>> code > >>>>>>> path > >>>>>>> for a fresh cluster start rather than fail over. (Mark Miller) > >>>>>>> > >>>>>>> - Mark > >>>>>>> > >>>>>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni < > >> maddineni...@gmail.com > >>>>> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Yep, after restarting, cluster came back to normal state.We will > run > >>>>>>> couple of more tests and see if we could reproduce this issue. > >>>>>>>> > >>>>>>>> Btw, I am attaching the server logs before that 'INFO: Waiting > until > >>>>>>> we see more replicas' message.From the logs, we can see that > leader > >>>>>>> election process started on 003 which was the replica for 001 > >>>>>>> initially.That means leader 001 went down at that time? > >>>>>>>> > >>>>>>>> logs on 003: > >>>>>>>> ======== > >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> runLeaderProcess > >>>>>>>> INFO: Running the leader process. > >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> shouldIBeLeader > >>>>>>>> INFO: Checking if I should try and be the leader. > >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> shouldIBeLeader > >>>>>>>> INFO: My last published State was Active, it's okay to be > the > >>>>>>> leader. 
> >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> runLeaderProcess > >>>>>>>> INFO: I may be the new leader - try and sync > >>>>>>>> 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close > >>>>>>>> WARNING: Stopping recovery for > >>>> zkNodeName=<003>:8080_solr_core > >>>>>>> core=core1. > >>>>>>>> 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync > >>>>>>>> INFO: Sync replicas to http://<003>:8080/solr/core1/ > >>>>>>>> 12:11:16 PM org.apache.solr.update.PeerSync sync > >>>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr START > >>>>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100 > >>>>>>>> 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process > >>>>>>>> INFO: Updating live nodes -> this message is on 002 > >>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync handleResponse > >>>>>>>> WARNING: PeerSync: core=core1 url=http://<003>:8080/solr > >>>>>>> exception talking to <001>:8080/solr/core1/, failed > >>>>>>>> org.apache.solr.client.solrj.SolrServerException: Timeout > >>>>>>> occured while waiting response from server at: > <001>:8080/solr/core1 > >>>>>>>> at > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409) > >>>>>>>> at > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) > >>>>>>>> at > >>>>>>> > >>>> > >> > org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166) > >>>>>>>> at > >>>>>>> > >>>> > >> > org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133) > >>>>>>>> at > >>>> java.util.concurrent.FutureTask$Sync.innerRun(Unknown > >>>>>>> Source) > >>>>>>>> at java.util.concurrent.FutureTask.run(Unknown Source) > >>>>>>>> at > >>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) > >>>>>>>> at > >>>> java.util.concurrent.FutureTask$Sync.innerRun(Unknown > >>>>>>> Source) > >>>>>>>> at java.util.concurrent.FutureTask.run(Unknown Source) > >>>>>>>> at > >>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown > >> Source) > >>>>>>>> at > >>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) > >>>>>>>> at java.lang.Thread.run(Unknown Source) > >>>>>>>> Caused by: java.net.SocketTimeoutException: Read timed out > >>>>>>>> at java.net.SocketInputStream.socketRead0(Native > >>>> Method) > >>>>>>>> at java.net.SocketInputStream.read(Unknown Source) > >>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync > >>>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. > >>>>>>> sync failed > >>>>>>>> 12:11:46 PM org.apache.solr.common.SolrException log > >>>>>>>> SEVERE: Sync Failed > >>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> rejoinLeaderElection > >>>>>>>> INFO: There is a better leader candidate than us - going > back > >>>>>>> into recovery > >>>>>>>> 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery > >>>>>>>> INFO: Running recovery - first canceling any ongoing > recovery > >>>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run > >>>>>>>> INFO: Starting recovery process. 
core=core1 > >>>>>>> recoveringAfterStartup=false > >>>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery > >>>>>>>> INFO: Attempting to PeerSync from <001>:8080/solr/core1/ > >>>>>>> core=core1 - recoveringAfterStartup=false > >>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync > >>>>>>>> INFO: PeerSync: core=core1 url=http://<003>:8080/solr START > >>>>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100 > >>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> runLeaderProcess > >>>>>>>> INFO: Running the leader process. > >>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> waitForReplicasToComeUp > >>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>>>>>> timeoutin=179999 > >>>>>>>> 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> waitForReplicasToComeUp > >>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>>>>>> timeoutin=179495 > >>>>>>>> 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>> waitForReplicasToComeUp > >>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>>>>>> timeoutin=178985 > >>>>>>>> .... > >>>>>>>> .... > >>>>>>>> > >>>>>>>> Thanks for your help. > >>>>>>>> Sudhakar. > >>>>>>>> > >>>>>>>> On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller < > markrmil...@gmail.com> > >>>>>>> wrote: > >>>>>>>> The waiting logging had to happen on restart unless it's some kind > >> of > >>>>>>> bug. > >>>>>>>> > >>>>>>>> Beyond that, something is off, but I have no clue why - it seems > >> your > >>>>>>> clusterstate.json is not up to date at all. > >>>>>>>> > >>>>>>>> Have you tried restarting the cluster then? Does that help at all? > >>>>>>>> > >>>>>>>> Do you see any exceptions around zookeeper session timeouts? > >>>>>>>> > >>>>>>>> - Mark > >>>>>>>> > >>>>>>>> On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni < > >>>> maddineni...@gmail.com> > >>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hey Mark, > >>>>>>>>> > >>>>>>>>> Yes, I am able to access all of the nodes under each shard from > >>>>>>> solrcloud > >>>>>>>>> admin UI. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> - *It kind of looks like the urls solrcloud is using are not > >>>>>>> accessible. > >>>>>>>>> When you go to the admin page and the cloud tab, can you access > >>>>>>> the urls it > >>>>>>>>> shows for each shard? That is, if you click on of the links or > >>>>>>> copy and > >>>>>>>>> paste the address into a web browser, does it work?* > >>>>>>>>> > >>>>>>>>> Actually, I got these errors when my document upload task/job was > >>>>>>> running, > >>>>>>>>> not during the cluster restart. Also,job ran fine initially for > the > >>>>>>> first > >>>>>>>>> one hour and started throwing these errors after indexing some > >>>> docx. > >>>>>>>>> > >>>>>>>>> Thx, Sudhakar. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller < > markrmil...@gmail.com > >>>>> > >>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> It kind of looks like the urls solrcloud is using are not > >>>>>>> accessible. When > >>>>>>>>>> you go to the admin page and the cloud tab, can you access the > >>>> urls > >>>>>>> it > >>>>>>>>>> shows for each shard? That is, if you click on of the links or > >>>> copy > >>>>>>> and > >>>>>>>>>> paste the address into a web browser, does it work? > >>>>>>>>>> > >>>>>>>>>> You may have to explicitly set the host= in solr.xml if it's not > >>>>>>> auto > >>>>>>>>>> detecting the right one. Make sure the ports like right too. 
> >>>>>>>>>> > >>>>>>>>>>> waitForReplicasToComeUp > >>>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>>>>>>>>>> timeoutin=179999 > >>>>>>>>>> > >>>>>>>>>> That happens when you stop the cluster and try to start it > again - > >>>>>>> before > >>>>>>>>>> a leader is chosen, it will wait for all known replicas fora > shard > >>>>>>> to come > >>>>>>>>>> up so that everyone can sync up and have a chance to be the best > >>>>>>> leader. So > >>>>>>>>>> at this point it was only finding one of 2 known replicas and > >>>>>>> waiting for > >>>>>>>>>> the second to come up. After a couple minutes (configurable) it > >>>>>>> will just > >>>>>>>>>> continue anyway without the missing replica (if it doesn't show > >>>> up). > >>>>>>>>>> > >>>>>>>>>> - Mark > >>>>>>>>>> > >>>>>>>>>> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni < > >>>>>>> maddineni...@gmail.com> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi, > >>>>>>>>>>> We are uploading solr documents to the index in batches using > 30 > >>>>>>> threads > >>>>>>>>>>> and using ThreadPoolExecutor, LinkedBlockingQueue with max > limit > >>>>>>> set to > >>>>>>>>>>> 10000. > >>>>>>>>>>> In the code, we are using HttpSolrServer and add(inputDoc) > method > >>>>>>> to add > >>>>>>>>>>> docx. > >>>>>>>>>>> And, we have the following commit settings in solrconfig: > >>>>>>>>>>> > >>>>>>>>>>> <autoCommit> > >>>>>>>>>>> <maxTime>300000</maxTime> > >>>>>>>>>>> <maxDocs>10000</maxDocs> > >>>>>>>>>>> <openSearcher>false</openSearcher> > >>>>>>>>>>> </autoCommit> > >>>>>>>>>>> > >>>>>>>>>>> <autoSoftCommit> > >>>>>>>>>>> <maxTime>1000</maxTime> > >>>>>>>>>>> </autoSoftCommit> > >>>>>>>>>>> > >>>>>>>>>>> Cluster Details: > >>>>>>>>>>> ---------------------------- > >>>>>>>>>>> solr version - 4.0 > >>>>>>>>>>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes] > >>>>>>>>>>> numshards=2 , > >>>>>>>>>>> 001, 002, 003 are the solr nodes and these three are behind the > >>>>>>>>>>> loadbalancer <vip> > >>>>>>>>>>> 001, 003 assigned to shard1; 002 assigned to shard2 > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Logs:Getting the errors in the below sequence after uploading > >>>> some > >>>>>>> docx: > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > ----------------------------------------------------------------------------------------------------------- > >>>>>>>>>>> 003 > >>>>>>>>>>> Dec 4, 2012 12:11:46 PM > >>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext > >>>>>>>>>>> waitForReplicasToComeUp > >>>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 > >>>>>>>>>>> timeoutin=179999 > >>>>>>>>>>> > >>>>>>>>>>> 001 > >>>>>>>>>>> Dec 4, 2012 12:12:59 PM > >>>>>>>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor > >>>>>>>>>>> doDefensiveChecks > >>>>>>>>>>> SEVERE: ClusterState says we are the leader, but locally we > don't > >>>>>>> think > >>>>>>>>>> so > >>>>>>>>>>> > >>>>>>>>>>> 003 > >>>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException > log > >>>>>>>>>>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - > >>>>>>> retrying ... > >>>>>>>>>>> > >>>>>>>>>>> 001 > >>>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException > log > >>>>>>>>>>> SEVERE: Error uploading: org.apache.solr.common.SolrException: > >>>>>>> Server at > >>>>>>>>>>> <vip>/solr/core1. 
returned non ok status:503, message:Service > >>>>>>> Unavailable > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372) > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) > >>>>>>>>>>> 001 > >>>>>>>>>>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException > log > >>>>>>>>>>> SEVERE: Error while trying to recover. > >>>>>>>>>>> core=core1:org.apache.solr.common.SolrException: We are not the > >>>>>>> leader > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401) > >>>>>>>>>>> > >>>>>>>>>>> 001 > >>>>>>>>>>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException > log > >>>>>>>>>>> SEVERE: Error uploading: > >>>>>>>>>> org.apache.solr.client.solrj.SolrServerException: > >>>>>>>>>>> IOException occured when talking to server at <vip>/solr/core1 > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413) > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) > >>>>>>>>>>> at > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117) > >>>>>>>>>>> at > >>>> org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116) > >>>>>>>>>>> ... 5 lines omitted ... > >>>>>>>>>>> at java.lang.Thread.run(Unknown Source) > >>>>>>>>>>> Caused by: java.net.SocketException: Connection reset > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> After sometime, all the three servers are going down. > >>>>>>>>>>> > >>>>>>>>>>> Appreciate, if someone could let us know what we are missing. > >>>>>>>>>>> > >>>>>>>>>>> Thx,Sudhakar. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> <logs_error.txt> > >>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >> > >> > >
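(For completeness, a rough sketch of the kind of multi-threaded batch loader described at the top of this mail and in Sudhakar's original setup: HttpSolrServer plus a ThreadPoolExecutor over a bounded LinkedBlockingQueue. The URL, field names, thread count and batch size are placeholders rather than values from either setup, and commits are left to the autoCommit/autoSoftCommit settings quoted earlier in the thread.)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder URL - point this at the core/collection being loaded.
        final HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // 10 worker threads; the bounded queue plus CallerRunsPolicy makes the
        // producer slow down instead of queueing batches without limit.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                10, 10, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>(10000),
                new ThreadPoolExecutor.CallerRunsPolicy());

        for (int batch = 0; batch < 100; batch++) {
            final List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 200; i++) {            // a couple of hundred docs per request
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", batch + "-" + i);   // placeholder fields
                doc.addField("name", "doc " + i);
                docs.add(doc);
            }
            executor.submit(new Runnable() {
                public void run() {
                    try {
                        server.add(docs);              // one POST per batch; commits handled by autoCommit
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }

        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);
        server.shutdown();
    }
}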