After some more playing around on 5x, I have reproduced the issue. I'll file a 
JIRA issue for you and fix it shortly.

- Mark

On Dec 8, 2012, at 8:43 AM, Mark Miller <markrmil...@gmail.com> wrote:

> Hmm… I've tried to replicate what looked like a bug from your report (3 Solr 
> servers stop/start), but on 5x it works without a problem for me. It 
> shouldn't be any different on 4x, but I'll try that next.
> 
> As for starting up Solr without a working ZooKeeper ensemble: it won't work 
> currently. Cores won't be able to register with ZooKeeper and will fail to 
> load. It would probably be nicer to come up in search-only mode and keep 
> trying to reconnect to ZooKeeper; file a JIRA issue if you are interested.
> 
> On the zk data dir, see 
> http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#Ongoing+Data+Directory+Cleanup
> 
> - Mark
> 
> On Dec 7, 2012, at 10:22 PM, Mark Miller <markrmil...@gmail.com> wrote:
> 
>> Hey, I'll try and answer this tomorrow.
>> 
>> There is definitely an unreported bug in there that needs to be fixed for 
>> the case where all nodes are restarted.
>> 
>> Also, a 404 is generally seen when Jetty is starting or stopping - there are 
>> points where 404s can be returned. I'm not sure why else you'd see one. 
>> Generally we do retries when that happens.
>> 
>> - Mark
>> 
>> On Dec 7, 2012, at 1:07 PM, Alain Rogister <alain.rogis...@gmail.com> wrote:
>> 
>>> I am reporting the results of my stress tests against Solr 4.x. As I was
>>> getting many error conditions with 4.0, I switched to the 4.1 trunk in the
>>> hope that some of the issues would already be fixed. Here is my setup:
>>> 
>>> - Everything running on a single box (2 x 4-core CPUs, 8 GB RAM). I realize
>>> this is not representative of a production environment but it's a fine way
>>> to find out what happens under resource-constrained conditions.
>>> - 3 Solr servers, 3 cores (2 of which are very small, the third one has 410
>>> MB of data)
>>> - single shard
>>> - 3 ZooKeeper instances
>>> - HAProxy load balancing requests across Solr servers
>>> - JMeter or ApacheBench running the tests: 5 thread pools of 20 threads
>>> each, sending search requests continuously (no updates)
>>> 
>>> Under nominal conditions, it all works fine, i.e. it can process a million
>>> requests, maxing out the CPUs at all times, without experiencing nasty
>>> failures. There are errors in the logs about replication failures though;
>>> they should be benign in this case as no updates are taking place, but it's
>>> hard to tell what is going on exactly. Example:
>>> 
>>> Dec 07, 2012 7:50:37 PM org.apache.solr.update.PeerSync handleResponse
>>> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr
>>> exception talking to
>>> http://192.168.0.101:8985/solr/adressage/, failed
>>> org.apache.solr.common.SolrException: Server at
>>> http://192.168.0.101:8985/solr/adressage returned non ok status:404,
>>> message:Not Found
>>> at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>>> at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> at
>>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>>> at
>>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>> at java.lang.Thread.run(Thread.java:722)
>>> 
>>> Then I simulated various failure scenarios:
>>> 
>>> - 1 Solr server stop/start
>>> - 2 Solr servers stop/start
>>> - 3 Solr servers stop/start: it seems that in this case, the Solr servers
>>> *cannot* be restarted. More precisely, the restarted server will consider
>>> that it is number 1 out of 4 and wait for the other 3 to come up. The only
>>> way out is to stop it again, then stop all ZooKeeper instances *and* clean
>>> up their zkdata directories, start them, then start the Solr servers.
>>> 
>>> I noticed that these zkdata directories had grown to 200 MB after a while.
>>> What exactly is in there besides the configuration data? Do they stop
>>> growing?
>>> 
>>> Then I tried this:
>>> 
>>> - kill 1 ZooKeeper process
>>> - kill 2 ZooKeeper processes
>>> - stop/start 1 Solr server
>>> 
>>> When doing this, I experienced (many times) situations where the Solr
>>> servers could not reconnect and threw scary exceptions. The only way out
>>> was to restart the whole cluster.
>>> 
>>> Q: When, if ever, is one supposed to clean up the zkdata directories?
>>> 
>>> Here are the errors I found in the logs. It seems that some of them have
>>> been reported in JIRA but 4.1-trunk seems to experience basically the same
>>> issues as 4.0 in my test scenarios.
>>> 
>>> Dec 07, 2012 8:03:59 PM org.apache.solr.update.PeerSync handleResponse
>>> WARNING: PeerSync: core=cachede url=http://192.168.0.101:8983/solr
>>> couldn't connect to
>>> http://192.168.0.101:8984/solr/cachede/, counting as success
>>> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
>>> SEVERE: Sync request error:
>>> org.apache.solr.client.solrj.SolrServerException: Server refused connection
>>> at: http://192.168.0.101:8984/solr/cachede
>>> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
>>> SEVERE: http://192.168.0.101:8983/solr/cachede/: Could not tell a replica
>>> to recover:org.apache.solr.client.solrj.SolrServerException: Server refused
>>> connection at: http://192.168.0.101:8984/solr
>>> at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:406)
>>> at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> at org.apache.solr.cloud.SyncStrategy$1.run(SyncStrategy.java:293)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>> at java.lang.Thread.run(Thread.java:722)
>>> Caused by: org.apache.http.conn.HttpHostConnectException: Connection to
>>> http://192.168.0.101:8984 refused
>>> at
>>> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
>>> at
>>> org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)
>>> at
>>> org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
>>> at
>>> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:575)
>>> at
>>> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:425)
>>> at
>>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
>>> at
>>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
>>> at
>>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
>>> at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
>>> ... 5 more
>>> Caused by: java.net.ConnectException: Connection refused
>>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>>> at
>>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>> at
>>> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>> at
>>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
>>> at java.net.Socket.connect(Socket.java:579)
>>> at
>>> org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123)
>>> at
>>> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
>>> ... 13 more
>>> 
>>> Dec 07, 2012 8:03:59 PM org.apache.solr.update.PeerSync handleResponse
>>> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr  got a
>>> 404 from http://192.168.0.101:8985/solr/adressage/, counting as success
>>> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
>>> SEVERE: Sync request error: org.apache.solr.common.SolrException: Server at
>>> http://192.168.0.101:8985/solr/adressage returned non ok status:404,
>>> message:Not Found
>>> Dec 07, 2012 8:04:00 PM org.apache.solr.update.PeerSync handleResponse
>>> WARNING: PeerSync: core=formabanque url=http://192.168.0.101:8983/solr  got
>>> a 404 from http://192.168.0.101:8985/solr/formabanque/, counting as success
>>> Dec 07, 2012 8:04:00 PM org.apache.solr.common.SolrException log
>>> SEVERE: Sync request error: org.apache.solr.common.SolrException: Server at
>>> http://192.168.0.101:8985/solr/formabanque returned non ok status:404,
>>> message:Not Found
>>> 
>>> Dec 07, 2012 8:04:32 PM org.apache.solr.update.PeerSync sync
>>> WARNING: no frame of reference to tell of we've missed updates
>>> 
>>> Dec 07, 2012 8:03:58 PM org.apache.solr.common.SolrException log
>>> SEVERE: Error while trying to
>>> recover:org.apache.solr.client.solrj.SolrServerException: Server refused
>>> connection at: http://192.168.0.101:8984/solr/adressage
>>> at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:406)
>>> at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> at
>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>> at
>>> org.apache.solr.cloud.RecoveryStrategy.commitOnLeader(RecoveryStrategy.java:182)
>>> at
>>> org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:134)
>>> at
>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
>>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
>>> Caused by: org.apache.http.conn.HttpHostConnectException: Connection to
>>> http://192.168.0.101:8984 refused
>>> at
>>> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
>>> at
>>> org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)
>>> at
>>> org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
>>> at
>>> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:575)
>>> at
>>> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:425)
>>> at
>>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
>>> at
>>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
>>> at
>>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
>>> at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
>>> ... 6 more
>>> Caused by: java.net.ConnectException: Connection refused
>>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>>> at
>>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>> at
>>> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>> at
>>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
>>> at java.net.Socket.connect(Socket.java:579)
>>> at
>>> org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123)
>>> at
>>> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
>>> ... 14 more
>>> 
>>> Dec 07, 2012 8:03:58 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>>> SEVERE: Recovery failed - trying again... (0) core=adressage
>>> 
>>> SEVERE: Error getting leader from zk
>>> org.apache.solr.common.SolrException: Could not get leader props
>>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:735)
>>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:699)
>>> at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:664)
>>> at org.apache.solr.cloud.ZkController.register(ZkController.java:603)
>>> at org.apache.solr.cloud.ZkController.register(ZkController.java:558)
>>> at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:791)
>>> at org.apache.solr.core.CoreContainer.register(CoreContainer.java:775)
>>> at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:567)
>>> at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:562)
>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>> at java.lang.Thread.run(Thread.java:722)
>>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>>> KeeperErrorCode = NoNode for /collections/adressage/leaders/shard1
>>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>>> at
>>> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
>>> at
>>> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
>>> at
>>> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>>> at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
>>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:713)
>>> ... 16 more
>>> 
>>> Dec 07, 2012 4:39:23 PM org.apache.solr.common.SolrException log
>>> SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
>>> at
>>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:159)
>>> at
>>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>> at java.lang.Thread.run(Thread.java:722)
>> 
> 
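
A note on the "kill 2 Zookeeper processes" scenario in the thread above: a
ZooKeeper ensemble only serves requests while a strict majority of its
configured servers is up, so a 3-instance ensemble can lose one server but not
two. That may be one contributor to the Solr servers being unable to reconnect
in that test, although the thread does not confirm it. The following is a
minimal sketch of the majority arithmetic only; the function names are made up
for illustration and are not part of any ZooKeeper or Solr API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, servers_up: int) -> bool:
    """True if the servers that are up still form a strict majority."""
    return servers_up >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # The setup in the thread uses 3 ZooKeeper instances.
    for down in range(4):
        up = 3 - down
        print(f"3-server ensemble, {down} down: quorum={has_quorum(3, up)}")
```

By this arithmetic, a 4-server ensemble tolerates no more failures than a
3-server one, which is why odd ensemble sizes are the usual choice.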

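
On the 404s discussed in the thread: Mark mentions that a 404 can be returned
while Jetty is starting or stopping and that retries are generally done in that
case. The sketch below shows that general idea from a client's point of view.
It is not Solr's actual retry logic; request_with_retry, its parameters, and
the send callable are hypothetical names introduced here for illustration.

```python
import time
from typing import Callable, Iterable, Tuple


def request_with_retry(
    send: Callable[[], Tuple[int, str]],
    retry_statuses: Iterable[int] = (404,),
    max_attempts: int = 3,
    delay_seconds: float = 0.5,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out.

    send is any zero-argument callable returning (http_status, body).
    The last response is returned even if it is still a retryable status.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    retryable = set(retry_statuses)
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in retryable or attempt == max_attempts:
            return status, body
        # Linear backoff: wait a little longer before each new attempt.
        time.sleep(delay_seconds * attempt)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    # Simulated server that answers 404 twice (still starting), then 200.
    responses = iter([(404, "Not Found"), (404, "Not Found"), (200, "ok")])
    print(request_with_retry(lambda: next(responses), delay_seconds=0.0))
```

Passing the request as a callable keeps the sketch independent of any HTTP
library; a real client would wrap its own request call in send.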