After some more playing around on 5x I have duplicated the issue. I'll file a JIRA issue for you and fix it shortly.
- Mark

On Dec 8, 2012, at 8:43 AM, Mark Miller <markrmil...@gmail.com> wrote:

> Hmm… I've tried to replicate what looked like a bug from your report (3 Solr
> servers stop/start), but on 5x it works no problem for me. It shouldn't be
> any different on 4x, but I'll try that next.
>
> In terms of starting up Solr without a working ZooKeeper ensemble - it won't
> work currently. Cores won't be able to register with ZooKeeper and will fail
> loading. It would probably be nicer to come up in search-only mode and keep
> trying to reconnect to ZooKeeper - file a JIRA issue if you are interested.
>
> On the zk data dir, see
> http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#Ongoing+Data+Directory+Cleanup
>
> - Mark
>
> On Dec 7, 2012, at 10:22 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> Hey, I'll try and answer this tomorrow.
>>
>> There is definitely an unreported bug in there that needs to be fixed for
>> the case of restarting all the nodes.
>>
>> Also, a 404 generally happens when jetty is starting or stopping - there are
>> points where 404's can be returned. I'm not sure why else you'd see one.
>> Generally we do retries when that happens.
>>
>> - Mark
>>
>> On Dec 7, 2012, at 1:07 PM, Alain Rogister <alain.rogis...@gmail.com> wrote:
>>
>>> I am reporting the results of my stress tests against Solr 4.x. As I was
>>> getting many error conditions with 4.0, I switched to the 4.1 trunk in the
>>> hope that some of the issues would be fixed already. Here is my setup:
>>>
>>> - Everything running on a single box (2 x 4-core CPUs, 8 GB RAM). I realize
>>> this is not representative of a production environment but it's a fine way
>>> to find out what happens under resource-constrained conditions.
>>> - 3 Solr servers, 3 cores (2 of which are very small, the third one has 410
>>> MB of data)
>>> - single shard
>>> - 3 Zookeeper instances
>>> - HAProxy load balancing requests across Solr servers
>>> - JMeter or ApacheBench running the tests: 5 thread pools of 20 threads
>>> each, sending search requests continuously (no updates)
>>>
>>> In nominal conditions, it all works fine, i.e. it can process a million
>>> requests, maxing out the CPUs at all times, without experiencing nasty
>>> failures. There are errors in the logs about replication failures though;
>>> they should be benign in this case as no updates are taking place but it's
>>> hard to tell what is going on exactly. Example:
>>>
>>> Dec 07, 2012 7:50:37 PM org.apache.solr.update.PeerSync handleResponse
>>> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr exception talking to http://192.168.0.101:8985/solr/adressage/, failed
>>> org.apache.solr.common.SolrException: Server at http://192.168.0.101:8985/solr/adressage returned non ok status:404, message:Not Found
>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>     at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>>>     at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>     at java.lang.Thread.run(Thread.java:722)
>>>
>>> Then I simulated various failure scenarios:
>>>
>>> - 1 Solr server stop/start
>>> - 2 Solr servers stop/start
>>> - 3 Solr servers stop/start: it seems that in this case, the Solr servers
>>> *cannot* be restarted: more exactly, the restarted server will consider
>>> that it is number 1 out of 4 and wait for the other 3 to come up. The only
>>> way out is to stop it again, then stop all Zookeeper instances *and* clean
>>> up their zkdata directory, start them, then start the Solr servers.
>>>
>>> I noticed that these zkdata directories had grown to 200 MB after a while.
>>> What exactly is in there besides the configuration data? Does it stop
>>> growing?
>>>
>>> Then I tried this:
>>>
>>> - kill 1 Zookeeper process
>>> - kill 2 Zookeeper processes
>>> - stop/start 1 Solr server
>>>
>>> When doing this, I experienced (many times) situations where the Solr
>>> servers could not reconnect and threw scary exceptions. The only way out
>>> was to restart the whole cluster.
>>>
>>> Q: when, if ever, is one supposed to clean up the zkdata directories?
>>>
>>> Here are the errors I found in the logs. It seems that some of them have
>>> been reported in JIRA but 4.1-trunk seems to experience basically the same
>>> issues as 4.0 in my test scenarios.
>>>
>>> Dec 07, 2012 8:03:59 PM org.apache.solr.update.PeerSync handleResponse
>>> WARNING: PeerSync: core=cachede url=http://192.168.0.101:8983/solr couldn't connect to http://192.168.0.101:8984/solr/cachede/, counting as success
>>> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
>>> SEVERE: Sync request error: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://192.168.0.101:8984/solr/cachede
>>> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
>>> SEVERE: http://192.168.0.101:8983/solr/cachede/: Could not tell a replica to recover:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://192.168.0.101:8984/solr
>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:406)
>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>     at org.apache.solr.cloud.SyncStrategy$1.run(SyncStrategy.java:293)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>     at java.lang.Thread.run(Thread.java:722)
>>> Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://192.168.0.101:8984 refused
>>>     at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
>>>     at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)
>>>     at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
>>>     at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:575)
>>>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:425)
>>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
>>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
>>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
>>>     ... 5 more
>>> Caused by: java.net.ConnectException: Connection refused
>>>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
>>>     at java.net.Socket.connect(Socket.java:579)
>>>     at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123)
>>>     at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
>>>     ... 13 more
>>>
>>> Dec 07, 2012 8:03:59 PM org.apache.solr.update.PeerSync handleResponse
>>> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr got a 404 from http://192.168.0.101:8985/solr/adressage/, counting as success
>>> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
>>> SEVERE: Sync request error: org.apache.solr.common.SolrException: Server at http://192.168.0.101:8985/solr/adressage returned non ok status:404, message:Not Found
>>> Dec 07, 2012 8:04:00 PM org.apache.solr.update.PeerSync handleResponse
>>> WARNING: PeerSync: core=formabanque url=http://192.168.0.101:8983/solr got a 404 from http://192.168.0.101:8985/solr/formabanque/, counting as success
>>> Dec 07, 2012 8:04:00 PM org.apache.solr.common.SolrException log
>>> SEVERE: Sync request error: org.apache.solr.common.SolrException: Server at http://192.168.0.101:8985/solr/formabanque returned non ok status:404, message:Not Found
>>>
>>> Dec 07, 2012 8:04:32 PM org.apache.solr.update.PeerSync sync
>>> WARNING: no frame of reference to tell of we've missed updates
>>>
>>> Dec 07, 2012 8:03:58 PM org.apache.solr.common.SolrException log
>>> SEVERE: Error while trying to recover:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://192.168.0.101:8984/solr/adressage
>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:406)
>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>     at org.apache.solr.cloud.RecoveryStrategy.commitOnLeader(RecoveryStrategy.java:182)
>>>     at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:134)
>>>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
>>>     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
>>> Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://192.168.0.101:8984 refused
>>>     at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
>>>     at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)
>>>     at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
>>>     at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:575)
>>>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:425)
>>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
>>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
>>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
>>>     ... 6 more
>>> Caused by: java.net.ConnectException: Connection refused
>>>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
>>>     at java.net.Socket.connect(Socket.java:579)
>>>     at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123)
>>>     at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
>>>     ... 14 more
>>>
>>> Dec 07, 2012 8:03:58 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>>> SEVERE: Recovery failed - trying again... (0) core=adressage
>>>
>>> SEVERE: Error getting leader from zk
>>> org.apache.solr.common.SolrException: Could not get leader props
>>>     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:735)
>>>     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:699)
>>>     at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:664)
>>>     at org.apache.solr.cloud.ZkController.register(ZkController.java:603)
>>>     at org.apache.solr.cloud.ZkController.register(ZkController.java:558)
>>>     at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:791)
>>>     at org.apache.solr.core.CoreContainer.register(CoreContainer.java:775)
>>>     at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:567)
>>>     at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:562)
>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>     at java.lang.Thread.run(Thread.java:722)
>>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/adressage/leaders/shard1
>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>>>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
>>>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
>>>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>>>     at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
>>>     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:713)
>>>     ... 16 more
>>>
>>> Dec 07, 2012 4:39:23 PM org.apache.solr.common.SolrException log
>>> SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
>>>     at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:159)
>>>     at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>     at java.lang.Thread.run(Thread.java:722)
>>
>
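
On the zkdata question raised twice above: per the ZooKeeper administrator's guide linked in the reply, the data directory holds snapshots and transaction logs as well as the configuration data, and ZooKeeper does not delete old ones unless told to. That would be consistent with the directories growing to 200 MB, though the thread does not confirm the cause. ZooKeeper 3.4.0 and later can purge old files automatically. A minimal sketch of the relevant zoo.cfg lines, assuming a 3.4.x ensemble; the values are illustrative and not taken from the thread:

```properties
# Assumed values for illustration only; tune for your own setup.

# Keep the 3 most recent snapshots (and their transaction logs); 3 is the minimum.
autopurge.snapRetainCount=3

# Run the purge task every 1 hour. The default of 0 leaves autopurge disabled.
autopurge.purgeInterval=1
```

Each ZooKeeper instance reads its own zoo.cfg, so the lines would go into the configuration of all 3 instances. On older versions, or with autopurge left disabled, the same guide describes the PurgeTxnLog utility, which can be run by hand or from cron.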