Hey, I'll try to answer this tomorrow. There is definitely an unreported bug in there that needs to be fixed for the case of restarting all nodes.
Also, a 404 generally shows up while Jetty is starting or stopping; there are points where 404s can be returned. I'm not sure why else you'd see one. Generally we do retries when that happens.

- Mark

On Dec 7, 2012, at 1:07 PM, Alain Rogister <alain.rogis...@gmail.com> wrote:

> I am reporting the results of my stress tests against Solr 4.x. As I was
> getting many error conditions with 4.0, I switched to the 4.1 trunk in the
> hope that some of the issues would be fixed already. Here is my setup:
>
> - Everything running on a single box (2 x 4-core CPUs, 8 GB RAM). I realize
>   this is not representative of a production environment but it's a fine way
>   to find out what happens under resource-constrained conditions.
> - 3 Solr servers, 3 cores (2 of which are very small, the third one has 410
>   MB of data)
> - single shard
> - 3 Zookeeper instances
> - HAProxy load balancing requests across Solr servers
> - JMeter or ApacheBench running the tests: 5 thread pools of 20 threads
>   each, sending search requests continuously (no updates)
>
> In nominal conditions, it all works fine, i.e. it can process a million
> requests, maxing out the CPUs at all times, without experiencing nasty
> failures. There are errors in the logs about replication failures though;
> they should be benign in this case as no updates are taking place, but it's
> hard to tell what is going on exactly.
> Example:
>
> Dec 07, 2012 7:50:37 PM org.apache.solr.update.PeerSync handleResponse
> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr exception talking to http://192.168.0.101:8985/solr/adressage/, failed
> org.apache.solr.common.SolrException: Server at http://192.168.0.101:8985/solr/adressage returned non ok status:404, message:Not Found
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>     at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>     at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)
>
> Then I simulated various failure scenarios:
>
> - 1 Solr server stop/start
> - 2 Solr servers stop/start
> - 3 Solr servers stop/start: it seems that in this case, the Solr servers
>   *cannot* be restarted; more exactly, the restarted server will consider
>   that it is number 1 out of 4 and wait for the other 3 to come up. The only
>   way out is to stop it again, then stop all Zookeeper instances *and* clean
>   up their zkdata directories, start them, then start the Solr servers.
>
> I noticed that these zkdata directories had grown to 200 MB after a while.
> What exactly is in there besides the configuration data? Does it stop
> growing?
>
> Then I tried this:
>
> - kill 1 Zookeeper process
> - kill 2 Zookeeper processes
> - stop/start 1 Solr server
>
> When doing this, I experienced (many times) situations where the Solr
> servers could not reconnect and threw scary exceptions. The only way out
> was to restart the whole cluster.
>
> Q: when, if ever, is one supposed to clean up the zkdata directories?
>
> Here are the errors I found in the logs. It seems that some of them have
> been reported in JIRA but 4.1-trunk seems to experience basically the same
> issues as 4.0 in my test scenarios.
>
> Dec 07, 2012 8:03:59 PM org.apache.solr.update.PeerSync handleResponse
> WARNING: PeerSync: core=cachede url=http://192.168.0.101:8983/solr couldn't connect to http://192.168.0.101:8984/solr/cachede/, counting as success
> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
> SEVERE: Sync request error: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://192.168.0.101:8984/solr/cachede
> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
> SEVERE: http://192.168.0.101:8983/solr/cachede/: Could not tell a replica to recover:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://192.168.0.101:8984/solr
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:406)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>     at org.apache.solr.cloud.SyncStrategy$1.run(SyncStrategy.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)
> Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://192.168.0.101:8984 refused
>     at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
>     at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)
>     at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
>     at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:575)
>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:425)
>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
>     ... 5 more
> Caused by: java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
>     at java.net.Socket.connect(Socket.java:579)
>     at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123)
>     at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
>     ... 13 more
>
> Dec 07, 2012 8:03:59 PM org.apache.solr.update.PeerSync handleResponse
> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr got a 404 from http://192.168.0.101:8985/solr/adressage/, counting as success
> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
> SEVERE: Sync request error: org.apache.solr.common.SolrException: Server at http://192.168.0.101:8985/solr/adressage returned non ok status:404, message:Not Found
> Dec 07, 2012 8:04:00 PM org.apache.solr.update.PeerSync handleResponse
> WARNING: PeerSync: core=formabanque url=http://192.168.0.101:8983/solr got a 404 from http://192.168.0.101:8985/solr/formabanque/, counting as success
> Dec 07, 2012 8:04:00 PM org.apache.solr.common.SolrException log
> SEVERE: Sync request error: org.apache.solr.common.SolrException: Server at http://192.168.0.101:8985/solr/formabanque returned non ok status:404, message:Not Found
>
> Dec 07, 2012 8:04:32 PM org.apache.solr.update.PeerSync sync
> WARNING: no frame of reference to tell of we've missed updates
>
> Dec 07, 2012 8:03:58 PM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://192.168.0.101:8984/solr/adressage
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:406)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>     at org.apache.solr.cloud.RecoveryStrategy.commitOnLeader(RecoveryStrategy.java:182)
>     at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:134)
>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
>     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
> Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://192.168.0.101:8984 refused
>     at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
>     at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)
>     at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
>     at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:575)
>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:425)
>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
>     ... 6 more
> Caused by: java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
>     at java.net.Socket.connect(Socket.java:579)
>     at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123)
>     at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
>     ... 14 more
>
> Dec 07, 2012 8:03:58 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> SEVERE: Recovery failed - trying again... (0) core=adressage
>
> SEVERE: Error getting leader from zk
> org.apache.solr.common.SolrException: Could not get leader props
>     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:735)
>     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:699)
>     at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:664)
>     at org.apache.solr.cloud.ZkController.register(ZkController.java:603)
>     at org.apache.solr.cloud.ZkController.register(ZkController.java:558)
>     at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:791)
>     at org.apache.solr.core.CoreContainer.register(CoreContainer.java:775)
>     at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:567)
>     at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:562)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/adressage/leaders/shard1
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>     at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
>     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:713)
>     ... 16 more
>
> Dec 07, 2012 4:39:23 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
>     at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:159)
>     at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)
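
P.S. On the point in the reply that a 404 can be returned while Jetty is starting or stopping and that it is generally handled by retrying: here is a rough client-side sketch of that idea, in case it helps when reading the logs above. This is not Solr's own retry code. The class name `RetryOn404Sketch`, the method `requestWithRetry`, and the attempt/delay values are all made up for illustration, and the "request" is just a function returning an HTTP status code so the example runs without a cluster.

```java
import java.util.function.IntSupplier;

public class RetryOn404Sketch {

    /**
     * Hypothetical helper: calls the request until it returns a status other
     * than 404 or until maxAttempts calls have been made, doubling the pause
     * between attempts. Returns the last status code seen.
     */
    public static int requestWithRetry(IntSupplier request, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        int status = -1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = request.getAsInt();
            if (status != 404) {
                return status; // anything but 404 is handed back to the caller
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMs); // give the container time to finish starting
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                    return status;
                }
                delayMs *= 2;
            }
        }
        return status; // still 404 after the last attempt
    }

    public static void main(String[] args) {
        // Simulated node: answers 404 twice (still starting), then 200.
        int[] responses = {404, 404, 200};
        int[] calls = {0};
        int status = requestWithRetry(
                () -> responses[Math.min(calls[0]++, responses.length - 1)], 5, 10L);
        System.out.println("final status=" + status + " after " + calls[0] + " calls");
    }
}
```

In a real test client the `IntSupplier` would wrap the actual HTTP call; a cap on attempts matters because a 404 that never goes away (for example a core that is really missing) should surface as an error rather than loop forever.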