Solr version: 4.3.1
Number of shards: 10
Replicas per shard: 1
Heap size: 15GB
Machine RAM: 30GB
ZooKeeper timeout: 45 seconds
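For reference, the 45-second ZooKeeper timeout can be mirrored on a SolrJ client like this - a minimal sketch, not our actual feeder code; the ZooKeeper addresses are placeholders and the collection name is the one that appears in the logs below (CloudSolrServer sends updates to the shard leaders by default, which matches how we index):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;

    public class ClientSetup {
        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper ensemble address - not our real hosts
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            // Collection name taken from the logs below
            solr.setDefaultCollection("sessionfilterset");
            // Mirror the 45-second ZooKeeper session timeout on the client side
            solr.setZkClientTimeout(45000);
            solr.connect();
            solr.shutdown();
        }
    }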
We are continuing the fight to keep our Solr setup functioning, and as part of that we have made significant changes to our schema to reduce the amount of data we write. I set up a new cluster to reindex our data. Initially I ran the import with no replicas and achieved quite impressive results: a peak of 60,000 new documents per minute, no shard losses, and no outages due to garbage collection (an issue we do see in production). At the end of the load the index stood at 97,000,000 documents and 20GB per shard. During the highest insertion rates querying suffered, but that is not a concern right now.

I have now added one replica for each shard. Indexing time has doubled - not surprising, and since it was so good to start with, not a problem. I continue to write only to the leaders. The issue is that the replicas are continually going into recovery. The leaders show:

ERROR - 2014-02-14 11:47:45.757; org.apache.solr.common.SolrException; shard update error StdNode: http://10.35.133.176:8983/solr/sessionfilterset/:org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://10.35.133.176:8983/solr/sessionfilterset
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:401)
        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:375)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
        ... 11 more
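For context on the write path: as the SolrCmdDistributor frames above show, the leader forwards every add to its replica before acknowledging the request, which is also why indexing time doubled once the replicas appeared. A rough sketch of the kind of batched feed involved, assuming SolrJ's HttpSolrServer pointed at one leader core (the URL echoes the one in the trace; the batch size and field are illustrative):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class LeaderFeed {
        public static void main(String[] args) throws Exception {
            // Core URL is illustrative - each feeder targets a leader directly
            HttpSolrServer leader = new HttpSolrServer("http://10.35.133.176:8983/solr/sessionfilterset");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "session-" + i); // placeholder field/values
                batch.add(doc);
            }
            // The leader indexes the batch and forwards it to its replica;
            // a replica that fails to respond surfaces as the
            // NoHttpResponseException shown above.
            leader.add(batch);
            leader.shutdown();
        }
    }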
The replica is not busy garbage collecting: the failures don't coincide with a full GC, and collection times are low. The replica appears to be accepting adds until milliseconds before this appears in its log:

INFO - 2014-02-14 11:59:54.366; org.apache.solr.handler.admin.CoreAdminHandler; It has been requested that we recover

I have reduced the load to 5,000 documents per minute and the replicas still only stay up for a couple of minutes. I would like to be confident that we could handle more than this during our peak times.

Initially I was getting connection reset errors on the leaders, but I changed the Jetty connector to the NIO one (see the sketch at the end of this message) and the message above is what I receive now. I have also increased the request and response header sizes.

Any ideas - other than not using replicas, as proposed by a colleague?

Thanks very much in advance.

-- 
Annette Newton
Database Administrator
ServiceTick Ltd

T: +44 (0)1603 618326
Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ
www.servicetick.com
www.sessioncam.com
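P.S. The connector change mentioned above, in rough embedded-Jetty terms - the real change lives in Jetty's XML config rather than Java, Solr 4.3.1 bundles Jetty 8, and the header sizes here are illustrative:

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.nio.SelectChannelConnector;

    public class NioConnector {
        public static void main(String[] args) throws Exception {
            Server server = new Server();
            // NIO connector in place of the blocking SocketConnector
            SelectChannelConnector connector = new SelectChannelConnector();
            connector.setPort(8983);
            // Raised request/response header sizes (values illustrative)
            connector.setRequestHeaderSize(32768);
            connector.setResponseHeaderSize(32768);
            server.addConnector(connector);
            server.start();
            server.join();
        }
    }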