We have a cluster running SolrCloud 4.7 (built 2/25): 10 shards with 2 replicas each (20 shard replicas total), at roughly 20GB per shard.
We index around 1k-1.5k documents/second into this cluster constantly. To manage growth we have a scheduled job that runs every 3 hours to prune documents based on business rules. Lately this job has started failing. It runs several facet queries before its delete queries, and we're generally deleting ~10k documents at a time.
Auto hard commits are set to every 60 seconds and auto soft commits to every 10 seconds. Each node has enough RAM to page cache the entire data set. We run multiple JVMs per node to help with GC.
While the pruning job is running, it now completely wedges the UpdateHandler: indexing stops and takes 20-60 minutes to recover, and the prune job itself encounters multiple read timeouts.
My guess is that the UpdateHandler blocks because shards are going into recovery, since they can't keep up with the documents sent over replication after hard commits. I suspect the issue is either updates/replication or shard size, because we have another (larger) cluster with 5GB/shard and no replication that seems to handle load better.
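For reference, the delete step of a prune job like the one described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual job: the collection URL, the field name `created_at`, and the 30-day rule are all hypothetical. It builds a JSON delete-by-query body for Solr's update handler and prepares the HTTP request without sending it.

```python
import json
import urllib.request

# Hypothetical endpoint; the real host, port, and collection name will differ.
UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def build_delete_by_query(query: str) -> bytes:
    """Return a JSON body asking Solr's update handler to delete by query."""
    return json.dumps({"delete": {"query": query}}).encode("utf-8")


# Hypothetical business rule: drop documents older than 30 days.
body = build_delete_by_query("created_at:[* TO NOW-30DAYS]")

# Prepare the request only. Nothing is sent until urlopen() is called.
# No commit parameter is added, so visibility of the deletes is left to
# the auto hard/soft commit settings on the server.
request = urllib.request.Request(
    UPDATE_URL,
    data=body,
    headers={"Content-Type": "application/json"},
)

print(request.get_method(), request.full_url)
print(body.decode("utf-8"))
```

Leaving out an explicit commit keeps the job from forcing extra searcher reopens on top of the 10-second soft commits.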
Some logs from 2 of the 4 -- the other nodes have similar logs to these with SnapPuller / PeerSync on one and Connection Reset errors on the other:
Mar 10 20:13:35 solr-5e.i.jobcorp.com [Thread-29775] org.apache.solr.cloud.RecoveryStrategy Stopping recovery for zkNodeName=core_node25core=solr_shard10_8987
Mar 10 20:13:35 solr-5e.i.jobcorp.com [Thread-29772] org.apache.solr.cloud.RecoveryStrategy Stopping recovery for zkNodeName=core_node25core=solr_shard10_8987
Mar 10 21:05:45 solr-5e.i.jobcorp.com [Thread-37627] org.apache.solr.cloud.RecoveryStrategy Stopping recovery for zkNodeName=core_node21core=solr_shard6_8983
Mar 10 21:05:47 solr-5e.i.jobcorp.com [RecoveryThread] org.apache.solr.update.PeerSync PeerSync: core=solr_shard6_8983 url= http://solr-5e.i.jobcorp.com:8983/solr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
Mar 10 21:05:47 solr-5e.i.jobcorp.com [RecoveryThread] org.apache.solr.handler.SnapPuller File _fqp9x_Lucene41_0.tip expected to be 495806 while it is 107332
Mar 10 21:07:33 solr-5e.i.jobcorp.com [recoveryExecutor-6-thread-4] org.apache.solr.update.UpdateLog Starting log replay tlog{file=/mnt/solr/data/solr_shard6_8983/tlog/tlog.0000000000000005547 refcount=2} active=true starting pos=2249142
Mar 10 21:08:06 solr-5e.i.jobcorp.com [recoveryExecutor-6-thread-4] org.apache.solr.update.UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=45282 deletes=0 deleteByQuery=252 errors=0 positionOfStart=2249142}
Mar 11 00:08:24 solr-5e.i.jobcorp.com [RecoveryThread] org.apache.solr.update.PeerSync PeerSync: core=solr_shard8_8985 url= http://solr-5e.i.jobcorp.com:8985/solr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
Mar 11 00:09:20 solr-5e.i.jobcorp.com [commitScheduler-8-thread-1] org.apache.solr.core.SolrCore [solr_shard8_8985] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
Mar 11 00:09:29 solr-5e.i.jobcorp.com [recoveryExecutor-6-thread-6] org.apache.solr.update.UpdateLog Starting log replay tlog{file=/mnt/solr/data/solr_shard8_8985/tlog/tlog.0000000000000005717 refcount=2} active=true starting pos=1329158
Mar 11 00:09:31 solr-5e.i.jobcorp.com [recoveryExecutor-6-thread-6] org.apache.solr.core.SolrCore [solr_shard8_8985] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
Mar 11 00:09:50 solr-5e.i.jobcorp.com [recoveryExecutor-6-thread-6] org.apache.solr.update.UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=8069 deletes=0 deleteByQuery=14 errors=0 positionOfStart=1329158}
Different node:
Mar 11 02:36:32 solr-3d.i.jobcorp.com [updateExecutor-1-thread-74378] org.apache.solr.update.StreamingSolrServers error
Mar 11 02:36:32 solr-3d.i.jobcorp.com java.net.SocketException: Connection reset
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at java.net.SocketInputStream.read(SocketInputStream.java:196)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at java.net.SocketInputStream.read(SocketInputStream.java:122)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:232)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Mar 11 02:36:32 solr-3d.i.jobcorp.com     at java.lang.Thread.run(Thread.java:724)
Mar 11 02:36:32 solr-3d.i.jobcorp.com [qtp653085562-645554] org.apache.solr.servlet.SolrDispatchFilter null:java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:196)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttp
Any help/thoughts appreciated, @ralphtice
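For anyone comparing recoveries across nodes, the `RecoveryInfo` entries in the logs above can be tallied with a small script. This is only a rough sketch: the regular expression assumes the exact field order seen in the two entries quoted here, and the helper name `parse_recovery_info` is made up for illustration.

```python
import re

# Matches the RecoveryInfo{...} payload as it appears in the UpdateLog lines above.
RECOVERY_INFO = re.compile(
    r"RecoveryInfo\{adds=(\d+) deletes=(\d+) deleteByQuery=(\d+) "
    r"errors=(\d+) positionOfStart=(\d+)\}"
)
FIELDS = ("adds", "deletes", "deleteByQuery", "errors", "positionOfStart")


def parse_recovery_info(line):
    """Return the RecoveryInfo counters as a dict, or None if the line has none."""
    match = RECOVERY_INFO.search(line)
    if match is None:
        return None
    return dict(zip(FIELDS, map(int, match.groups())))


# The two "Log replay finished" payloads from the solr-5e logs.
lines = [
    "recoveryInfo=RecoveryInfo{adds=45282 deletes=0 deleteByQuery=252 errors=0 positionOfStart=2249142}",
    "recoveryInfo=RecoveryInfo{adds=8069 deletes=0 deleteByQuery=14 errors=0 positionOfStart=1329158}",
]

infos = [info for info in map(parse_recovery_info, lines) if info is not None]
total_adds = sum(info["adds"] for info in infos)
total_dbq = sum(info["deleteByQuery"] for info in infos)

print(total_adds, total_dbq)  # prints "53351 266"
```

Summing the `deleteByQuery` counter per replay gives a quick view of how many delete queries each recovering replica had to re-apply.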