We have a cluster running SolrCloud 4.7 (built 2/25): 10 shards with 2
replicas each (20 cores total) at about 20GB/shard.

We index around 1k-1.5k documents/second into this cluster constantly.  To
manage growth we have a scheduled job that runs every 3 hours to prune
documents based on business rules.  Lately this job has started failing.
The job runs several facet queries before its delete queries, and we're
generally deleting ~10k documents at a time.  Auto hard commits are set to
every 60 seconds and auto soft commits to every 10 seconds.  Each node has enough
RAM to page cache the entire data set.  We run multiple JVMs per node to
help with GC.

Recently, whenever our pruning job runs, it completely wedges the
UpdateHandler.  Indexing stops and takes 20-60 minutes to recover.  The
prune job itself encounters multiple read timeouts.

My guess is that the UpdateHandler blocks because shards go into
recovery when they can't keep up with the documents sent over
replication after hard commits.  I suspect the issue is either update
replication or shard size, because we have another (larger) cluster with
5GB/shard and no replication that seems to handle load better.
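
A rough back-of-the-envelope check of that guess, as a sketch only.  It
assumes documents are spread evenly across the 10 shards and that PeerSync
can only catch up across a window of about 100 recent updates (which I
believe is the 4.x default, but treat that number as an assumption).  The
function name is made up for illustration.

```python
# Sketch: how many seconds of indexing fit inside the PeerSync window?
# PEERSYNC_WINDOW is an ASSUMPTION (believed 4.x default), not a measured value.
PEERSYNC_WINDOW = 100   # updates a replica may be behind and still PeerSync
NUM_SHARDS = 10         # from the cluster description above


def seconds_covered(cluster_docs_per_sec, shards=NUM_SHARDS, window=PEERSYNC_WINDOW):
    """Seconds of per-shard update traffic that fit in the sync window,
    assuming documents are routed evenly across shards."""
    per_shard_rate = cluster_docs_per_sec / shards
    return window / per_shard_rate


if __name__ == "__main__":
    for rate in (1000, 1500):
        print(f"{rate} docs/s cluster-wide: {seconds_covered(rate):.2f} s of updates")
```

If those assumptions hold, a replica that stalls for more than about a
second under our normal load would fall outside the window and have to do
a full index replication instead, which would be consistent with the
PeerSync messages in the logs below.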

Some logs from 2 of the 4 nodes follow.  The other nodes have similar logs,
with SnapPuller / PeerSync messages on one and Connection Reset errors on the other:

Mar 10 20:13:35 solr-5e.i.jobcorp.com [Thread-29775]
org.apache.solr.cloud.RecoveryStrategy Stopping recovery for
zkNodeName=core_node25core=solr_shard10_8987
Mar 10 20:13:35 solr-5e.i.jobcorp.com [Thread-29772]
org.apache.solr.cloud.RecoveryStrategy Stopping recovery for
zkNodeName=core_node25core=solr_shard10_8987
Mar 10 21:05:45 solr-5e.i.jobcorp.com [Thread-37627]
org.apache.solr.cloud.RecoveryStrategy Stopping recovery for
zkNodeName=core_node21core=solr_shard6_8983
Mar 10 21:05:47 solr-5e.i.jobcorp.com [RecoveryThread]
org.apache.solr.update.PeerSync PeerSync: core=solr_shard6_8983 url=
http://solr-5e.i.jobcorp.com:8983/solr too many updates received since
start - startingUpdates no longer overlaps with our currentUpdates
Mar 10 21:05:47 solr-5e.i.jobcorp.com [RecoveryThread]
org.apache.solr.handler.SnapPuller File _fqp9x_Lucene41_0.tip expected to
be 495806 while it is 107332
Mar 10 21:07:33 solr-5e.i.jobcorp.com [recoveryExecutor-6-thread-4]
org.apache.solr.update.UpdateLog Starting log replay
tlog{file=/mnt/solr/data/solr_shard6_8983/tlog/tlog.0000000000000005547
refcount=2} active=true starting pos=2249142
Mar 10 21:08:06 solr-5e.i.jobcorp.com [recoveryExecutor-6-thread-4]
org.apache.solr.update.UpdateLog Log replay finished.
recoveryInfo=RecoveryInfo{adds=45282 deletes=0 deleteByQuery=252 errors=0
positionOfStart=2249142}
Mar 11 00:08:24 solr-5e.i.jobcorp.com [RecoveryThread]
org.apache.solr.update.PeerSync PeerSync: core=solr_shard8_8985 url=
http://solr-5e.i.jobcorp.com:8985/solr too many updates received since
start - startingUpdates no longer overlaps with our currentUpdates
Mar 11 00:09:20 solr-5e.i.jobcorp.com [commitScheduler-8-thread-1]
org.apache.solr.core.SolrCore [solr_shard8_8985] PERFORMANCE WARNING:
Overlapping onDeckSearchers=2
Mar 11 00:09:29 solr-5e.i.jobcorp.com [recoveryExecutor-6-thread-6]
org.apache.solr.update.UpdateLog Starting log replay
tlog{file=/mnt/solr/data/solr_shard8_8985/tlog/tlog.0000000000000005717
refcount=2} active=true starting pos=1329158
Mar 11 00:09:31 solr-5e.i.jobcorp.com [recoveryExecutor-6-thread-6]
org.apache.solr.core.SolrCore [solr_shard8_8985] PERFORMANCE WARNING:
Overlapping onDeckSearchers=2
Mar 11 00:09:50 solr-5e.i.jobcorp.com [recoveryExecutor-6-thread-6]
org.apache.solr.update.UpdateLog Log replay finished.
recoveryInfo=RecoveryInfo{adds=8069 deletes=0 deleteByQuery=14 errors=0
positionOfStart=1329158}
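
To tally what each replay had to re-apply, a small script along these lines
could pull the counters out of the "Log replay finished" entries.  This is
a sketch: the regex is written against the RecoveryInfo format shown in the
lines above, and the function name is hypothetical.

```python
import re

# Matches the RecoveryInfo{...} payload as it appears in the log lines above.
# \s+ is used between fields because the entries wrap across lines.
REPLAY_RE = re.compile(
    r"RecoveryInfo\{adds=(\d+)\s+deletes=(\d+)\s+deleteByQuery=(\d+)\s+errors=(\d+)"
)

FIELDS = ("adds", "deletes", "deleteByQuery", "errors")


def replay_counters(log_text):
    """Return one dict of integer counters per RecoveryInfo entry found."""
    return [
        dict(zip(FIELDS, (int(v) for v in match.groups())))
        for match in REPLAY_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "recoveryInfo=RecoveryInfo{adds=45282 deletes=0 deleteByQuery=252 errors=0\n"
        "positionOfStart=2249142}\n"
        "recoveryInfo=RecoveryInfo{adds=8069 deletes=0 deleteByQuery=14 errors=0\n"
        "positionOfStart=1329158}\n"
    )
    for counters in replay_counters(sample):
        print(counters)
```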

Different node:
Mar 11 02:36:32 solr-3d.i.jobcorp.com [updateExecutor-1-thread-74378]
org.apache.solr.update.StreamingSolrServers error
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011java.net.SocketException:
Connection reset
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
java.net.SocketInputStream.read(SocketInputStream.java:196)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
java.net.SocketInputStream.read(SocketInputStream.java:122)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:232)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Mar 11 02:36:32 solr-3d.i.jobcorp.com #011at
java.lang.Thread.run(Thread.java:724)
Mar 11 02:36:32 solr-3d.i.jobcorp.com [qtp653085562-645554]
org.apache.solr.servlet.SolrDispatchFilter null:java.net.SocketException:
Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:196)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttp

Any help/thoughts appreciated,

@ralphtice
