Hi Shawn, Erick,

So it turned out that once we increased our indexing rate back to the original full rate, the replicas went back into recovery no matter what the zk timeout setting was. Initially we thought that increasing the timeout was helping, but apparently not. We then decreased the indexing rate, and that caused fewer replicas to go into recovery. Once we resumed our full indexing rate, almost all the replicas went into recovery again, no matter what the zk timeout or the tickTime settings were. We have reverted the tickTime back to the original 2 seconds.
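For reference, this is roughly where those settings live on our side; the values below are just the ones already mentioned in this thread, not a recommendation (zoo.cfg on the ZooKeeper ensemble, solr.xml on the Solr nodes):

    # zoo.cfg - tickTime reverted to the ZooKeeper default of 2 seconds
    tickTime=2000

    <!-- solr.xml - zkClientTimeout as set in my earlier mail quoted below -->
    <int name="zkClientTimeout">${zkClientTimeout:450000}</int>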
So we investigated further, and after checking the logs we found the exception below happening right before the recovery process is initiated. We observed this on two different replicas that went into recovery. We are not sure if this is a coincidence or a real problem. Note that we were also putting some search query load on the cluster while indexing, to trigger the recovery behavior.

22:00:32,493 INFO [org.apache.solr.cloud.RecoveryStrategy] (rRecoveryThread) Finished recovery process. core=search1_shard5_replica2
22:00:32,503 INFO [org.apache.solr.common.cloud.ZkStateReader] (zkCallback-2-thread-66) A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 22)
22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: trigger flush: activeBytes=101796784 deleteBytes=3061644 vs limit=104857600
22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state has 12530488 bytes; docInRAM=2051
22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state has 12984633 bytes; docInRAM=2205
22:00:40,861 ERROR [org.apache.solr.core.SolrCore] (http-/10.235.46.36:8580-32) ClientAbortException: *java.io.IOException: JBWEB002020: Invalid chunk header*
    at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:351)
    at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:422)
    at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:373)
    at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:193)
    at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
    at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
    at org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:192)
    at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:111)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
    at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
    at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:246)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:149)
    at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:169)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:145)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:559)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: JBWEB002020: Invalid chunk header
    at org.apache.coyote.http11.filters.ChunkedInputFilter.parseChunkHeader(ChunkedInputFilter.java:281)
    at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:134)
    at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:697)
    at org.apache.coyote.Request.doRead(Request.java:438)
    at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:341)
    ... 31 more
22:00:40,864 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: new segment has 0 deleted docs
22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: new segment has no vectors; no norms; no docValues; prox; freqs
22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: flushedFiles=[_227_Lucene41_0.tip, _227.fdx, _227_Lucene41_0.tim, _227.fdt, _227_Lucene41_0.doc, _227.fnm, _227_Lucene41_0.pos]
22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: flushed codec=Lucene410
22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: flushed: segment=_227 ramUsed=12.383 MB newFlushedSize(includes docstores)=4.444 MB docs/MB=496.165
22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DW][http-/10.235.46.36:8580-27]: publishFlushedSegment seg-private updates=null
22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IW][http-/10.235.46.36:8580-27]: publishFlushedSegment
22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [BD][http-/10.235.46.36:8580-27]: push deletes 16553 deleted terms (unique count=16553) bytesUsed=222240 delGen=17 packetCount=1 totBytesUsed=222240
22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IW][http-/10.235.46.36:8580-27]: publish sets newSegment delGen=18 seg=_227(4.10.0):C2205
22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IFD][http-/10.235.46.36:8580-27]: now checkpoint "_12l(4.10.0):C676510/354682:delGen=208 _1wo(4.10.0):C420813/69819:delGen=40 _1bs(4.10.0):C229862/110117:delGen=77 _1ga(4.10.0):C322386/84434:delGen=91 _1nc(4.10.0):C209477/2949:delGen=52 _223(4.10.0):C206753/11:delGen=1 _1ug(4.10.0):C7185/1:delGen=1 _20n(4.10.0):C59384/2287:delGen=7 _1zm(4.10.0):C78927/1581:delGen=12 _20v(4.10.0):C4317 _20u(4.10.0):C5243/1:delGen=1 _20z(4.10.0):C3288 _20y(4.10.0):C4614/1:delGen=1 _21i(4.10.0):C2190 _21f(4.10.0):C16 _21g(4.10.0):C67 _21h(4.10.0):C18 _21o(4.10.0):C116 _21q(4.10.0):C1243/1:delGen=1 _21r(4.10.0):C39 _21s(4.10.0):C92 _21t(4.10.0):C1 _21u(4.10.0):C1 _222(4.10.0):C26988 _224(4.10.0):C28193/1:delGen=1 _225(4.10.0):C25966 _227(4.10.0):C2205" [27 segments ; isCommit = false]
22:00:40,866 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IFD][http-/10.235.46.36:8580-27]: 0 msec to checkpoint
22:00:40,866 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [TMP][http-/10.235.46.36:8580-27]: findMerges: 27 segments
10.235.46.36:8580-27]: now merge
22:00:40,867 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [CMS][http-/10.235.46.36:8580-27]: index: _12l(4.10.0):C676510/354682:delGen=208 _1wo(4.10.0):C420813/69819:delGen=40 _1bs(4.10.0):C229862/110117:delGen=77 _1ga(4.10.0):C322386/84434:delGen=91 _1nc(4.10.0):C209477/2949:delGen=52 _223(4.10.0):C206753/11:delGen=1 _1ug(4.10.0):C7185/1:delGen=1 _20n(4.10.0):C59384/2287:delGen=7 _1zm(4.10.0):C78927/1581:delGen=12 _20v(4.10.0):C4317 _20u(4.10.0):C5243/1:delGen=1 _20z(4.10.0):C3288 _20y(4.10.0):C4614/1:delGen=1 _21i(4.10.0):C2190 _21f(4.10.0):C16 _21g(4.10.0):C67 _21h(4.10.0):C18 _21o(4.10.0):C116 _21q(4.10.0):C1243/1:delGen=1 _21r(4.10.0):C39 _21s(4.10.0):C92 _21t(4.10.0):C1 _21u(4.10.0):C1 _222(4.10.0):C26988 _224(4.10.0):C28193/1:delGen=1 _225(4.10.0):C25966 _227(4.10.0):C2205
22:00:40,867 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [CMS][http-/10.235.46.36:8580-27]: no more merges pending; now return
22:00:41,885 INFO [org.apache.solr.handler.admin.CoreAdminHandler] (http-/10.235.46.36:8580-32) It has been requested that we recover: core=search1_shard5_replica2
22:00:41,886 INFO [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.235.46.36:8580-32) [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=search1_shard5_replica2&wt=javabin&version=2} status=0 QTime=1
22:00:41,886 INFO [org.apache.solr.cloud.ZkController] (Thread-472) publishing core=search1_shard5_replica2 state=recovering collection=search1
22:00:41,886 INFO [org.apache.solr.cloud.ZkController] (Thread-472) numShards not found on descriptor - reading it from system property
22:00:41,889 INFO [org.apache.solr.common.cloud.ZkStateReader] (zkCallback-2-thread-66) A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 22)
22:00:41,890 INFO [org.apache.solr.cloud.ZkController] (Thread-472) *Wrote recovering to /collections/search1/leader_initiated_recovery/shard5/core_node223*
22:00:41,892 INFO [org.apache.solr.update.DefaultSolrCoreState] (Thread-472) Running recovery - first canceling any ongoing recovery
22:00:41,893 INFO [org.apache.solr.cloud.RecoveryStrategy] (RecoveryThread) Starting recovery process. core=search1_shard5_replica2 recoveringAfterStartup=false
22:00:41,894 INFO [org.apache.solr.cloud.RecoveryStrategy] (RecoveryThread) Publishing state of core search1_shard5_replica2 as recovering, leader is http://XXXXXXXXXXCHANGEDONPURPOSEXXXX:8680/solr/search1/ and I am http://solrx341p.qa.ch3.s.com:8580/solr/search1_shard5_replica2/
22:00:41,894 INFO [org.apache.solr.cloud.ZkController] (RecoveryThread) publishing core=search1_shard5_replica2 state=recovering collection=search1
22:00:41,894 INFO [org.apache.solr.cloud.ZkController] (RecoveryThread) numShards not found on descriptor - reading it from system property
22:00:41,896 INFO [org.apache.solr.cloud.RecoveryStrategy] (RecoveryThread) Sending prep recovery command to http://XXXXXXXXXCHANGEDONPURPOSEXXXXXXXX:8680/solr; WaitForState: action=PREPRECOVERY&core=search1&nodeName=XXXXXXXXXXXXXXCHANGEDONPURPOSEXXXXXXXXXXXXXXX%3A8580_solr&coreNodeName=core_node223&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
22:00:42,006 INFO [org.apache.solr.common.cloud.ZkStateReader] (zkCallback-2-thread-66) A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 22)

15 GB of heap is allocated to these searchers and usage hardly ever goes over 8 GB. Full GC does not happen. We write GC logs to a separate file (roughly the JVM flags pasted below my signature) and also monitor these processes via VisualVM. They seem to have enough memory.

On Mon, Jan 26, 2015 at 5:13 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 1/26/2015 2:26 PM, Vijay Sekhri wrote:
> > Hi Erick,
> > In the solr.xml file I had the zk timeout set to
> > <int name="zkClientTimeout">${zkClientTimeout:450000}</int>
> > One thing that made it a bit better now is the zk tickTime and
> > syncLimit settings. I set them to higher values as below. This may not
> > be advisable though.
> >
> > tickTime=30000
> > initLimit=30
> > syncLimit=20
> >
> > Now we observed that replicas do not go into recovery as often as
> > before. In the whole cluster at a given time I would have a couple of
> > replicas in recovery, whereas earlier there were multiple replicas from
> > every shard.
> > On the wiki https://wiki.apache.org/solr/SolrCloud it says "The
> > maximum is 20 times the tickTime." in the FAQ, so I decided to increase
> > the tickTime. Is this the correct approach?
>
> The default zkClientTimeout on recent Solr versions is 30 seconds, up
> from 15 in slightly older releases.
>
> Those values of 15 or 30 seconds are a REALLY long time in computer
> terms, and if you are exceeding that timeout on a regular basis,
> something is VERY wrong with your Solr install. Rather than take steps
> to increase your timeout beyond the normal maximum of 40 seconds (20
> times a tickTime of 2 seconds), figure out why you're exceeding that
> timeout and fix the performance problem. The zkClientTimeout value that
> you have set, 450 seconds, is seven and a half *MINUTES*. Nothing in
> Solr should ever take that long.
>
> "Not enough memory in the server" is by far the most common culprit for
> performance issues. Garbage collection pauses are a close second.
>
> I don't actually know this next part for sure, because I've never looked
> into the code, but I believe that increasing the tickTime, especially to
> a value 15 times higher than default, might make all zookeeper
> operations a lot slower.
>
> Thanks,
> Shawn
>

--
*********************************************
Vijay Sekhri
*********************************************
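P.S. The GC logging mentioned above is captured with roughly the following standard HotSpot flags (the log path is a placeholder, and this assumes Oracle/OpenJDK 7; pasted only so you can see what we are collecting):

    -verbose:gc
    -Xloggc:/path/to/solr-gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime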