Hi Shawn, Erick,

It turned out that once we increased our indexing rate back to the
original full rate, the replicas went into recovery no matter what the
zk timeout setting was. Initially we thought that increasing the timeout
was helping, but apparently not. When we decreased the indexing rate,
fewer replicas went into recovery; once we resumed the full indexing
rate, almost all replicas went into recovery regardless of the zk
timeout or tickTime settings. We have reverted the tickTime to the
original 2 seconds.

So we investigated further, and after checking the logs we found the
exception below happening right before the recovery process is
initiated. We observed this on two different replicas that went into
recovery, and we are not sure whether it is a coincidence or a real
problem. Note that we were also putting some search query load on the
cluster while indexing, to trigger the recovery behavior.

22:00:32,493 INFO  [org.apache.solr.cloud.RecoveryStrategy]
(RecoveryThread) Finished recovery process. core=search1_shard5_replica2
22:00:32,503 INFO  [org.apache.solr.common.cloud.ZkStateReader]
(zkCallback-2-thread-66) A cluster state change: WatchedEvent
state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has
occurred - updating... (live nodes size: 22)
22:00:40,450 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: trigger
flush: activeBytes=101796784 deleteBytes=3061644 vs limit=104857600
22:00:40,450 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state
has 12530488 bytes; docInRAM=2051
22:00:40,450 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state
has 12984633 bytes; docInRAM=2205


22:00:40,861 ERROR [org.apache.solr.core.SolrCore] (http-/10.235.46.36:8580-32)
ClientAbortException: java.io.IOException: JBWEB002020: Invalid chunk
header
        at
org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:351)
        at
org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:422)
        at
org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:373)
        at
org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:193)
        at
org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
        at
org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
        at
org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:192)
        at
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:111)
        at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
        at
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
        at
org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
        at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
        at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
        at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
        at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:246)
        at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
        at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
        at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:149)
        at
org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:169)
        at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:145)
        at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97)
        at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:559)
        at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102)
        at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336)
        at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
        at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: JBWEB002020: Invalid chunk header
        at
org.apache.coyote.http11.filters.ChunkedInputFilter.parseChunkHeader(ChunkedInputFilter.java:281)
        at
org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:134)
        at
org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:697)
        at org.apache.coyote.Request.doRead(Request.java:438)
        at
org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:341)
        ... 31 more
22:00:40,864 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: new
segment has 0 deleted docs
22:00:40,865 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: new
segment has no vectors; no norms; no docValues; prox; freqs
22:00:40,865 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]:
flushedFiles=[_227_Lucene41_0.tip, _227.fdx, _227_Lucene41_0.tim, _227.fdt,
_227_Lucene41_0.doc, _227.fnm, _227_Lucene41_0.pos]
22:00:40,865 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: flushed
codec=Lucene410
22:00:40,865 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: flushed:
segment=_227 ramUsed=12.383 MB newFlushedSize(includes docstores)=4.444 MB
docs/MB=496.165
22:00:40,865 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [DW][http-/10.235.46.36:8580-27]:
publishFlushedSegment seg-private updates=null
22:00:40,865 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [IW][http-/10.235.46.36:8580-27]:
publishFlushedSegment
22:00:40,865 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [BD][http-/10.235.46.36:8580-27]: push deletes
 16553 deleted terms (unique count=16553) bytesUsed=222240 delGen=17
packetCount=1 totBytesUsed=222240
22:00:40,865 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [IW][http-/10.235.46.36:8580-27]: publish sets
newSegment delGen=18 seg=_227(4.10.0):C2205
22:00:40,865 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [IFD][http-/10.235.46.36:8580-27]: now
checkpoint "_12l(4.10.0):C676510/354682:delGen=208
_1wo(4.10.0):C420813/69819:delGen=40 _1bs(4.10.0):C229862/110117:delGen=77
_1ga(4.10.0):C322386/84434:delGen=91 _1nc(4.10.0):C209477/2949:delGen=52
_223(4.10.0):C206753/11:delGen=1 _1ug(4.10.0):C7185/1:delGen=1
_20n(4.10.0):C59384/2287:delGen=7 _1zm(4.10.0):C78927/1581:delGen=12
_20v(4.10.0):C4317 _20u(4.10.0):C5243/1:delGen=1 _20z(4.10.0):C3288
_20y(4.10.0):C4614/1:delGen=1 _21i(4.10.0):C2190 _21f(4.10.0):C16
_21g(4.10.0):C67 _21h(4.10.0):C18 _21o(4.10.0):C116
_21q(4.10.0):C1243/1:delGen=1 _21r(4.10.0):C39 _21s(4.10.0):C92
_21t(4.10.0):C1 _21u(4.10.0):C1 _222(4.10.0):C26988
_224(4.10.0):C28193/1:delGen=1 _225(4.10.0):C25966 _227(4.10.0):C2205" [27
segments ; isCommit = false]
22:00:40,866 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [IFD][http-/10.235.46.36:8580-27]: 0 msec to
checkpoint
22:00:40,866 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [TMP][http-/10.235.46.36:8580-27]: findMerges:
27 segments
10.235.46.36:8580-27]: now merge
22:00:40,867 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [CMS][http-/10.235.46.36:8580-27]:   index:
_12l(4.10.0):C676510/354682:delGen=208 _1wo(4.10.0):C420813/69819:delGen=40
_1bs(4.10.0):C229862/110117:delGen=77 _1ga(4.10.0):C322386/84434:delGen=91
_1nc(4.10.0):C209477/2949:delGen=52 _223(4.10.0):C206753/11:delGen=1
_1ug(4.10.0):C7185/1:delGen=1 _20n(4.10.0):C59384/2287:delGen=7
_1zm(4.10.0):C78927/1581:delGen=12 _20v(4.10.0):C4317
_20u(4.10.0):C5243/1:delGen=1 _20z(4.10.0):C3288
_20y(4.10.0):C4614/1:delGen=1 _21i(4.10.0):C2190 _21f(4.10.0):C16
_21g(4.10.0):C67 _21h(4.10.0):C18 _21o(4.10.0):C116
_21q(4.10.0):C1243/1:delGen=1 _21r(4.10.0):C39 _21s(4.10.0):C92
_21t(4.10.0):C1 _21u(4.10.0):C1 _222(4.10.0):C26988
_224(4.10.0):C28193/1:delGen=1 _225(4.10.0):C25966 _227(4.10.0):C2205
22:00:40,867 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [CMS][http-/10.235.46.36:8580-27]:   no more
merges pending; now return
22:00:41,885 INFO  [org.apache.solr.handler.admin.CoreAdminHandler]
(http-/10.235.46.36:8580-32) It has been requested that we recover:
core=search1_shard5_replica2
22:00:41,886 INFO  [org.apache.solr.servlet.SolrDispatchFilter]
(http-/10.235.46.36:8580-32) [admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERY&core=search1_shard5_replica2&wt=javabin&version=2}
status=0 QTime=1
22:00:41,886 INFO  [org.apache.solr.cloud.ZkController] (Thread-472)
publishing core=search1_shard5_replica2 state=recovering collection=search1
22:00:41,886 INFO  [org.apache.solr.cloud.ZkController] (Thread-472)
numShards not found on descriptor - reading it from system property

22:00:41,889 INFO  [org.apache.solr.common.cloud.ZkStateReader]
(zkCallback-2-thread-66) A cluster state change: WatchedEvent
state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has
occurred - updating... (live nodes size: 22)
22:00:41,890 INFO  [org.apache.solr.cloud.ZkController] (Thread-472) Wrote
recovering to
/collections/search1/leader_initiated_recovery/shard5/core_node223
22:00:41,892 INFO  [org.apache.solr.update.DefaultSolrCoreState]
(Thread-472) Running recovery - first canceling any ongoing recovery
22:00:41,893 INFO  [org.apache.solr.cloud.RecoveryStrategy]
(RecoveryThread) Starting recovery process.  core=search1_shard5_replica2
recoveringAfterStartup=false
22:00:41,894 INFO  [org.apache.solr.cloud.RecoveryStrategy]
(RecoveryThread) Publishing state of core search1_shard5_replica2 as
recovering, leader is http://XXXXXXXXXXCHANGEDONPURPOSEXXXX:8680/solr/search1/
and I am http://solrx341p.qa.ch3.s.com:8580/solr/search1_shard5_replica2/
22:00:41,894 INFO  [org.apache.solr.cloud.ZkController] (RecoveryThread)
publishing core=search1_shard5_replica2 state=recovering collection=search1
22:00:41,894 INFO  [org.apache.solr.cloud.ZkController] (RecoveryThread)
numShards not found on descriptor - reading it from system property
22:00:41,896 INFO  [org.apache.solr.cloud.RecoveryStrategy]
(RecoveryThread) Sending prep recovery command to
http://XXXXXXXXXCHANGEDONPURPOSEXXXXXXXX:8680/solr; WaitForState:
action=PREPRECOVERY&core=search1&nodeName=XXXXXXXXXXXXXXCHANGEDONPURPOSE
XXXXXXXXXXXXXXX%3A8580_solr&coreNodeName=core_node223&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
22:00:42,006 INFO  [org.apache.solr.common.cloud.ZkStateReader]
(zkCallback-2-thread-66) A cluster state change: WatchedEvent
state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has
occurred - updating... (live nodes size: 22)
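
If we read the stack trace correctly, the "Invalid chunk header"
IOException from ChunkedInputFilter means the chunked HTTP body of an
update request was truncated or corrupted mid-stream, i.e. whoever was
sending the update (the leader forwarding documents, or the indexing
client) broke the connection while this replica was still reading. Right
after the failed update, the leader asks the core to recover through the
CoreAdmin REQUESTRECOVERY call that shows up in the log above. For what
it is worth, the same call can be issued by hand (hostname is a
placeholder):

curl "http://solr-host:8580/solr/admin/cores?action=REQUESTRECOVERY&core=search1_shard5_replica2"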




15 GB of heap is allocated to these searchers, and usage hardly goes
over 8 GB. Full GC does not happen. We write GC logs to a separate file
and also monitored these processes via VisualVM; they seem to have
enough memory.
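
For completeness, a sketch of the JVM flags we would expect for this
kind of GC logging on a Java 7 HotSpot JVM; the heap size matches what
is described above, and the log path is illustrative:

-Xms15g -Xmx15g
-Xloggc:/var/log/solr/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime

PrintGCApplicationStoppedTime is worth having here because it reports
total stop-the-world time, which is what would make a node miss its
ZooKeeper heartbeat and get its session expired.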


On Mon, Jan 26, 2015 at 5:13 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 1/26/2015 2:26 PM, Vijay Sekhri wrote:
> > Hi Erick,
> > In solr.xml file I had zk timeout set to <int
> > name="zkClientTimeout">${zkClientTimeout:450000}</int>
> > One thing that made a it a bit better now is the zk tick time and
> > syncLimit settings. I set it to a higher value as below. This may not
> > be advisable though.
> >
> > tickTime=30000
> > initLimit=30
> > syncLimit=20
> >
> > Now we observed that replicas do not go in recovery that often as
> > before. In the whole cluster at a given time I would have a couple of
> > replicas in recovery whereas earlier it were multiple replicas from
> > every shard .
> > On the wiki https://wiki.apache.org/solr/SolrCloud it says "The
> > maximum is 20 times the tickTime." in the FAQ, so I decided to
> > increase the tick time. Is this the correct approach?
>
> The default zkClientTimeout on recent Solr versions is 30 seconds, up
> from 15 in slightly older releases.
>
> Those values of 15 or 30 seconds are a REALLY long time in computer
> terms, and if you are exceeding that timeout on a regular basis,
> something is VERY wrong with your Solr install.  Rather than take steps
> to increase your timeout beyond the normal maximum of 40 seconds (20
> times a tickTime of 2 seconds), figure out why you're exceeding that
> timeout and fix the performance problem.  The zkClientTimeout value that
> you have set, 450 seconds, is seven and a half *MINUTES*.  Nothing in
> Solr should ever take that long.
>
> "Not enough memory in the server" is by far the most common culprit for
> performance issues.  Garbage collection pauses are a close second.
>
> I don't actually know this next part for sure, because I've never looked
> into the code, but I believe that increasing the tickTime, especially to
> a value 15 times higher than default, might make all zookeeper
> operations a lot slower.
>
> Thanks,
> Shawn
>
>


-- 
*********************************************
Vijay Sekhri
*********************************************
