Whenever I see "memory consumption changed", my first question is: are
any fields that you sort, group, or facet on defined with
docValues="false"? I consider this unlikely since one of the recent
changes was to default docValues to "true" for primitive types, but
it's worth checking. The easiest way to check (this doesn't have to be
on a node that's having problems; any node that's been serving queries
for a while will do) is to go into the admin
UI>>select_a_core>>plugins/stats>>cache>>fieldCache. Are there any
entries there?
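
If it's easier to check from the command line, something like the
following should show the same information. This is a rough sketch:
the host is a placeholder and I'm borrowing the core name from your
log, so substitute your own:

# fieldCache entries; anything listed here means a field is being
# uninverted onto the heap instead of using docValues
curl "http://localhost:8983/solr/knowledge_shard1_replica_n1/admin/mbeans?cat=CACHE&key=fieldCache&stats=true&wt=json"

# field definitions, to spot docValues="false" on anything you sort,
# group, or facet on (fields that don't set docValues explicitly
# inherit the default from their fieldType)
curl "http://localhost:8983/solr/knowledge_shard1_replica_n1/schema/fields?wt=json"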

This part of your logs is very suspicious:

2018-12-12 19:57:13.730 WARN
(recoveryExecutor-4-thread-3-processing-n:x.x.23.51:8983_solr
x:knowledge_shard1_replica_n1 c:knowledge s:shard1 r:core_node7)
[c:knowledge s:shard1 r:core_node7 x:knowledge_shard1_replica_n1]
o.a.s.c.ZkController Unable to read
/collections/knowledge/leader_initiated_recovery/shard1/core_node7 due
to: org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for
/collections/knowledge/leader_initiated_recovery/shard1/core_node7

Solr 7.5 uses ZK as part of knowing whether shards are up to date
(there was a presentation at Activate by Dat Cao Manh about the
details), so if your ZK connection is timing out, that may be the
underlying cause. That said, with a 60-second timeout it would be odd
for this to be just a timeout issue, so that's puzzling.
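
The classic cause of session expirations is a GC pause longer than the
ZK session timeout. A 60-second pause would be enormous, but since you
have the GC logs handy it's cheap to rule out. Here's a rough sketch,
assuming the JDK 9 unified -Xlog:gc* format where each completed pause
line ends with the pause duration in ms; the 10-second threshold is
arbitrary, what matters is whether anything approaches your 60-second
timeout:

# print GC pause lines whose duration exceeds 10,000 ms
grep "Pause" solr_gc.log.0 | awk '{ms=$NF; sub(/ms$/,"",ms); if (ms+0 > 10000) print}'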

Have you done anything odd with ZooKeeper, like replacing nodes? ZK
had an issue at one point where resolving the DNS names of
reconfigured ZK nodes didn't work well. If you've restarted all your
instances (ZK first, of course), that wouldn't be relevant.
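
One cheap sanity check is to ask each ZK node directly whether it's
healthy and part of the quorum, e.g. with the four-letter-word
commands (a sketch, assuming they aren't disabled; the hostnames are
placeholders for your three ZK nodes):

# shows each node's mode (leader/follower) and confirms that the names
# Solr is configured with actually resolve and respond
for zk in zk1.example.com zk2.example.com zk3.example.com; do
  echo "== $zk"; echo srvr | nc $zk 2181
done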

Just a few random thoughts...

Best,
Erick
On Wed, Dec 12, 2018 at 2:09 PM Shamik Bandopadhyay <sham...@gmail.com> wrote:
>
> Hi,
>
>   We recently upgraded Solr from 6.5 to 7.5. We are observing some weird
> issues in terms of recovery and memory usage. We have a cluster of 6 physical
> nodes with 2 shards, each with two replicas. 7.5 seems to have higher memory
> consumption, with average heap utilization hovering around 18 gb. A couple of
> days back, one of the replicas went down as the heap (30 gb) was exhausted.
> Upon restart, the instance came back quickly, but that started a spiral
> effect where the nodes in the cluster kept going down one after the other.
> So at any point in time, there were 5 instances available instead of 6.
> Every time we brought the bad instance back up, it would be functional
> immediately, but the shard it was recovering from would eventually (within
> minutes) go down. This cycle (of restarting instances) went on for almost an
> hour before all the nodes were finally active. It occurred again today and
> we are observing similar behavior. We even stopped the indexing pipeline to
> make sure that recovery is minimal, but it didn't make any difference; the
> error is consistent: the affected node goes into recovery (not sure why) and
> encounters a session timeout in the process.
>
> 2018-12-12 19:59:16.813 ERROR
> (recoveryExecutor-4-thread-3-processing-n:x.x.23.51:8983_solr
> x:knowledge_shard1_replica_n1 c:knowledge s:shard1 r:core_node7)
> [c:knowledge s:shard1 r:core_node7 x:knowledge_shard1_replica_n1]
> o.a.s.c.RecoveryStrategy Error while trying to recover.
> core=knowledge_shard1_replica_n1:org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer/queue/qn-
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>  at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:786)
>  at
> org.apache.solr.common.cloud.SolrZkClient.lambda$create$7(SolrZkClient.java:398)
>  at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>  at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:398)
>  at
> org.apache.solr.cloud.ZkDistributedQueue.offer(ZkDistributedQueue.java:321)
>  at org.apache.solr.cloud.ZkController.publish(ZkController.java:1548)
>  at org.apache.solr.cloud.ZkController.publish(ZkController.java:1436)
>  at
> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:549)
>  at
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:310)
>  at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:294)
>  at
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>  at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
>  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>  at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
>  at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
>  at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
>  at java.base/java.lang.Thread.run(Thread.java:844)
>
> I've added Dropbox links to the relevant errors from the Solr log, the Solr
> gc logs, a top screenshot, and heap usage from the SPM monitor.
>
> Solr log --> https://www.dropbox.com/s/als26je0dgrp10r/solr.log?dl=0
>
> Solr gc log 1 -->
> https://www.dropbox.com/s/m0ikb3kc9enme4f/solr_gc.log.0?dl=0
>
> Solr gc log 2 -->
> https://www.dropbox.com/s/jfs7wcjyult5ud8/solr_gc.log.1?dl=0
>
> Top --> https://www.dropbox.com/s/x6f0zwqfbabybd0/top.png?dl=0
>
> SPM monitor screenshots:
>
> JVM pool --> https://www.dropbox.com/s/nbko83eflp8y2tp/jvm_pool.png?dl=0
>
> GC screenshot --> https://www.dropbox.com/s/6zofkvgfknxwjgd/gc.png?dl=0
>
> Solr Cache --> https://www.dropbox.com/s/o6zsxwal6pzspve/cache.png?dl=0
>
> Here are the relevant entries from solr startup script:
>
> SOLR_JAVA_MEM="-Xms40g -Xmx40g"
>
> GC_LOG_OPTS='-Xlog:gc*'
>
> GC_TUNE="-XX:+UseG1GC \
> -XX:+PerfDisableSharedMem \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=250 \
> -XX:InitiatingHeapOccupancyPercent=75 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts \
> -XX:OnOutOfMemoryError=/mnt/ebs2/solrhome/bin/oom_solr.sh"
>
> ZK_CLIENT_TIMEOUT="60000"
>
> SOLR_WAIT_FOR_ZK="180"
>
> SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=120000"
> SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=600000"
> SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"
> SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=120000"
>
> All our nodes are running on AWS with 16 vCPUs and 128 gb of RAM. We started
> with a 30 gb heap as the average utilization was between 18-19 gb. For this
> recovery issue, we tried bumping it up to 40 gb, but it didn't make any
> difference. We are using jdk 9.0.4+11 with 6 Solr nodes and a 3-node
> zookeeper (3.4.10) quorum.
>
> Our index has close to 10 million documents with a 55 gb index size. Not
> sure if it's relevant, but we have noticed that the filter cache hit ratio
> has dropped drastically (0.17) while the document cache has gone up (0.61).
> They used to be 0.9 and 0.3 respectively in Solr 6.5.
>
> Not sure what we are missing here in terms of the Solr upgrade to 7.5. I can
> provide any other relevant information.
>
> Thanks,
> Shamik
