Thanks Erick for pointing out that the memory changes in a sawtooth pattern. The problem that troubles me is that the bottom of the sawtooth keeps rising. Once the used capacity of the old generation exceeds the threshold set by CMS's CMSInitiatingOccupancyFraction, GC runs continuously and burns a lot of CPU cycles, yet the used old-generation memory does not decrease.
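In case it helps with diagnosing, here is a minimal way to watch that rising floor from the shell (just a sketch: it assumes the standard JDK 7 tools are on the path on the Solr node, and <pid> is the Solr process id; the output file name is only an example):

    jstat -gcold <pid> 60000            # sample the old generation once a minute; OU at the bottom of each CMS cycle is the "floor"
    jmap -histo:live <pid> > histo.txt  # live-object class histogram (forces a full GC); diffing two snapshots taken hours apart shows which classes grow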
Following Rahul's advice, I decreased Xms and Xmx from 16G to 8G, and changed the JVM parameters from

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
-XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled

to the set taken from bin/solr.in.sh:

-XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4
-XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000
-XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:-CMSConcurrentMTEnabled

I hoped this would reduce GC pause times and the number of full GCs, and that, with luck, the growing-memory problem would disappear. After several days of running, the memory on one of my two servers climbed to 90% again… (When Solr is freshly started, it uses less than 1G.)

Following is the output of jstat -gccause -h5 <pid> 1000:

S0     S1     E      O      P      YGC    YGCT      FGC    FGCT      GCT       LGCC                 GCC
9.56   0.00   8.65   91.31  65.89  69379  3076.096  16563  1579.639  4655.735  Allocation Failure   No GC
9.56   0.00   51.10  91.31  65.89  69379  3076.096  16563  1579.639  4655.735  Allocation Failure   No GC
0.00   9.23   10.23  91.35  65.89  69380  3076.135  16563  1579.639  4655.774  Allocation Failure   No GC
7.90   0.00   9.74   91.39  65.89  69381  3076.165  16564  1579.683  4655.848  CMS Final Remark     No GC
7.90   0.00   67.45  91.39  65.89  69381  3076.165  16564  1579.683  4655.848  CMS Final Remark     No GC
S0     S1     E      O      P      YGC    YGCT      FGC    FGCT      GCT       LGCC                 GCC
0.00   7.48   16.18  91.41  65.89  69382  3076.200  16565  1579.707  4655.908  CMS Initial Mark     No GC
0.00   7.48   73.77  91.41  65.89  69382  3076.200  16565  1579.707  4655.908  CMS Initial Mark     No GC
8.61   0.00   29.86  91.45  65.89  69383  3076.228  16565  1579.707  4655.936  Allocation Failure   No GC
8.61   0.00   90.16  91.45  65.89  69383  3076.228  16565  1579.707  4655.936  Allocation Failure   No GC
0.00   7.46   47.89  91.46  65.89  69384  3076.258  16565  1579.707  4655.966  Allocation Failure   No GC
S0     S1     E      O      P      YGC    YGCT      FGC    FGCT      GCT       LGCC                 GCC
8.67   0.00   11.98  91.49  65.89  69385  3076.287  16565  1579.707  4655.995  Allocation Failure   No GC
0.00   11.76  9.24   91.54  65.89  69386  3076.321  16566  1579.759  4656.081  CMS Final Remark     No GC
0.00   11.76  64.53  91.54  65.89  69386  3076.321  16566  1579.759  4656.081  CMS Final Remark     No GC
7.25   0.00   20.39  91.57  65.89  69387  3076.358  16567  1579.786  4656.144  CMS Initial Mark     No GC
7.25   0.00   81.56  91.57  65.89  69387  3076.358  16567  1579.786  4656.144  CMS Initial Mark     No GC
S0     S1     E      O      P      YGC    YGCT      FGC    FGCT      GCT       LGCC                 GCC
0.00   8.05   34.42  91.60  65.89  69388  3076.391  16567  1579.786  4656.177  Allocation Failure   No GC
0.00   8.05   84.17  91.60  65.89  69388  3076.391  16567  1579.786  4656.177  Allocation Failure   No GC
8.54   0.00   55.14  91.62  65.89  69389  3076.420  16567  1579.786  4656.205  Allocation Failure   No GC
0.00   7.74   12.42  91.66  65.89  69390  3076.456  16567  1579.786  4656.242  Allocation Failure   No GC
9.60   0.00   11.00  91.70  65.89  69391  3076.492  16568  1579.841  4656.333  CMS Final Remark     No GC
S0     S1     E      O      P      YGC    YGCT      FGC    FGCT      GCT       LGCC                 GCC
9.60   0.00   69.24  91.70  65.89  69391  3076.492  16568  1579.841  4656.333  CMS Final Remark     No GC
0.00   8.70   18.21  91.74  65.89  69392  3076.529  16569  1579.870  4656.400  CMS Initial Mark     No GC
0.00   8.70   61.92  91.74  65.89  69392  3076.529  16569  1579.870  4656.400  CMS Initial Mark     No GC
7.36   0.00   3.49   91.77  65.89  69393  3076.570  16569  1579.870  4656.440  Allocation Failure   No GC
7.36   0.00   42.03  91.77  65.89  69393  3076.570  16569  1579.870  4656.440  Allocation Failure   No GC
S0     S1     E      O      P      YGC    YGCT      FGC    FGCT      GCT       LGCC                 GCC
0.00   9.77   0.00   91.80  65.89  69394  3076.604  16569  1579.870  4656.475  Allocation Failure   No GC
9.08   0.00   9.92   91.82  65.89  69395  3076.632  16570  1579.913  4656.545  CMS Final Remark     No GC
9.08   0.00   58.90  91.82  65.89  69395  3076.632  16570  1579.913  4656.545  CMS Final Remark     No GC
0.00   8.44   16.20  91.86  65.89  69396  3076.664  16571  1579.930  4656.594  CMS Initial Mark     No GC
0.00   8.44   71.95  91.86  65.89  69396  3076.664  16571  1579.930  4656.594  CMS Initial Mark     No GC
S0     S1     E      O      P      YGC    YGCT      FGC    FGCT      GCT       LGCC                 GCC
8.11   0.00   30.59  91.90  65.89  69397  3076.694  16571  1579.930  4656.624  Allocation Failure   No GC
8.11   0.00   93.41  91.90  65.89  69397  3076.694  16571  1579.930  4656.624  Allocation Failure   No GC
0.00   9.77   57.34  91.96  65.89  69398  3076.724  16571  1579.930  4656.654  Allocation Failure   No GC
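(For reference: a quick way to tell whether these back-to-back CMS cycles are actually failing, rather than just running constantly, is to count the failure messages in the GC log. This assumes the new startup still writes the log to -Xloggc:/var/log/solr/gc.log as in the old command line quoted at the bottom of this mail; the matched phrases are the standard HotSpot CMS messages.)

    grep -c "concurrent mode failure" /var/log/solr/gc.log   # CMS lost the race and fell back to a stop-the-world full GC
    grep -c "promotion failed" /var/log/solr/gc.log          # a young GC could not promote objects into the old generation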
Full GC no longer seems able to free any garbage (or is garbage being produced as fast as GC can free it?). On the other hand, the other replica of the collection on the other server (the collection has two replicas) uses only about 40% of the old generation and doesn't trigger nearly as many full GCs.

Following is the output of the Eclipse MAT leak suspects report:

Problem Suspect 1
4,741 instances of "org.apache.lucene.index.SegmentCoreReaders", loaded by
"org.apache.catalina.loader.WebappClassLoader @ 0x67d8ed978" occupy
3,743,067,520 (64.12%) bytes. These instances are referenced from one instance
of "java.lang.Object[]", loaded by "<system class loader>"
Keywords: java.lang.Object[], org.apache.catalina.loader.WebappClassLoader @ 0x67d8ed978, org.apache.lucene.index.SegmentCoreReaders
Details »

Problem Suspect 2
2,815 instances of "org.apache.lucene.index.StandardDirectoryReader", loaded by
"org.apache.catalina.loader.WebappClassLoader @ 0x67d8ed978" occupy
970,614,912 (16.63%) bytes. These instances are referenced from one instance
of "java.lang.Object[]", loaded by "<system class loader>"
Keywords: java.lang.Object[], org.apache.catalina.loader.WebappClassLoader @ 0x67d8ed978, org.apache.lucene.index.StandardDirectoryReader
Details »

Class structure in the above "Details":

java.lang.Thread @XXX
  <Java Local> java.util.ArrayList @XXXX
    elementData java.lang.Object[3141] @XXXX
      org.apache.lucene.search.FieldCache$CacheEntry @XXXX
      org.apache.lucene.search.FieldCache$CacheEntry @XXXX
      org.apache.lucene.search.FieldCache$CacheEntry @XXXX
      … a lot of org.apache.lucene.search.FieldCache$CacheEntry (1205 in Suspect 1, 2785 in Suspect 2)

Is it normal to have this many org.apache.lucene.search.FieldCache$CacheEntry instances? (A sketch of how to check the field cache counts on a running node is in the P.S. at the bottom of this mail.)

Thanks.

> On Dec 16, 2015, at 00:44, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Rahul's comments were spot on. You can gain more confidence that this
> is normal if you try attaching a memory reporting program (jconsole
> is one): you'll see the memory grow for quite a while, then garbage
> collection kicks in and you'll see it drop in a sawtooth pattern.
>
> Best,
> Erick
>
> On Tue, Dec 15, 2015 at 8:19 AM, zhenglingyun <konghuaru...@163.com> wrote:
>> Thank you very much.
>> I will try reducing the heap memory and check whether the memory still
>> keeps increasing.
>>
>>> On Dec 15, 2015, at 19:37, Rahul Ramesh <rr.ii...@gmail.com> wrote:
>>>
>>> You should actually decrease the Solr heap size. Let me explain a bit.
>>>
>>> Solr requires very little heap memory for its own operation and more memory
>>> for keeping data in main memory. This is because Solr uses mmap for the
>>> index files.
>>> Please check the link
>>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>> to understand how Solr operates on files.
>>>
>>> Solr has a typical garbage collection problem once you set the heap size to
>>> a large value: it will have indeterminate pauses due to GC. The amount of
>>> heap memory required is difficult to tell. However, the way we tuned this
>>> parameter was to set it to a low value and increase it by 1GB whenever an
>>> OOM was thrown.
>>>
>>> Please check the problems of having a large Java heap:
>>>
>>> http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
>>>
>>>
>>> Just for your reference, in our production setup we have around 60GB of
>>> data per node spread across 25 collections. We have configured 8GB as heap
>>> and leave the rest of the memory for the OS to manage. We do around 1000
>>> (searches + inserts)/second on the data.
>>>
>>> I hope this helps.
>>>
>>> Regards,
>>> Rahul
>>>
>>>
>>>
>>> On Tue, Dec 15, 2015 at 4:33 PM, zhenglingyun <konghuaru...@163.com> wrote:
>>>
>>>> Hi, list
>>>>
>>>> I'm new to Solr. Recently I encountered a "memory leak" problem with
>>>> SolrCloud.
>>>>
>>>> I have two 64GB servers running a SolrCloud cluster. In the cluster I have
>>>> one collection with about 400k docs. The index size of the collection is
>>>> about 500MB. Memory for Solr is 16GB.
>>>>
>>>> Following is "ps aux | grep solr":
>>>>
>>>> /usr/java/jdk1.7.0_67-cloudera/bin/java
>>>> -Djava.util.logging.config.file=/var/lib/solr/tomcat-deployment/conf/logging.properties
>>>> -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
>>>> -Djava.net.preferIPv4Stack=true -Dsolr.hdfs.blockcache.enabled=true
>>>> -Dsolr.hdfs.blockcache.direct.memory.allocation=true
>>>> -Dsolr.hdfs.blockcache.blocksperbank=16384
>>>> -Dsolr.hdfs.blockcache.slab.count=1 -Xms16608395264 -Xmx16608395264
>>>> -XX:MaxDirectMemorySize=21590179840 -XX:+UseParNewGC
>>>> -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
>>>> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
>>>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
>>>> -Xloggc:/var/log/solr/gc.log
>>>> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
>>>> -DzkHost=bjzw-datacenter-hadoop-160.d.yourmall.cc:2181,
>>>> bjzw-datacenter-hadoop-163.d.yourmall.cc:2181,
>>>> bjzw-datacenter-hadoop-164.d.yourmall.cc:2181/solr
>>>> -Dsolr.solrxml.location=zookeeper -Dsolr.hdfs.home=hdfs://datacenter/solr
>>>> -Dsolr.hdfs.confdir=/var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/hadoop-conf
>>>> -Dsolr.authentication.simple.anonymous.allowed=true
>>>> -Dsolr.security.proxyuser.hue.hosts=*
>>>> -Dsolr.security.proxyuser.hue.groups=*
>>>> -Dhost=bjzw-datacenter-solr-15.d.yourmall.cc -Djetty.port=8983
>>>> -Dsolr.host=bjzw-datacenter-solr-15.d.yourmall.cc -Dsolr.port=8983
>>>> -Dlog4j.configuration=file:///var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/log4j.properties
>>>> -Dsolr.log=/var/log/solr -Dsolr.admin.port=8984
>>>> -Dsolr.max.connector.thread=10000 -Dsolr.solr.home=/var/lib/solr
>>>> -Djava.net.preferIPv4Stack=true -Dsolr.hdfs.blockcache.enabled=true
>>>> -Dsolr.hdfs.blockcache.direct.memory.allocation=true
>>>> -Dsolr.hdfs.blockcache.blocksperbank=16384
>>>> -Dsolr.hdfs.blockcache.slab.count=1 -Xms16608395264 -Xmx16608395264
>>>> -XX:MaxDirectMemorySize=21590179840 -XX:+UseParNewGC
>>>> -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
>>>> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
>>>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
>>>> -Xloggc:/var/log/solr/gc.log
>>>> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
>>>> -DzkHost=bjzw-datacenter-hadoop-160.d.yourmall.cc:2181,
>>>> bjzw-datacenter-hadoop-163.d.yourmall.cc:2181,
>>>> bjzw-datacenter-hadoop-164.d.yourmall.cc:2181/solr
>>>> -Dsolr.solrxml.location=zookeeper -Dsolr.hdfs.home=hdfs://datacenter/solr
>>>> -Dsolr.hdfs.confdir=/var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/hadoop-conf
>>>> -Dsolr.authentication.simple.anonymous.allowed=true
>>>> -Dsolr.security.proxyuser.hue.hosts=*
>>>> -Dsolr.security.proxyuser.hue.groups=*
>>>> -Dhost=bjzw-datacenter-solr-15.d.yourmall.cc -Djetty.port=8983
>>>> -Dsolr.host=bjzw-datacenter-solr-15.d.yourmall.cc -Dsolr.port=8983
>>>> -Dlog4j.configuration=file:///var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/log4j.properties
>>>> -Dsolr.log=/var/log/solr -Dsolr.admin.port=8984
>>>> -Dsolr.max.connector.thread=10000 -Dsolr.solr.home=/var/lib/solr
>>>> -Djava.endorsed.dirs=/usr/lib/bigtop-tomcat/endorsed -classpath
>>>> /usr/lib/bigtop-tomcat/bin/bootstrap.jar
>>>> -Dcatalina.base=/var/lib/solr/tomcat-deployment
>>>> -Dcatalina.home=/usr/lib/bigtop-tomcat -Djava.io.tmpdir=/var/lib/solr/
>>>> org.apache.catalina.startup.Bootstrap start
>>>>
>>>>
>>>> Solr version is solr4.4.0-cdh5.3.0
>>>> JDK version is 1.7.0_67
>>>>
>>>> Soft commit time is 1.5s, and we have a real-time indexing/partial-updating
>>>> rate of about 100 docs per second.
>>>>
>>>> When freshly started, Solr uses about 500M of memory (the memory shown in
>>>> the Solr UI panel).
>>>> After several days of running, Solr runs into long GC pauses and stops
>>>> responding to user queries.
>>>>
>>>> While Solr is running, the memory it uses keeps increasing to some large
>>>> value, drops to a low level (because of GC), climbs to a larger value
>>>> again, drops to a low level again … and keeps climbing to ever larger
>>>> values … until Solr stops responding and I have to restart it.
>>>>
>>>> I don't know how to solve this problem. Can you give me some advice?
>>>>
>>>> Thanks.
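P.S. In case it is easier than taking another heap dump: if I read the Solr 4.x admin handlers correctly, the field cache entry count (and insanity count) should be visible through the per-core mbeans/stats API, and a live-object histogram should show whether the reader/field-cache classes keep growing. The <collection> name below is a placeholder for ours, and <pid> is the Solr process id.

    curl "http://bjzw-datacenter-solr-15.d.yourmall.cc:8983/solr/<collection>/admin/mbeans?stats=true&cat=CACHE&wt=json&indent=true"
    jmap -histo:live <pid> | grep -E "SegmentCoreReaders|StandardDirectoryReader|FieldCache"   # count live reader / field-cache related objects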