Thank you Shawn! Yes - it is now a Solr 5.1.0 cloud on 27 nodes, and we use the startup scripts. The current index size is 3.0T - about 115G per node. The index is stored in HDFS, which is spread across those same 27 nodes and (a guess) roughly 256 spindles. Each node has 26G of HDFS block cache (MaxDirectMemorySize) allocated to Solr. Zookeeper storage is on local disk. Solr and HDFS run on the same machines. Each node is connected to its switch over 1G Ethernet, but the backplane is 40G. Do you think the CLUSTERSTATUS and zookeeper timeouts are related to performance issues talking to HDFS?
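
In case it's useful, here is roughly how I can time a CLUSTERSTATUS call against one of the nodes (host and port taken from the parameters below - just a quick sketch, not anything official):

    # Time a single CLUSTERSTATUS request against this node
    curl -s -o /dev/null -w "CLUSTERSTATUS took %{time_total}s\n" \
      "http://helios:9100/solr/admin/collections?action=CLUSTERSTATUS&wt=json"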

The JVM parameters are:
-----------------------------------------
-DSTOP.KEY=solrrocks
-DSTOP.PORT=8100
-Dhost=helios
-Djava.net.preferIPv4Stack=true
-Djetty.port=9100
-DnumShards=27
-Dsolr.clustering.enabled=true
-Dsolr.install.dir=/opt/solr
-Dsolr.lock.type=hdfs
-Dsolr.solr.home=/opt/solr/server/solr
-Duser.timezone=UTC
-DzkClientTimeout=15000
-DzkHost=eris.querymasters.com:2181,daphnis.querymasters.com:2181,triton.querymasters.com:2181,oberon.querymasters.com:2181,portia.querymasters.com:2181,puck.querymasters.com:2181/solr5
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseLargePages
-XX:+UseParNewGC
-XX:CMSFullGCsBeforeCompaction=1
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:CMSTriggerPermRatio=80
-XX:ConcGCThreads=8
-XX:MaxDirectMemorySize=26g
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 9100 /opt/solr/server/logs
-XX:ParallelGCThreads=8
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-Xloggc:/opt/solr/server/logs/solr_gc.log
-Xms8g
-Xmx16g
-Xss256k
-verbose:gc
--------------------
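
Since -XX:+PrintGCApplicationStoppedTime is enabled, here is a rough way I scan the GC log for stop-the-world pauses that get anywhere near the 15 second zkClientTimeout (just a sketch - the exact line format varies a bit by JVM version, so it looks for the field right before "seconds"):

    # Print any "application threads were stopped" line with a pause over 5 seconds
    awk '/Total time for which application threads were stopped/ {
        for (i = 1; i <= NF; i++)
            if ($i == "seconds" || $i == "seconds,") {
                if ($(i-1) + 0 > 5.0) print
                break
            }
    }' /opt/solr/server/logs/solr_gc.log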

The directoryFactory is configured as follows:

<directoryFactory name="DirectoryFactory"
                  class="solr.HdfsDirectoryFactory">
    <bool name="solr.hdfs.blockcache.enabled">true</bool>
    <int name="solr.hdfs.blockcache.slab.count">200</int>
    <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
    <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
    <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
    <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
    <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
    <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">64</int>
    <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">512</int>
    <str name="solr.hdfs.home">hdfs://nameservice1:8020/solr5</str>
    <str name="solr.hdfs.confdir">/etc/hadoop/conf.cloudera.hdfs1</str>
</directoryFactory>
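
For what it's worth, my understanding is that the off-heap block cache works out to slab.count x blocksperbank x the cache block size (8 KB by default, if I have that right), which is why MaxDirectMemorySize is set to 26g - the cache itself should come in just under it:

    # Rough block cache sizing (assumes the default 8 KB cache block size)
    # 200 slabs * 16384 blocks/slab * 8192 bytes = 26,843,545,600 bytes (~25 GiB)
    echo "$((200 * 16384 * 8192)) bytes"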

-Joe

On 6/5/2015 9:34 AM, Shawn Heisey wrote:
On 6/3/2015 6:39 PM, Joseph Obernberger wrote:
Hi All - I've run into a problem where every once in a while one or more
of the shards (27 shard cluster) will lose connection to zookeeper and
report "updates are disabled".  In addition to the CLUSTERSTATUS
timeout errors, which don't seem to cause any issue, this one certainly
does as that shard no longer takes any (you guessed it!) updates!
We are using Zookeeper with 7 nodes (7 servers in our quorum).
The stack trace is:
Other messages you have sent talk about Solr 5.x, and one of them
mentions a 16-node cluster with a 2.9 terabyte index, with the index
data stored on HDFS.

I'm going to venture a guess that you don't have anywhere near enough
RAM for proper disk caching, leading to general performance issues,
which ultimately cause timeouts.  With HDFS, I'm not sure whether OS
disk cache on the Solr server matters very much, or whether that needs
to be on the HDFS servers.  I would guess the latter.  Also, if your
storage networking is gigabit or slower, HDFS may have significantly
more latency than local storage.  For good network storage speed, you
want 10gig ethernet or Infiniband.

If it's Solr 5.x and you are using the included startup scripts, then
long GC pauses are probably not a major issue.  The startup scripts
include significant GC tuning. If you have deployed in your own
container, GC tuning might be an issue -- it is definitely required.

Here is where I have written down everything I've learned about Solr
performance problems, most of which are due to one problem or another
with memory:

https://wiki.apache.org/solr/SolrPerformanceProblems

Is your zookeeper database on local storage or HDFS?  I would suggest
keeping that on local storage for optimal performance.

Thanks,
Shawn

