We ran into this during our indexing process running on 4.10.3. After
increasing ZooKeeper timeouts, client timeouts, and socket timeouts, and
implementing retry logic in our loading process, the change that finally
worked was adjusting the hard commit interval. We were performing a hard
commit every 5 minutes, and after a couple of hours of loading data some
of the shards would start going down because they would time out with
ZooKeeper and/or close connections. Changing the timeouts just moved the
problem later in the ingest process.

Through a combination of decreasing the hard commit interval to 15 seconds
and migrating to the G1 garbage collector, we are able to prevent ingest
failures. In our case, the periodic stop-the-world garbage collections were
causing connections to be closed, along with other nasty things such as
ZooKeeper timeouts that would cause recovery to kick in. (Soft commits are
turned off until the full ingest/baseline completes.) I believe that until
a hard commit is issued, Solr keeps the data in memory, which explains why
we were experiencing such nasty garbage collections.
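For anyone wanting to try the same thing, a minimal sketch of what that
commit policy might look like in solrconfig.xml (the 15000 ms value matches
the interval above; openSearcher=false and the -1 soft-commit value are
assumptions about how you would disable searcher reopening during a bulk
load, not a copy of our actual config):

```xml
<!-- Sketch: hard commit every 15 s; openSearcher=false flushes the
     transaction log and RAM buffer without reopening searchers. -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commits disabled during the bulk ingest (maxTime of -1
     turns autoSoftCommit off); re-enable once the baseline completes. -->
<autoSoftCommit>
  <maxTime>-1</maxTime>
</autoSoftCommit>
```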

The other change we made, which may have helped, was ensuring that the
socket timeouts were in sync between the Jetty instance running Solr and
the SolrJ client loading the data. During some of our batch updates Solr
would take a couple of minutes to respond, and I believe that in some
instances the socket was being closed on the server side (the maxIdleTime
setting in Jetty).
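As an illustration (the 600000 ms value is a made-up example, not our
production setting), the server side of that pairing is the connector's
maxIdleTime in Jetty's etc/jetty.xml:

```xml
<!-- Sketch: raise the connector idle timeout so long-running batch
     updates aren't cut off mid-response. Value is illustrative only. -->
<Set name="maxIdleTime">600000</Set>
```

On the client side, the matching knob in SolrJ 4.x is the socket timeout
on the client object (setSoTimeout), which you would want to set to at
least the same value so neither end gives up before the other.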

Hope this helps,
Jaime Spicciati



On Tue, Apr 14, 2015 at 9:26 AM, vsilgalis <vsilga...@gmail.com> wrote:

> Right now index size is about 10GB on each shard (yes I could use more
> RAM),
> but I'm looking more for a step up then step down approach.  I will try
> adding more RAM to these machines as my next step.
>
> 1. Zookeeper is external to these boxes in a three node cluster with more
> than enough RAM to keep everything off disk.
>
> 2. os disk cache, when I add more RAM I will just add it as RAM for the
> machine and not to the Java Heap unless that is something you recommend.
>
> 3. java heap looks good so far, GC is minimal as far as i can tell but I
> can
> look into this some more.
>
> 4. we do have 2 cores per machine, but the second core is a joke (10MB)
>
> note: zkClientTimeout is set to 30 for safety's sake.
>
> java settings:
>
> -XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts
> -XX:+ParallelRefProcEnabled -XX:+CMSParallelRemarkEnabled
> -XX:CMSMaxAbortablePrecleanTime=6000 -XX:CMSTriggerPermRatio=80
> -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:CMSFullGCsBeforeCompaction=1 -XX:PretenureSizeThreshold=64m
> -XX:+CMSScavengeBeforeRemark -XX:ParallelGCThreads=4 -XX:ConcGCThreads=4
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxTenuringThreshold=8
> -XX:TargetSurvivorRatio=90 -XX:SurvivorRatio=4 -XX:NewRatio=3
> -XX:-UseSuperWord -Xmx5588m -Xms1596m
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Java-net-socketexception-broken-pipe-Solr-4-10-2-tp4199484p4199561.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
