We ran into this during our indexing process running on 4.10.3. After increasing ZooKeeper timeouts, client timeouts, and socket timeouts, and implementing retry logic in our loading process, the thing that finally worked was changing the hard commit timing. We were performing a hard commit every 5 minutes, and after a couple of hours of loading data some of the shards would start going down because they would time out with ZooKeeper and/or close connections. Changing the timeouts just moved the problem later in the ingest process.
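For anyone who wants to make the same change: the hard commit interval is controlled by the <autoCommit> block in solrconfig.xml. A minimal sketch of the setting we changed (15000 ms is what we eventually landed on, more on that below; openSearcher=false is an assumption on my part, but it fits running with soft commits off):

    <autoCommit>
      <maxTime>15000</maxTime>            <!-- hard commit every 15 seconds -->
      <openSearcher>false</openSearcher>  <!-- flush to disk; don't open a new searcher -->
    </autoCommit>

With openSearcher=false the hard commit only makes the data durable; visibility is still governed by soft commits, which we keep off during the baseline load.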
Through a combination of decreasing the hard commit interval to 15 seconds and migrating to the G1 garbage collector, we were able to prevent the ingest failures. For us, the periodic stop-the-world garbage collections were causing connections to be closed, plus other nasty things such as ZooKeeper timeouts that would cause recovery to kick in. (Soft commits are turned off until the full ingest/baseline completes.) I believe that until a hard commit is issued Solr keeps the uncommitted data in memory, which explains why we were seeing such nasty garbage collections.

The other change we made, which may have helped, is that we ensured the socket timeouts were in sync between the Jetty instance running Solr and the SolrJ client loading the data. During some of our batch updates Solr would take a couple of minutes to respond, and I believe that in some of those instances the socket was being closed on the server side (the maxIdleTime setting in Jetty).

Hope this helps.

Thanks,
Jaime Spicciati

On Tue, Apr 14, 2015 at 9:26 AM, vsilgalis <vsilga...@gmail.com> wrote:
> Right now index size is about 10GB on each shard (yes, I could use more
> RAM), but I'm looking more for a step-up than a step-down approach. I will
> try adding more RAM to these machines as my next step.
>
> 1. ZooKeeper is external to these boxes, in a three-node cluster with more
> than enough RAM to keep everything off disk.
>
> 2. OS disk cache: when I add more RAM I will just add it as RAM for the
> machine and not to the Java heap, unless that is something you recommend.
>
> 3. Java heap looks good so far; GC is minimal as far as I can tell, but I
> can look into this some more.
>
> 4. We do have 2 cores per machine, but the second core is a joke (10MB).
>
> Note: zkClientTimeout is set to 30 for safety's sake.
>
> Java settings:
>
> -XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts -XX:+ParallelRefProcEnabled
> -XX:+CMSParallelRemarkEnabled -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:CMSTriggerPermRatio=80 -XX:CMSInitiatingOccupancyFraction=50
> -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSFullGCsBeforeCompaction=1
> -XX:PretenureSizeThreshold=64m -XX:+CMSScavengeBeforeRemark
> -XX:ParallelGCThreads=4 -XX:ConcGCThreads=4 -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC -XX:MaxTenuringThreshold=8 -XX:TargetSurvivorRatio=90
> -XX:SurvivorRatio=4 -XX:NewRatio=3 -XX:-UseSuperWord
> -Xmx5588m -Xms1596m
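P.S. Since you posted your CMS flags: when we migrated to G1, the change boiled down to replacing the CMS/ParNew options with something like the following. These are illustrative, not our exact production values; G1 and CMS are mutually exclusive, so -XX:+UseConcMarkSweepGC and -XX:+UseParNewGC have to come out first:

    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=250
    -XX:+ParallelRefProcEnabled

G1 works from a pause-time goal rather than the occupancy/tenuring tuning CMS needs, so most of the other -XX:CMS* flags above simply go away.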
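One more thing on the timeout sync: in SolrJ 4.x you can set the client-side socket timeout explicitly on HttpSolrServer so it lines up with Jetty's maxIdleTime. A minimal sketch (host, core name, and timeout values are placeholders, not our production settings):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class LoaderTimeouts {
        public static void main(String[] args) {
            // Hypothetical URL; point this at your own core/collection.
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
            server.setConnectionTimeout(15000); // ms allowed to establish the connection
            server.setSoTimeout(300000);        // ms to wait on the socket for a response;
                                                // keep this aligned with Jetty's maxIdleTime
            server.shutdown();
        }
    }

The matching server-side knob in Solr 4.x is the connector's maxIdleTime in etc/jetty.xml (e.g. <Set name="maxIdleTime">300000</Set>). If a batch update can take a couple of minutes, both values need to comfortably exceed that; otherwise one side closes the socket and you get exactly the broken pipe from the subject line.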