On 1/26/2015 9:34 PM, Vijay Sekhri wrote:
> Hi Shawn, Erick
> So it turned out that once we increased our indexing rate to the original
> full indexing rate the replicas went back into recovery no matter what the
> zk timeout setting was. Initially we thought that increasing the timeout was
> helping but apparently not. We had just decreased the indexing rate and that
> caused fewer replicas to go into recovery. Once we were at our full indexing
> rate almost all replicas went into recovery no matter what the zk timeout or
> the ticktime settings were. We reverted the ticktime to the original 2 seconds
>
> So we investigated further and after checking the logs we found this
> exception happening right before the recovery process is initiated. We
> observed this on two different replicas that went into recovery. We are not
> sure if this is a coincidence or a real problem. Note that we were also
> putting some search query load on while indexing to trigger the recovery
> behavior
<snip>

> 22:00:40,861 ERROR [org.apache.solr.core.SolrCore]
> (http-/10.235.46.36:8580-32)
> ClientAbortException: * java.io.IOException: JBWEB002020: Invalid chunk
> header*

One possibility that my searches on that exception turned up is that
this is some kind of a problem in the servlet container. The information
I can see suggests it may be a bug in JBoss, with the underlying cause
being changes in newer releases of Java 7. Your stacktraces do seem to
mention jboss classes, so that seems likely.

The reason that we only recommend running under the Jetty that comes
with Solr, which has a tuned config, is that it's the only servlet
container that actually gets tested.

https://bugzilla.redhat.com/show_bug.cgi?id=1104273
https://bugzilla.redhat.com/show_bug.cgi?id=1154028

I can't really verify any other possibility.

Thanks,
Shawn