On 5/22/2013 9:08 AM, Justin Babuscio wrote:
We periodically rebuild our Solr index from scratch.  We have built a
custom publisher that horizontally scales to increase write throughput.  On
a given rebuild, we will have ~60 JVMs running with 5 threads that are
actively publishing to all Solr masters.

For each thread, we instantiate one StreamingUpdateSolrServer(
QueueSize:100, QueueThreadSize: 2 ) for each master = 20 servers/thread.

Looking over all your details, the first thing you might try is reducing maxFieldLength to slightly below Integer.MAX_VALUE. Try setting it to 2 billion, or even something more modest, in the millions. It's theoretically possible that the current value is leading to an overflow somewhere. I've been looking for evidence of this, but nothing has turned up yet.

There MIGHT be bugs in the Apache Commons libraries that SolrJ uses. The next thing I would try is upgrading those component jars in your application's classpath - httpclient, commons-io, commons-codec, etc.

Upgrading to a newer SolrJ version is also a good idea. Your notes imply that you are using the default XML request writer in SolrJ. If that's true, you should be able to use a 4.3 SolrJ even with an older Solr version, which would give you a server object based on HttpComponents 4.x, whereas your current objects are based on HttpClient 3.x. You would need to make some adjustments in your source code. If you're not using the default XML request writer, you can get a similar change by upgrading to SolrJ 3.6.2.
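For illustration only, setting up a 4.3-style client for one master might look roughly like this. The URL is a placeholder, and forcing the XML response parser is just a defensive assumption in case the older server's javabin version doesn't match the 4.3 client:

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.impl.XMLResponseParser;

  public class MasterClientFactory {
    public static HttpSolrServer create(String masterUrl) {
      // masterUrl is a placeholder, e.g. "http://master1:8983/solr/core1"
      HttpSolrServer server = new HttpSolrServer(masterUrl);
      // Ask for XML responses so an older server's javabin version
      // can't cause a mismatch with the newer client.
      server.setParser(new XMLResponseParser());
      // The default request writer already sends XML updates, so
      // nothing needs to change on the request side.
      return server;
    }
  }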

IMHO you should switch to HttpSolrServer (CommonsHttpSolrServer in SolrJ 3.5 and earlier). StreamingUpdateSolrServer (and its replacement in 3.6 and later, ConcurrentUpdateSolrServer) has one glaring problem: it never informs the calling application about errors it encounters during indexing. It lies to you and reports that everything succeeded even when it hasn't.
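The practical difference shows up right at the add() call. With HttpSolrServer, a failed update comes back to the calling thread as an exception, so your publisher can log it or requeue the batch. A minimal sketch, with made-up names for the class and method:

  import java.io.IOException;
  import java.util.Collection;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  class Publisher {
    // 'server' and 'docs' would come from your own publisher code.
    void send(HttpSolrServer server, Collection<SolrInputDocument> docs) {
      try {
        server.add(docs);   // throws instead of silently swallowing errors
      } catch (SolrServerException e) {
        // Solr rejected the update -- log it, requeue the batch, etc.
      } catch (IOException e) {
        // Network-level failure, also visible to the caller now.
      }
    }
  }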

The one advantage that SUSS/CUSS has over its Http sibling is that it is multi-threaded, so it can send updates concurrently. You seem to know enough about how it works, so I'll just say this: you don't need additional complexity that is outside your control and that refuses to throw exceptions when an error occurs. You already have a large-scale concurrent and multi-threaded indexing setup, so SolrJ's extra thread handling doesn't really buy you much.
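Since each of your publisher threads already owns its documents, a blocking add() per batch keeps the concurrency you have while leaving error handling in your hands. Roughly like this (the batch size of 100 and the names are just examples):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  class BatchSender {
    // Called from each publisher thread with its own server instance.
    static void publish(HttpSolrServer server, Iterable<SolrInputDocument> docs)
        throws SolrServerException, IOException {
      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (SolrInputDocument doc : docs) {
        batch.add(doc);
        if (batch.size() >= 100) {  // batch size is arbitrary here
          server.add(batch);        // blocks, but failures throw
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        server.add(batch);
      }
    }
  }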

Thanks,
Shawn
