On 5/22/2013 9:08 AM, Justin Babuscio wrote:
We periodically rebuild our Solr index from scratch. We have built a
custom publisher that horizontally scales to increase write throughput. On
a given rebuild, we will have ~60 JVMs running with 5 threads that are
actively publishing to all Solr masters.
For each thread, we instantiate one StreamingUpdateSolrServer(
QueueSize:100, QueueThreadSize: 2 ) for each master = 20 servers/thread.
Looking over all your details, the first thing I would try is reducing
maxFieldLength (in solrconfig.xml) to something slightly below
Integer.MAX_VALUE. Try setting it to 2 billion, or even something more
modest, in the millions. It's theoretically possible that using
Integer.MAX_VALUE is leading to an integer overflow somewhere. I've
been looking for evidence of this, but nothing has turned up yet.
There MIGHT be bugs in the Apache Commons libraries that SolrJ uses.
The next thing I would try is upgrading those component jars in your
application's classpath - httpclient, commons-io, commons-codec, etc.
Upgrading to a newer SolrJ version is also a good idea. Your notes
imply that you are using the default XML request writer in SolrJ. If
that's true, you should be able to use SolrJ 4.3 even with an older
Solr version, which would give you a server object based on
HttpComponents 4.x, whereas your current objects are based on
HttpClient 3.x. You would need to make some adjustments in your source
code. If you're not using the default XML request writer, you can get
a similar change by upgrading to SolrJ 3.6.2.
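To make that concrete, here's a rough, untested sketch of creating a
SolrJ 4.3 server object that still talks XML to an older Solr master.
The factory class name and URL are just placeholders, not anything
from your setup:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;

public class SolrServerFactory {
    // Hypothetical helper: builds a SolrJ 4.3 HttpSolrServer that stays
    // compatible with an older (3.x) Solr master.
    public static HttpSolrServer create(String masterUrl) {
        HttpSolrServer server = new HttpSolrServer(masterUrl);
        // The default request writer already sends XML updates; forcing
        // the XML response parser avoids javabin version mismatches
        // between a 4.x client and a 3.x server.
        server.setParser(new XMLResponseParser());
        return server;
    }
}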
IMHO you should switch to HttpSolrServer (CommonsHttpSolrServer in SolrJ
3.5 and earlier). StreamingUpdateSolrServer (and its replacement in 3.6
and later, named ConcurrentUpdateSolrServer) has one glaring problem -
it never informs the calling application about any errors that it
encounters during indexing. It lies to you, telling you that
everything succeeded even when it hasn't.
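To illustrate the difference, here's a short sketch (the URL and field
values are made up) of how an HttpSolrServer-based publisher actually
sees an indexing error:

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AddWithErrorHandling {
    public static void main(String[] args) {
        // Placeholder URL for one of your masters.
        HttpSolrServer server = new HttpSolrServer("http://master1:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        try {
            server.add(doc);   // throws if Solr or the HTTP layer fails
            server.commit();
        } catch (SolrServerException e) {
            // SUSS/CUSS would swallow this; here you can log, retry, or abort.
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}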
The one advantage that SUSS/CUSS has over its Http sibling is that it is
multi-threaded, so it can send updates concurrently. You seem to know
enough about how it works, so I'll just say that you don't need an
extra layer of complexity that is outside your control and that
refuses to throw exceptions when errors occur. You already have a
large-scale, concurrent, multi-threaded indexing setup, so SolrJ's
additional thread handling doesn't really buy you much.
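If it helps, this is the kind of per-thread publisher I have in mind:
a plain HttpSolrServer per master, with your own threads providing the
concurrency. The URL, batch size, and document IDs are placeholders
only:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MasterPublisher implements Runnable {
    // One plain server object per master, per thread.
    private final HttpSolrServer server =
            new HttpSolrServer("http://solr-master-1:8983/solr");

    public void run() {
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100; i++) {   // illustrative batch of 100 docs
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Thread.currentThread().getName() + "-" + i);
            batch.add(doc);
        }
        try {
            server.add(batch);   // one synchronous, batched update request
        } catch (SolrServerException e) {
            e.printStackTrace(); // failure is visible; retry or abort here
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}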
Thanks,
Shawn