On 11/18/2016 6:00 AM, Sebastian Riemer wrote:
> I am looking to improve indexing speed when loading many documents as part of 
> an import. I am using the SolrJ-Client and currently I add the documents 
> one-by-one using HttpSolrClient and  its method add(SolrInputDocument doc, 
> int commitWithinMs).

If you batch them (somewhere around 500 to 1000 documents at a time is a
good starting point), indexing speed will go up.  The add methods you
describe below are the ones used for batching.
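As a rough sketch of what batching looks like with SolrJ on a standalone 5.x install (the URL, core name, field names, and batch size here are placeholders, not anything from your setup):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    private static final int BATCH_SIZE = 1000;        // 500-1000 is a reasonable start
    private static final int COMMIT_WITHIN_MS = 60000; // let Solr commit within a minute

    public static void main(String[] args) throws Exception {
        // URL and core name are made up -- substitute your own.
        try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/corename")) {
            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
            for (int i = 0; i < 39657; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title_s", "document " + i);
                batch.add(doc);
                if (batch.size() >= BATCH_SIZE) {
                    // One HTTP request per batch instead of one per document.
                    client.add(batch, COMMIT_WITHIN_MS);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch, COMMIT_WITHIN_MS);   // flush the remainder
            }
        }
    }
}
```

The important part is that add(Collection, int) sends the whole batch in a
single request, which is where the speedup comes from.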

> My first step would be to change that to use 
> add(Collection<SolrInputDocument> docs, int commitWithinMs) instead, which I 
> expect would already improve performance.
> Does it matter which method I use? Beside the method taking a 
> Collection<SolrInputDocument> there is also one that takes an 
> Iterator<SolrInputDocument> ... and what about ConcurrentUpdateSolrClient? 
> Should I use it for bulk indexing instead of HttpSolrClient?
>
> Currently we are on version 5.5.0 of solr, and we don't run SolrCloud, i.e. 
> only one instance etc.
> Indexing 39657 documents (which result in a core size of appr. 127MB) took 
> about 10 minutes with the one-by-one approach.

The concurrent client will send updates in parallel, without any
threading code in your own program, but there is one glaring
disadvantage -- indexing failures will be logged (via SLF4J), but your
program will NOT be informed about them, which means that the entire
Solr cluster could be down, and all your indexing requests will still
appear to succeed from your program's point of view.  Here's an issue I
filed on the problem.  It hasn't been fixed because there really isn't a
good solution.

https://issues.apache.org/jira/browse/SOLR-3284

The concurrent client swallows all exceptions that occur during add()
operations -- they are conducted in the background.  This might also
happen during delete operations, though I am unsure about that.  You
won't know about any problems unless those problems are still there when
your program tries an operation that can't happen in the background,
like commit or query.  If you're relying on automatic commits, your
indexing program might NEVER become aware of problems on the server end.

In a nutshell ... the concurrent client is great for initial bulk
loading (if and only if you don't need error detection), but not all
that useful for ongoing update activity that runs all the time.
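For completeness, here is roughly what the concurrent client looks like in
use (the URL, queue size, and thread count are illustrative, not
recommendations):

```java
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentBulkLoad {
    public static void main(String[] args) throws Exception {
        // queueSize 10000 and 4 background threads -- tune for your hardware.
        try (ConcurrentUpdateSolrClient client =
                 new ConcurrentUpdateSolrClient("http://localhost:8983/solr/corename", 10000, 4)) {
            for (int i = 0; i < 39657; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                // add() only queues the document and returns immediately.
                // A failure on the server is logged in the background, not
                // thrown here (SOLR-3284).
                client.add(doc);
            }
            client.blockUntilFinished();  // wait for the queue to drain
            // commit() runs in the foreground, so this is the first place
            // an exception can actually reach your code.
            client.commit();
        }
    }
}
```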

If you set up multiple indexing threads in your own program, you can use
HttpSolrClient or CloudSolrClient and get concurrency similar to what the
concurrent client provides, without sacrificing the ability to detect
errors during indexing.
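A rough sketch of that approach with a fixed thread pool (sizes and field
names are again made up; the key difference is that add() here throws on
failure, so your code actually sees errors):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ThreadedIndexer {
    public static void main(String[] args) throws Exception {
        // HttpSolrClient is thread-safe, so one instance can be shared.
        try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/corename")) {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            int total = 39657;
            int batchSize = 1000;
            for (int start = 0; start < total; start += batchSize) {
                final int from = start;
                final int to = Math.min(start + batchSize, total);
                pool.submit(() -> {
                    List<SolrInputDocument> batch = new ArrayList<>();
                    for (int i = from; i < to; i++) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", Integer.toString(i));
                        batch.add(doc);
                    }
                    try {
                        // Throws on failure -- the error is visible, unlike
                        // with the concurrent client.
                        client.add(batch, 60000);
                    } catch (Exception e) {
                        e.printStackTrace();  // real code would retry or abort
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }
}
```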

Indexing 40K documents in batches should take very little time, and in
my opinion is not worth the disadvantages of the concurrent client, or
taking the time to write multi-threaded code.  If you reach the point
where you've got millions of documents, then you might want to consider
writing multi-threaded indexing code.

Thanks,
Shawn
