Here are some numbers for batching improvements:

https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/

And I totally agree with Shawn that for 40K documents anything more
complex is probably overkill.

Best,
Erick

On Fri, Nov 18, 2016 at 6:02 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 11/18/2016 6:00 AM, Sebastian Riemer wrote:
>> I am looking to improve indexing speed when loading many documents as part 
>> of an import. I am using the SolrJ-Client and currently I add the documents 
>> one-by-one using HttpSolrClient and its method add(SolrInputDocument doc, 
>> int commitWithinMs).
>
> If you batch them (probably around 500 to 1000 documents at a time),
> indexing speed will go up.  The add methods you describe below are the
> ones used for batching.
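>
> A rough sketch of what that looks like with SolrJ (the core URL, batch
> size, and field values are made up for illustration; assume a
> surrounding method that declares the checked
> SolrServerException/IOException):
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import org.apache.solr.client.solrj.impl.HttpSolrClient;
>     import org.apache.solr.common.SolrInputDocument;
>
>     HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
>     List<SolrInputDocument> batch = new ArrayList<>();
>     for (int i = 0; i < 40000; i++) {
>         SolrInputDocument doc = new SolrInputDocument();
>         doc.addField("id", Integer.toString(i));
>         batch.add(doc);
>         if (batch.size() == 1000) {
>             client.add(batch, 60000);  // 1000 docs per request, commitWithin 60s
>             batch.clear();
>         }
>     }
>     if (!batch.isEmpty()) {
>         client.add(batch, 60000);      // flush the final partial batch
>     }
>     client.close();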
>
>> My first step would be to change that to use 
>> add(Collection<SolrInputDocument> docs, int commitWithinMs) instead, which I 
>> expect would already improve performance.
>> Does it matter which method I use? Besides the method taking a 
>> Collection<SolrInputDocument> there is also one that takes an 
>> Iterator<SolrInputDocument> ... and what about ConcurrentUpdateSolrClient? 
>> Should I use it for bulk indexing instead of HttpSolrClient?
>>
>> Currently we are on Solr version 5.5.0, and we don't run SolrCloud, i.e. 
>> we have only one instance.
>> Indexing 39,657 documents (which result in a core size of approx. 127MB) 
>> took about 10 minutes with the one-by-one approach.
>
> The concurrent client will send updates in parallel, without any
> threading code in your own program, but there is one glaring
> disadvantage -- indexing failures will be logged (via SLF4J), but your
> program will NOT be informed about them, which means that the entire
> Solr cluster could be down, and all your indexing requests will still
> appear to succeed from your program's point of view.  Here's an issue I
> filed on the problem.  It hasn't been fixed because there really isn't a
> good solution.
>
> https://issues.apache.org/jira/browse/SOLR-3284
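>
> For illustration, roughly what that fire-and-forget behavior looks like
> (the URL, queue size, and thread count are made-up values):
>
>     import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
>     import org.apache.solr.common.SolrInputDocument;
>
>     // queue up to 1000 docs, drain with 4 background threads
>     ConcurrentUpdateSolrClient client =
>         new ConcurrentUpdateSolrClient("http://localhost:8983/solr/mycore", 1000, 4);
>     SolrInputDocument doc = new SolrInputDocument();
>     doc.addField("id", "1");
>     client.add(doc);  // returns immediately; if Solr is down, the failure
>                       // is only logged and this call still "succeeds"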
>
> The concurrent client swallows all exceptions that occur during add()
> operations -- they are conducted in the background.  This might also
> happen during delete operations, though I am unsure about that.  You
> won't know about any problems unless those problems are still there when
> your program tries an operation that can't happen in the background,
> like commit or query.  If you're relying on automatic commits, your
> indexing program might NEVER become aware of problems on the server end.
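>
> Concretely, the first place your code could even catch anything looks
> roughly like this (continuing the sketch above):
>
>     import java.io.IOException;
>     import org.apache.solr.client.solrj.SolrServerException;
>
>     try {
>         client.blockUntilFinished();  // drains the queue, throws nothing
>         client.commit();              // runs in the foreground, so a dead
>                                       // server finally raises an exception
>     } catch (SolrServerException | IOException e) {
>         // your first (and possibly only) chance to notice trouble
>     }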
>
> In a nutshell ... the concurrent client is great for initial bulk
> loading (if and only if you don't need error detection), but not all
> that useful for ongoing update activity that runs all the time.
>
> If you set up multiple indexing threads in your own program, you can use
> HttpSolrClient or CloudSolrClient with concurrency comparable to the
> concurrent client's, without sacrificing the ability to detect errors
> during indexing.
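>
> A sketch of that approach (the thread count and the pre-built "batches"
> list are placeholders for your own import logic):
>
>     import java.util.List;
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.TimeUnit;
>     import org.apache.solr.client.solrj.impl.HttpSolrClient;
>     import org.apache.solr.common.SolrInputDocument;
>
>     // HttpSolrClient is thread-safe, so all threads can share one instance.
>     static void indexAll(HttpSolrClient client,
>                          List<List<SolrInputDocument>> batches)
>             throws InterruptedException {
>         ExecutorService pool = Executors.newFixedThreadPool(4);
>         for (List<SolrInputDocument> batch : batches) {
>             pool.submit(() -> {
>                 try {
>                     client.add(batch, 60000);
>                 } catch (Exception e) {
>                     // unlike the concurrent client, failures surface here,
>                     // where your code can log or retry them
>                 }
>             });
>         }
>         pool.shutdown();
>         pool.awaitTermination(1, TimeUnit.HOURS);
>     }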
>
> Indexing 40K documents in batches should take very little time, and in
> my opinion is not worth the disadvantages of the concurrent client, or
> taking the time to write multi-threaded code.  If you reach the point
> where you've got millions of documents, then you might want to consider
> writing multi-threaded indexing code.
>
> Thanks,
> Shawn
>
