Here are some numbers on batching improvements: https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
And I totally agree with Shawn that for 40K documents anything more
complex is probably overkill.

Best,
Erick

On Fri, Nov 18, 2016 at 6:02 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 11/18/2016 6:00 AM, Sebastian Riemer wrote:
>> I am looking to improve indexing speed when loading many documents as
>> part of an import. I am using the SolrJ client and currently I add the
>> documents one by one using HttpSolrClient and its method
>> add(SolrInputDocument doc, int commitWithinMs).
>
> If you batch them (probably around 500 to 1000 at a time), indexing
> speed will go up. Below you have described the add methods used for
> batching.
>
>> My first step would be to change that to use
>> add(Collection<SolrInputDocument> docs, int commitWithinMs) instead,
>> which I expect would already improve performance. Does it matter which
>> method I use? Besides the method taking a
>> Collection<SolrInputDocument>, there is also one that takes an
>> Iterator<SolrInputDocument> ... and what about
>> ConcurrentUpdateSolrClient? Should I use it for bulk indexing instead
>> of HttpSolrClient?
>>
>> Currently we are on version 5.5.0 of Solr, and we don't run SolrCloud,
>> i.e. only one instance etc. Indexing 39657 documents (which result in
>> a core size of approx. 127 MB) took about 10 minutes with the
>> one-by-one approach.
>
> The concurrent client will send updates in parallel, without any
> threading code in your own program, but there is one glaring
> disadvantage: indexing failures will be logged (via SLF4J), but your
> program will NOT be informed about them, which means that the entire
> Solr cluster could be down and all your indexing requests would still
> appear to succeed from your program's point of view. Here's an issue I
> filed on the problem. It hasn't been fixed because there really isn't
> a good solution.
>
> https://issues.apache.org/jira/browse/SOLR-3284
>
> The concurrent client swallows all exceptions that occur during add()
> operations -- they are conducted in the background. This might also
> happen during delete operations, though I am unsure about that. You
> won't know about any problems unless those problems are still there
> when your program tries an operation that can't happen in the
> background, like commit or query. If you're relying on automatic
> commits, your indexing program might NEVER become aware of problems on
> the server end.
>
> In a nutshell: the concurrent client is great for initial bulk loading
> (if and only if you don't need error detection), but not all that
> useful for ongoing update activity that runs all the time.
>
> If you set up multiple indexing threads in your own program, you can
> use HttpSolrClient or CloudSolrClient with concurrency comparable to
> the concurrent client, without sacrificing the ability to detect
> errors during indexing.
>
> Indexing 40K documents in batches should take very little time, and in
> my opinion is not worth the disadvantages of the concurrent client, or
> the time it takes to write multi-threaded code. If you reach the point
> where you've got millions of documents, then you might want to
> consider writing multi-threaded indexing code.
>
> Thanks,
> Shawn
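
For anyone landing on this thread later, here is a minimal sketch of the
batched approach Shawn recommends, written against the SolrJ 5.5 API.
The core URL, field names, batch size, and commitWithin interval are
placeholders, not values from the thread:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL for a standalone (non-SolrCloud) core.
        try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
            int batchSize = 1000; // Shawn suggests roughly 500 to 1000
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(batchSize);

            for (int i = 0; i < 39657; i++) { // stand-in for the real import source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title_s", "document " + i);
                batch.add(doc);

                if (batch.size() >= batchSize) {
                    // One HTTP request per batch instead of per document.
                    client.add(batch, 60000); // commitWithin 60 seconds
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch, 60000); // flush the final partial batch
            }
            client.commit(); // one explicit commit at the end of the import
        }
    }
}

Note that add(Collection, int) throws SolrServerException/IOException on
failure, so unlike ConcurrentUpdateSolrClient, errors reach the calling
code directly.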
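
And if the corpus grows to the millions of documents where Shawn says
multi-threading starts to pay off, here is a sketch of that alternative
(again with placeholder names, and assuming Java 8 for the lambda).
HttpSolrClient is thread-safe, so one instance can be shared across a
small pool while keeping error detection intact:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ThreadedIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            int batchSize = 1000;

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(batchSize);
            for (int i = 0; i < 1000000; i++) { // stand-in for the real source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                batch.add(doc);

                if (batch.size() >= batchSize) {
                    final List<SolrInputDocument> toSend = batch;
                    batch = new ArrayList<SolrInputDocument>(batchSize);
                    pool.submit(() -> {
                        try {
                            client.add(toSend, 60000);
                        } catch (Exception e) {
                            // Unlike ConcurrentUpdateSolrClient, the failure
                            // surfaces here and the program can react to it.
                            System.err.println("Batch failed: " + e);
                        }
                    });
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch, 60000); // flush the final partial batch
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            client.commit();
        }
    }
}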