On 11/18/2016 6:00 AM, Sebastian Riemer wrote:
> I am looking to improve indexing speed when loading many documents as
> part of an import. I am using the SolrJ-Client and currently I add
> the documents one-by-one using HttpSolrClient and its method
> add(SolrInputDocument doc, int commitWithinMs).

If you batch them (probably around 500 to 1000 at a time), indexing
speed will go up.  The add methods you describe below are the ones to
use for batching.
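In case it helps, here is a rough, untested sketch of that kind of
batching with plain HttpSolrClient.  The core URL, batch size,
commitWithin value, and the buildDocs() helper are all placeholders,
not anything from your actual setup:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // The URL, batch size, and commitWithin values are placeholders.
    SolrClient client =
        new HttpSolrClient("http://localhost:8983/solr/mycore");
    int batchSize = 1000;
    int commitWithinMs = 60000;

    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (SolrInputDocument doc : buildDocs()) {
      batch.add(doc);
      if (batch.size() >= batchSize) {
        // One HTTP request per batch instead of one per document.
        client.add(batch, commitWithinMs);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      client.add(batch, commitWithinMs);
    }
    client.close();
  }

  // Hypothetical stand-in for wherever your import gets its documents.
  private static List<SolrInputDocument> buildDocs() {
    List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < 5000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i));
      docs.add(doc);
    }
    return docs;
  }
}

Because commitWithinMs is passed on every add, there is no need for
explicit commits; Solr will make the documents visible within that
window.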
> My first step would be to change that to use
> add(Collection<SolrInputDocument> docs, int commitWithinMs) instead,
> which I expect would already improve performance.
> Does it matter which method I use? Beside the method taking a
> Collection<SolrInputDocument> there is also one that takes an
> Iterator<SolrInputDocument> ... and what about
> ConcurrentUpdateSolrClient? Should I use it for bulk indexing instead
> of HttpSolrClient?
>
> Currently we are on version 5.5.0 of solr, and we don't run
> SolrCloud, i.e. only one instance etc.
> Indexing 39657 documents (which result in a core size of appr. 127MB)
> took about 10 minutes with the one-by-one approach.

The concurrent client will send updates in parallel, without any
threading code in your own program, but it has one glaring
disadvantage: indexing failures will be logged (via SLF4J), but your
program will NOT be informed about them.  The entire Solr cluster
could be down, and all your indexing requests would still appear to
succeed from your program's point of view.  Here's an issue I filed on
the problem.  It hasn't been fixed because there really isn't a good
solution:

https://issues.apache.org/jira/browse/SOLR-3284

The concurrent client swallows all exceptions that occur during add()
operations, because they are carried out in the background.  This
might also happen during delete operations, though I am not sure about
that.  You won't know about any problems unless they are still present
when your program tries an operation that can't happen in the
background, like a commit or a query.  If you're relying on automatic
commits, your indexing program might NEVER become aware of problems on
the server end.

In a nutshell: the concurrent client is great for initial bulk loading
(if and only if you don't need error detection), but not all that
useful for ongoing update activity that runs all the time.

If you set up multiple indexing threads in your own program, you can
use HttpSolrClient or CloudSolrClient with concurrency comparable to
the concurrent client, without sacrificing the ability to detect
errors during indexing.  There is a bare-bones sketch of this after my
signature.

Indexing 40K documents in batches should take very little time, and in
my opinion it does not justify the disadvantages of the concurrent
client or the effort of writing multi-threaded code.  If you reach the
point where you've got millions of documents, then you might want to
consider writing multi-threaded indexing code.

Thanks,
Shawn
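P.S. In case it's useful, here is a bare-bones, untested sketch of the
multi-threaded approach mentioned above.  The URL, thread count, and
the makeBatches() helper are placeholders; makeBatches() stands in for
however your import splits its documents into batches:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ThreadedIndexer {
  public static void main(String[] args) throws Exception {
    // HttpSolrClient is thread-safe, so a single instance can be
    // shared by all of the worker threads.
    final SolrClient client =
        new HttpSolrClient("http://localhost:8983/solr/mycore");
    ExecutorService pool = Executors.newFixedThreadPool(4);

    for (final List<SolrInputDocument> batch : makeBatches()) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            client.add(batch, 60000);
          } catch (Exception e) {
            // Unlike the concurrent client, the failure surfaces
            // here, in your own code.  Log it, retry it, or abort
            // the import.
            e.printStackTrace();
          }
        }
      });
    }

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    client.close();
  }

  // Hypothetical stand-in for however your import produces batches
  // of 500 to 1000 documents each.
  private static List<List<SolrInputDocument>> makeBatches() {
    List<List<SolrInputDocument>> batches =
        new ArrayList<List<SolrInputDocument>>();
    for (int b = 0; b < 40; b++) {
      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 1000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", b + "-" + i);
        batch.add(doc);
      }
      batches.add(batch);
    }
    return batches;
  }
}

Because each add() runs in one of your own worker threads, a failure
surfaces as an exception in your code instead of disappearing into a
background log.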