I am porting an existing application from direct use of the Lucene API to Solr with the SolrJ client API.
The problem is that indexing is 2-5x slower with SolrJ + Solr than with the direct Lucene API. I send batches of 200 to 500 documents per call to add() in SolrJ. I tried adjusting the Solr indexing parameters, but it made no difference. The documents are identical (same fields) in both cases, and the tokenizing/analyzing/indexing/storing settings for each field are nearly identical between Lucene and Solr.

I have tried both Solr 3.6 and 4.0 with very similar results. With Solr 4.0 I have transaction logging (for NRT search) turned off. I am also NOT using a unique ID field.

Indexing 200 documents takes around 250 ms with Solr and about 60 ms with Lucene. The response time I measure around the SolrJ add() call and the response time recorded in the Solr log are nearly the same, so there is very little network overhead in this case.

What could be the bottleneck here? Can there be significant overhead in unpacking a batch of documents from the request? Is there some Solr overhead in the update handler? Is this a typical amount of overhead for SolrJ + Solr compared with the local Lucene API?

The reason it matters is that the application needs to rebuild the index once per day, which currently takes about 45 minutes. With SolrJ + Solr it would take several hours, which is a show stopper in this case.

Thanks.
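For reference, here is the rough arithmetic behind the "several hours" estimate. This is only a sketch in plain Java with no Solr dependency. The class and method names are made up for illustration, and it assumes the 200-document batch timings above are representative of the whole rebuild, which may not hold exactly.

```java
public class RebuildEstimate {
    // Slowdown factor: Solr batch time divided by Lucene batch time.
    public static double slowdown(double solrBatchMs, double luceneBatchMs) {
        return solrBatchMs / luceneBatchMs;
    }

    // Projected rebuild time, ASSUMING the whole rebuild slows by the same factor.
    public static double projectedMinutes(double currentMinutes, double factor) {
        return currentMinutes * factor;
    }

    public static void main(String[] args) {
        // Measured values from the post: 250 ms (Solr) vs 60 ms (Lucene) per 200 docs.
        double factor = slowdown(250.0, 60.0);
        // Current daily rebuild with direct Lucene: about 45 minutes.
        double minutes = projectedMinutes(45.0, factor);
        System.out.printf("slowdown: %.1fx%n", factor);
        System.out.printf("projected rebuild: %.1f hours%n", minutes / 60.0);
    }
}
```

With these inputs the factor is a little over 4x, so the 45-minute rebuild would take a bit more than three hours.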