I have a project where I am porting an existing application from direct
Lucene API usage to SOLR with the SOLRJ client API.

The problem I have is that indexing is 2-5x slower using SOLRJ+SOLR
than using the Lucene API directly.

I am sending batches of 200 to 500 documents per call to add() using
SOLRJ.
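
A minimal sketch of this kind of batching, for illustration only. The
class name BatchIndexer and the Consumer-based sink are hypothetical
stand-ins (assumptions, not the application's code and not SOLRJ types),
so the snippet runs without a SOLR server; in real code the sink would
wrap the SOLRJ add() call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchIndexer {

    /**
     * Sends docs to the sink in batches of at most batchSize and times each call.
     * The sink is a hypothetical stand-in for a SOLRJ add() call.
     * Returns the number of calls made to the sink.
     */
    public static <T> int indexInBatches(List<T> docs, int batchSize,
                                         Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int calls = 0;
        for (int start = 0; start < docs.size(); start += batchSize) {
            int end = Math.min(start + batchSize, docs.size());
            // Copy the slice so the sink never sees a view of the source list.
            List<T> batch = new ArrayList<>(docs.subList(start, end));
            long t0 = System.nanoTime();
            sink.accept(batch);
            long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
            System.out.println("batch of " + batch.size() + " docs took "
                    + elapsedMs + " ms");
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1050; i++) {
            docs.add("doc-" + i);
        }
        // A no-op sink; real code would send the batch to SOLR here.
        int calls = indexInBatches(docs, 500, batch -> { });
        System.out.println(calls + " calls");
    }
}
```

Timing each call on the client side like this gives a number that can be
compared directly with the request time in the SOLR log.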

I tried adjusting the SOLR indexing parameters, but it did not make any
difference.

Documents are identical (same fields) in both cases.

The settings for tokenizing/analyzing/indexing/storing each field are
nearly identical between Lucene and SOLR.

What could be the bottleneck in this case?  Can there be significant
overhead in unpacking a batch of documents from the request?  Is there
some SOLR overhead in the update handler?

I have tried both SOLR 3.6 and 4.0 with very similar results.

When using SOLR 4.0 I have transaction logging (for NRT search) turned off.

I am also NOT using a unique ID field.

Performance for indexing 200 documents is around 250ms on SOLR, about
60ms on Lucene.

The response time I measure around the SOLRJ add() call and the
response time logged in the SOLR log are nearly the same, so there is
very little network overhead in this case.

Is this a typical amount of overhead for SOLRJ+SOLR vs. the local Lucene API?

The reason it matters is that the application needs to rebuild the
index once per day, which currently takes about 45 minutes.  Using
SOLRJ+SOLR it will take several hours, which is a show stopper in this
case.
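
A rough back-of-the-envelope check of that estimate, as a sketch. It
assumes the rebuild time is dominated by indexing and scales linearly
with the per-batch times quoted above (250ms vs 60ms per 200
documents); the class and method names are hypothetical.

```java
public class RebuildEstimate {

    /** Projected rebuild time if it scales by the ratio of per-batch times. */
    public static double projectedMinutes(double currentMinutes,
                                          double solrBatchMs,
                                          double luceneBatchMs) {
        return currentMinutes * solrBatchMs / luceneBatchMs;
    }

    public static void main(String[] args) {
        // Figures from the measurements above: 45 min rebuild, 250ms vs 60ms.
        double minutes = projectedMinutes(45.0, 250.0, 60.0);
        System.out.printf("projected rebuild: %.1f minutes (%.1f hours)%n",
                minutes, minutes / 60.0);
    }
}
```

Under that assumption the 45 minute rebuild becomes roughly 187.5
minutes, a little over 3 hours, which is in line with the 2-5x slowdown
observed.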

Thanks.
