On 3/17/2017 3:04 AM, vishal jain wrote:
> I am new to Solr and am trying to move data from my RDBMS to Solr. I know the 
> available options are:
> 1) Post Tool
> 2) DIH
> 3) SolrJ (as ours is a J2EE application).
>
> I want to know what is the recommended way for Data import in production
> environment. Will sending data via SolrJ in batches be faster than posting a 
> csv using POST tool?

I've heard that CSV import runs EXTREMELY fast, but I have never tested
it.  A single CSV upload is still only one stream of data going into
Solr, so the threading limitation that I discuss below applies to
indexing this way too.

DIH is extremely powerful, but it has one glaring problem:  It's
single-threaded, which means that only one stream of data is going into
Solr, and each batch of documents to be inserted must wait for the
previous one to finish inserting before it can start.  I do not know
whether DIH batches documents or sends them one at a time.  If you have a
manually sharded index, you can run DIH on each shard in parallel, but
each one will be single-threaded.  That single thread is pretty
efficient, but it's still only one thread.

Sending multiple index updates to Solr in parallel (multi-threading) is
how you radically speed up the Solr part of indexing.  This is usually
done with a custom indexing program, which might be written with SolrJ
or even in a completely different language.
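
To make that concrete, here is a minimal sketch of the pattern in plain
Java.  It is only an illustration, not a tested indexing program: the
ParallelIndexer class and the BatchSender hook are names I made up, and
the batch size and thread count are whatever you choose.  In a SolrJ
program the hook is where you would call SolrClient.add() with the batch
of documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: send batches of documents from several threads at once. */
class ParallelIndexer {

    /** Hypothetical hook. A SolrJ program would call SolrClient.add(batch) here. */
    interface BatchSender<T> {
        void send(List<T> batch) throws Exception;
    }

    /**
     * Splits docs into batches of batchSize and sends them using a fixed
     * pool of worker threads. Returns the number of documents sent.
     * The docs list must not be modified while this method is running.
     */
    static <T> int indexInParallel(List<T> docs, int batchSize, int threads,
                                   BatchSender<T> sender) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger sent = new AtomicInteger();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                List<T> batch = docs.subList(start, end);
                // Each batch is an independent task, so up to `threads`
                // batches can be in flight at the same time.
                pending.add(pool.submit(() -> {
                    sender.send(batch);
                    sent.addAndGet(batch.size());
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // rethrows the failure if a worker threw
            }
        } finally {
            pool.shutdown();
        }
        return sent.get();
    }
}
```

A call would look something like
ParallelIndexer.indexInParallel(docs, 1000, 4, batch -> client.add(batch)),
where client is your SolrJ client and docs holds SolrInputDocument
objects.  The numbers are placeholders to tune, not recommendations.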

One thing to keep in mind with ANY indexing method:  Once the situation
is examined closely, most people find that it's not Solr that makes
their indexing slow.  The bottleneck is usually the source system -- how
quickly the data can be retrieved.  It usually takes a lot longer to
obtain the data than it does for Solr to index it.
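
If you want to check this on your own system, one rough way is to time
the two phases separately.  The sketch below is plain Java with made-up
names (PhaseTimer, timePhases).  The fetch step stands in for your
database query and the index step for whatever sends the data to Solr.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical helper: measure how long fetching takes versus indexing. */
class PhaseTimer {

    /**
     * Runs fetch, then passes its result to index.
     * Returns { fetchNanos, indexNanos }.
     */
    static <T> long[] timePhases(Supplier<T> fetch, Consumer<T> index) {
        long t0 = System.nanoTime();
        T data = fetch.get();      // e.g. run the query against the source system
        long t1 = System.nanoTime();
        index.accept(data);        // e.g. send the documents to Solr
        long t2 = System.nanoTime();
        return new long[] { t1 - t0, t2 - t1 };
    }
}
```

If the first number dominates, adding more indexing threads will not
help much until the retrieval side gets faster.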

Thanks,
Shawn
