I'd seriously consider a SolrJ program rather than posting files. The
post tool is really intended as a simple way to get started; when it
comes to indexing large volumes it's not very efficient.

As a comparison, I index 3-4K docs/second (a Wikipedia dump) on my MacBook Pro.

Note that if each of your business flows has that many documents
(300K files being 0.25% of one flow works out to roughly 120 million
docs per flow), 100+ flows puts you around 12 billion. I hope you're
sharding!

Here's some SolrJ to get you started. Note you'll pretty much throw
out the Tika and RDBMS parts in favor of constructing the
SolrInputDocuments yourself, parsing your data with your favorite
JSON parser (there's a sketch of that after the link).

https://lucidworks.com/2012/02/14/indexing-with-solrj/
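
Something like the following is the shape I mean. This is a minimal
sketch, not your program: the URL, the "businessflow" collection name,
and the assumption that each JSON file holds one flat object whose
keys match your schema fields are all placeholders you'd adapt. It
uses Jackson for the JSON parsing, but any parser works.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class JsonIndexer {
    static final int BATCH_SIZE = 1000; // start here and tune

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
        File[] files = new File(args[0]).listFiles(); // dir of JSON files
        if (files == null) {
            System.err.println("Not a directory: " + args[0]);
            return;
        }
        // Assumed URL and collection name -- change for your setup.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/businessflow").build()) {
            for (File f : files) {
                JsonNode json = mapper.readTree(f);
                SolrInputDocument doc = new SolrInputDocument();
                // Copy each top-level JSON field into the Solr document;
                // assumes flat objects whose keys match schema fields.
                Iterator<String> names = json.fieldNames();
                while (names.hasNext()) {
                    String name = names.next();
                    doc.addField(name, json.get(name).asText());
                }
                batch.add(doc);
                if (batch.size() >= BATCH_SIZE) { // send in batches,
                    client.add(batch);            // never one per request
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch); // flush the final partial batch
            }
            client.commit(); // one commit at the end, not per batch
        }
    }
}

The point is the shape: parse, accumulate a batch, send the batch,
commit once at the end.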

Then you can rack up N of these SolrJ programs (each presumably
working on a separate subset of the data) to get your indexing
throughput up to what you need; a single-JVM sketch of the same idea
follows.
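
If you'd rather run the threads inside one JVM instead of N separate
programs, a fixed thread pool where each worker strides over its own
slice of the files does the same job. Again a sketch only: the pool
size of 8 and the indexOneFile() helper (which would hold the batching
logic from the sketch above) are placeholders.

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        final File[] files = new File(args[0]).listFiles();
        if (files == null) return;
        final int threads = 8; // raise until the Solr server's CPU runs hot
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < threads; i++) {
            final int slice = i;
            pool.submit(() -> {
                // Each worker takes every Nth file, so no file is
                // indexed twice and no coordination is needed.
                for (int j = slice; j < files.length; j += threads) {
                    indexOneFile(files[j]);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    static void indexOneFile(File f) {
        // Placeholder: parse the JSON and batch it to Solr as in the
        // JsonIndexer sketch above. Share one SolrClient across the
        // threads; the SolrJ clients are thread-safe.
    }
}

SolrJ also ships ConcurrentUpdateSolrClient, which queues and sends
updates on multiple threads for you, if you'd rather not manage the
pool yourself.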

95% of the time, slow indexing is the fault of the ETL pipeline, not
Solr. One key check is the CPU usage on your Solr server: if it isn't
running hot, you aren't feeding docs to Solr fast enough.

Do batch docs together as in the programs above; I typically start
with batches of 1,000 docs.

Best,
Erick


On Tue, May 8, 2018 at 8:25 PM, Raymond Xie <xie3208...@gmail.com> wrote:
> I have a huge number of JSON files to be indexed in Solr. It took me 22
> minutes to index 300,000 JSON files generated from a single bz2 file, and
> that is only 0.25% of the total data from the same business flow; there
> are 100+ business flows to be indexed.
>
> I absolutely need a good solution for this. At the moment I use post.jar
> on a folder, and I am running post.jar in a single thread.
>
> I wonder what the best practice is for multi-threaded indexing? Can
> anyone provide a detailed example?
>
>
>
> ------------------------------------------------
> Sincerely yours,
>
> Raymond
