Thanks for all the responses so far. Test runs so far do not suggest any bottleneck in Solr as I continue to work on different approaches. Collecting the data from the different sources seems to be consuming most of the time.

On 3/7/14, 5:53 PM, Erick Erickson wrote:
Kranti's and Susheel's approaches are certainly
reasonable, assuming I bet right :).

Another strategy is to rack together N
indexing programs that simultaneously
feed Solr.
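
For illustration, a rough sketch of what that might look like with plain
SolrJ and a thread pool (the URL, the worker count, and the fetchNextBatch()
reader are placeholders for whatever your setup actually is):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    static final int N_WORKERS = 4; // tune to however many feeders the source can sustain

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(N_WORKERS);
        for (int i = 0; i < N_WORKERS; i++) {
            pool.submit(() -> {
                // each worker gets its own client so the HTTP connections don't serialize
                HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
                try {
                    List<SolrInputDocument> batch;
                    while ((batch = fetchNextBatch()) != null) { // hypothetical, must be thread-safe
                        solr.add(batch);                         // no per-batch commit; rely on autoCommit
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    solr.shutdown();
                }
            });
        }
        pool.shutdown();
    }

    // placeholder for the data-acquisition side; returns null when the source is drained
    static List<SolrInputDocument> fetchNextBatch() { return null; }
}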

In any of these scenarios, the end goal is to get
Solr using up all the CPU cycles it can, _assuming_
that Solr isn't the bottleneck in the first place.

Best,
Erick

On Thu, Mar 6, 2014 at 6:38 PM, Kranti Parisa <kranti.par...@gmail.com> wrote:
That's what I do: pre-create JSONs following the schema and save them in
MongoDB as part of the ETL process. After that, just dump the JSONs
into Solr using batching etc. With this you can do full and incremental
indexing as well.
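
Roughly, the dump step can look like this in SolrJ (a sketch only; the
/update/json path, the batch shape, and the loadBatchesFromMongo() helper
are stand-ins for your own ETL code):

import java.util.Collections;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.ContentStreamBase;

public class JsonBatchLoader {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // each element is a JSON array string, e.g. [{"id":"1",...},{"id":"2",...}]
        List<String> jsonBatches = loadBatchesFromMongo(); // hypothetical ETL step
        for (String batch : jsonBatches) {
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/json");
            ContentStreamBase.StringStream stream = new ContentStreamBase.StringStream(batch);
            stream.setContentType("application/json");
            req.addContentStream(stream);
            solr.request(req);   // no commit per batch
        }
        solr.commit();           // single commit at the end of the run
        solr.shutdown();
    }

    // placeholder: in the real ETL this would read the pre-built JSONs out of MongoDB
    static List<String> loadBatchesFromMongo() { return Collections.emptyList(); }
}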

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Thu, Mar 6, 2014 at 9:57 AM, Rallavagu <rallav...@gmail.com> wrote:

Yeah. I have thought about spitting out JSON and running it against Solr
using parallel HTTP threads separately. Thanks.


On 3/5/14, 6:46 PM, Susheel Kumar wrote:

One more suggestion is to collect/prepare the data in CSV format (a 1-2
million document sample, depending on size) and then import the data directly
into Solr using the CSV handler & curl. This will give you the pure indexing
time & show where the difference is.
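
If you'd rather drive that test from Java than curl, a rough SolrJ
equivalent of posting the CSV sample to the CSV handler might look like
this (the file name, handler path, and params are assumptions):

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvImportTimer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("sample-1m.csv"), "text/csv"); // the pre-exported CSV sample
        req.setParam("commit", "true");                      // commit once, at the end of the load
        long start = System.currentTimeMillis();
        solr.request(req);
        System.out.println("Pure indexing time: " + (System.currentTimeMillis() - start) + " ms");
        solr.shutdown();
    }
}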

Thanks,
Susheel

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, March 05, 2014 8:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data

Here's the easiest way to figure out where to concentrate your
energies: just comment out the server.add call in your SolrJ program.
Well, and any commits you're doing from SolrJ.
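
Something like this, as a sketch (fetchNext() stands in for whatever your
data-acquisition code actually does):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BottleneckProbe {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        long start = System.currentTimeMillis();
        SolrInputDocument doc;
        while ((doc = fetchNext()) != null) {  // data acquisition still runs in full
            // solr.add(doc);                  // commented out for the timing test
        }
        // solr.commit();                      // and no commits either
        System.out.println("Acquisition-only run took "
                + (System.currentTimeMillis() - start) + " ms");
        solr.shutdown();
    }

    // placeholder for the db/source reads; returns null when there is nothing left
    static SolrInputDocument fetchNext() { return null; }
}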

My bet: your program will run at about the same speed it does when you
actually index the docs, indicating that your problem is on the data-acquisition
side. Of course, the older I get, the more times I've been wrong
:).

You can also monitor the CPU usage on the box running Solr. I often see
it idling along at < 30% when indexing, or even < 10%, again indicating that
the bottleneck is on the acquisition side.

Note I haven't mentioned any solutions; I'm a believer in identifying the
_problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky <j...@basetechnology.com>
wrote:

Make sure you're not doing a commit on each individual document add.
Committing every few minutes or every few hundred or few thousand
documents is sufficient. You can set up autoCommit in solrconfig.xml.
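
For example, a batched approach with commitWithin (the batch size and the
commit window here are just illustrative numbers, and fetchNext() is a
placeholder for your own document source):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedAdds {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        List<SolrInputDocument> batch = new ArrayList<>();
        SolrInputDocument doc;
        while ((doc = fetchNext()) != null) {   // hypothetical source reader
            batch.add(doc);
            if (batch.size() >= 1000) {         // send in chunks of ~1000 docs
                solr.add(batch, 60000);         // commitWithin 60s instead of committing per add
                batch.clear();
            }
        }
        if (!batch.isEmpty()) solr.add(batch, 60000);
        solr.commit();                          // one explicit commit at the very end
        solr.shutdown();
    }

    static SolrInputDocument fetchNext() { return null; } // placeholder
}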

-- Jack Krupansky

-----Original Message----- From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data


All,

Wondering about best/common practices to index/re-index a huge
amount of data in Solr. The data is about 6 million entries in the db
and other sources (the data is not located in one resource). I am trying a
SolrJ-based solution to collect the data from the different sources and
index it into Solr. It takes hours to index.

Thanks in advance

