Yeah. I have thought about writing the data out as JSON and posting it to
Solr separately using parallel HTTP threads. Thanks.
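A minimal sketch of that idea, in case it helps anyone reading the archive: split the documents into batches, serialize each batch as a JSON array, and post the batches from a small thread pool. The URL, core name (collection1), batch size, and worker count are assumptions for illustration, not values from this thread.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib import request

# Assumed endpoint; adjust host, port and core name for your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def chunks(docs, size):
    """Yield consecutive slices of docs, each at most `size` long."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]


def post_batch(batch, url=SOLR_UPDATE_URL):
    """POST one batch of documents to Solr as a JSON array."""
    body = json.dumps(batch).encode("utf-8")
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return resp.status


def index_parallel(docs, batch_size=1000, workers=4, send=post_batch):
    """Send batches concurrently; `send` can be swapped out for testing."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, chunks(docs, batch_size)))
```

No commit is issued here; that is left to autoCommit or a single commit at the end. Passing a stub as `send` lets you time the batching alone without touching Solr.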
On 3/5/14, 6:46 PM, Susheel Kumar wrote:
One more suggestion is to collect/prepare the data in CSV format (a sample of
1-2 million records, depending on size) and then import it directly into Solr
using the CSV handler and curl. This will give you the pure indexing time and
show how it differs from the time your current process takes.
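As a rough sketch of the preparation step, something like the following would dump a sample of rows to a CSV file with a header line. The function name, field names, and file name are hypothetical; the curl command in the trailing comment assumes a local Solr with a core called collection1 and the /update/csv handler enabled.

```python
import csv


def write_csv_sample(rows, fieldnames, out, limit=None):
    """Write up to `limit` dict rows to the open file `out`; return the count."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for row in rows:
        if limit is not None and count >= limit:
            break
        writer.writerow(row)
        count += 1
    return count


# Example use (hypothetical fields and file name):
#   with open("sample.csv", "w", newline="", encoding="utf-8") as f:
#       write_csv_sample(rows, ["id", "title"], f, limit=1000000)
#
# Then load it, assuming the handler and core name below match your setup:
#   curl 'http://localhost:8983/solr/collection1/update/csv?commit=true' \
#        --data-binary @sample.csv -H 'Content-type:text/csv; charset=utf-8'
```

Timing the curl call alone gives the indexing cost with data acquisition taken out of the picture.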
Thanks,
Susheel
-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, March 05, 2014 8:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data
Here's the easiest thing to try to figure out where to concentrate your
energies... Just comment out the server.add call in your SolrJ program. Well,
and any commits you're doing from SolrJ.
My bet: Your program will run at about the same speed it does when you actually
index the docs, indicating that your problem is in the data acquisition side.
Of course the older I get, the more times I've been wrong :).
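The same experiment can be sketched without editing code by hand each time: put the add call behind a flag and time the run both ways. This is only an illustration in Python, not SolrJ; fetch_docs and add_to_solr are hypothetical stand-ins for the data acquisition code and the server.add call.

```python
import time


def run_pipeline(fetch_docs, add_to_solr, send_to_solr=True):
    """Pull every document; optionally skip the Solr add. Returns (count, seconds)."""
    start = time.perf_counter()
    count = 0
    for doc in fetch_docs():
        if send_to_solr:
            add_to_solr(doc)  # the step to "comment out" for the dry run
        count += 1
    return count, time.perf_counter() - start
```

If the elapsed time with send_to_solr=False is close to the time with it set to True, the time is going into acquisition rather than indexing.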
You can also monitor the CPU usage on the box running Solr. I often see it idling
along < 30% when indexing, or even < 10%, again indicating that the bottleneck
is on the acquisition side.
Note I haven't mentioned any solutions; I'm a believer in identifying the
_problem_ before worrying about a solution.
Best,
Erick
On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky <j...@basetechnology.com> wrote:
Make sure you're not doing a commit on each individual document add.
Committing every few minutes, or every few hundred or few thousand
documents, is sufficient. You can set up auto commit in solrconfig.xml.
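If commits are driven from the client rather than by auto commit, the pattern looks roughly like the sketch below: add every document, but commit only once per commit_every documents, plus once at the end for the remainder. The add and commit callables are hypothetical stand-ins for the client calls, and the default of 5000 is an arbitrary illustration.

```python
def index_with_periodic_commit(docs, add, commit, commit_every=5000):
    """Add each doc; commit once per `commit_every` docs. Returns commits issued."""
    pending = 0
    commits = 0
    for doc in docs:
        add(doc)
        pending += 1
        if pending >= commit_every:
            commit()
            commits += 1
            pending = 0
    if pending:
        # Final commit for the documents left over after the last full batch.
        commit()
        commits += 1
    return commits
```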
-- Jack Krupansky
-----Original Message-----
From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data
All,
Wondering about best practices/common practices for indexing/re-indexing a
huge amount of data in Solr. The data is about 6 million entries in the db
and other sources (the data is not located in one place). I am trying a
SolrJ-based solution that collects data from the different resources and
indexes it into Solr. It takes hours to index.
Thanks in advance