Kranti's and Susheel's approaches are certainly reasonable, assuming I bet right :).
Another strategy is to run N indexing programs in parallel that simultaneously
feed Solr. In any of these scenarios, the end goal is to get Solr using up all
the CPU cycles it can, _assuming_ that Solr isn't the bottleneck in the first
place.

Best,
Erick

On Thu, Mar 6, 2014 at 6:38 PM, Kranti Parisa <kranti.par...@gmail.com> wrote:
> That's what I do: pre-create JSONs following the schema and save them in
> MongoDB; this is part of the ETL process. After that, just dump the JSONs
> into Solr using batching etc. With this you can do full and incremental
> indexing as well.
>
> Thanks,
> Kranti K. Parisa
> http://www.linkedin.com/in/krantiparisa
>
>
> On Thu, Mar 6, 2014 at 9:57 AM, Rallavagu <rallav...@gmail.com> wrote:
>
>> Yeah. I have thought about spitting out JSON and running it against Solr
>> using parallel HTTP threads separately. Thanks.
>>
>>
>> On 3/5/14, 6:46 PM, Susheel Kumar wrote:
>>
>>> One more suggestion is to collect/prepare the data in CSV format (a 1-2
>>> million document sample, depending on size) and then import the data
>>> directly into Solr using the CSV handler & curl. This will give you the
>>> pure indexing time & the differences.
>>>
>>> Thanks,
>>> Susheel
>>>
>>> -----Original Message-----
>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> Sent: Wednesday, March 05, 2014 8:03 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Indexing huge data
>>>
>>> Here's the easiest way to figure out where to concentrate your
>>> energies: just comment out the server.add call in your SolrJ program.
>>> Well, that and any commits you're doing from SolrJ.
>>>
>>> My bet: your program will run at about the same speed it does when you
>>> actually index the docs, indicating that your problem is on the data
>>> acquisition side. Of course, the older I get, the more times I've been
>>> wrong :).
>>>
>>> You can also monitor the CPU usage on the box running Solr. I often see
>>> it idling along at < 30% when indexing, or even < 10%, again indicating
>>> that the bottleneck is on the acquisition side.
>>>
>>> Note that I haven't mentioned any solutions; I'm a believer in
>>> identifying the _problem_ before worrying about a solution.
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky <j...@basetechnology.com>
>>> wrote:
>>>
>>>> Make sure you're not doing a commit on each individual document add.
>>>> Committing every few minutes, or every few hundred or few thousand
>>>> documents, is sufficient. You can set up auto commit in solrconfig.xml.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message-----
>>>> From: Rallavagu
>>>> Sent: Wednesday, March 5, 2014 2:37 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Indexing huge data
>>>>
>>>>
>>>> All,
>>>>
>>>> Wondering about best/common practices to index/re-index a huge amount
>>>> of data in Solr. The data is about 6 million entries in the db and
>>>> other sources (the data is not located in one place). I am trying a
>>>> SolrJ-based solution to collect data from the different sources and
>>>> index it into Solr. It takes hours to index into Solr.
>>>>
>>>> Thanks in advance
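For reference, here is a minimal SolrJ sketch of the batched, multi-threaded
feeding described in the thread (no per-document commits, documents sent in
batches, several sender threads keeping Solr busy). It is not code from the
thread: the Solr URL and collection name, field names, batch size, and thread
count are all illustrative assumptions, the fetchNextRow() helper is a
hypothetical stand-in for whatever data-acquisition code you have (DB cursor,
MongoDB JSON dump, CSV reader), and ConcurrentUpdateSolrClient's builder
methods vary somewhat between SolrJ versions.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {

    public static void main(String[] args) throws Exception {
        // Hypothetical Solr URL and collection name; adjust for your setup.
        String solrUrl = "http://localhost:8983/solr/mycollection";

        // ConcurrentUpdateSolrClient buffers documents in a queue and sends
        // them to Solr from several background threads, so the client keeps
        // Solr busy without hand-rolling HTTP threading. Queue size and
        // thread count here are illustrative starting points, not tuned values.
        try (ConcurrentUpdateSolrClient client =
                 new ConcurrentUpdateSolrClient.Builder(solrUrl)
                     .withQueueSize(10_000)
                     .withThreadCount(4)
                     .build()) {

            List<SolrInputDocument> batch = new ArrayList<>();
            int batchSize = 1_000; // assumed batch size; tune for your documents

            // fetchNextRow() stands in for the data-acquisition side.
            for (String[] row; (row = fetchNextRow()) != null; ) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", row[0]);    // field names assumed; match your schema
                doc.addField("title", row[1]);
                batch.add(doc);

                if (batch.size() >= batchSize) {
                    client.add(batch);         // no per-batch commit; see note below
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }

            client.blockUntilFinished();       // drain the internal queue
            client.commit();                   // single commit at the end
        }
    }

    // Placeholder for the data-acquisition side; replace with your own source.
    private static String[] fetchNextRow() {
        return null;
    }
}

If per-document commits are removed as Jack suggests, an autoCommit (and
optionally autoSoftCommit) setting in solrconfig.xml keeps the transaction log
bounded between the explicit commits; the right interval is workload-dependent.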