Kranti's and Susheel's approaches are certainly reasonable, assuming I bet right :).
Another strategy is to run N indexing programs in parallel that simultaneously
feed Solr. In any of these scenarios, the end goal is to get Solr using up all
the CPU cycles it can, _assuming_ that Solr isn't the bottleneck in the first
place.

Best,
Erick

On Thu, Mar 6, 2014 at 6:38 PM, Kranti Parisa <kranti.par...@gmail.com> wrote:
> That's what I do: pre-create JSONs following the schema and save them in
> MongoDB; this is part of the ETL process. After that, just dump the JSONs
> into Solr using batching etc. With this you can do full and incremental
> indexing as well.
>
> Thanks,
> Kranti K. Parisa
> http://www.linkedin.com/in/krantiparisa
>
>
> On Thu, Mar 6, 2014 at 9:57 AM, Rallavagu <rallav...@gmail.com> wrote:
>
>> Yeah. I have thought about spitting out JSON and running it against Solr
>> using parallel HTTP threads separately. Thanks.
>>
>>
>> On 3/5/14, 6:46 PM, Susheel Kumar wrote:
>>
>>> One more suggestion is to collect/prepare the data in CSV format (a 1-2
>>> million document sample, depending on size) and then import the data
>>> directly into Solr using the CSV handler & curl. This will give you the
>>> pure indexing time & the differences.
>>>
>>> Thanks,
>>> Susheel
>>>
>>> -----Original Message-----
>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> Sent: Wednesday, March 05, 2014 8:03 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Indexing huge data
>>>
>>> Here's the easiest way to figure out where to concentrate your
>>> energies: just comment out the server.add call in your SolrJ program.
>>> Well, that and any commits you're doing from SolrJ.
>>>
>>> My bet: your program will run at about the same speed it does when you
>>> actually index the docs, indicating that your problem is on the data
>>> acquisition side. Of course, the older I get, the more times I've been
>>> wrong :).
>>>
>>> You can also monitor the CPU usage on the box running Solr. I often see
>>> it idling along at < 30% when indexing, or even < 10%, again indicating
>>> that the bottleneck is on the acquisition side.
>>>
>>> Note that I haven't mentioned any solutions; I'm a believer in
>>> identifying the _problem_ before worrying about a solution.
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky <j...@basetechnology.com>
>>> wrote:
>>>
>>>> Make sure you're not doing a commit on each individual document add.
>>>> Committing every few minutes, or every few hundred or few thousand
>>>> documents, is sufficient. You can set up auto commit in solrconfig.xml.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message-----
>>>> From: Rallavagu
>>>> Sent: Wednesday, March 5, 2014 2:37 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Indexing huge data
>>>>
>>>>
>>>> All,
>>>>
>>>> Wondering about best/common practices to index/re-index a huge amount
>>>> of data in Solr. The data is about 6 million entries in the db and
>>>> other sources (the data is not located in one place). I am trying a
>>>> SolrJ-based solution to collect data from the different sources and
>>>> index it into Solr. It takes hours to index into Solr.
>>>>
>>>> Thanks in advance
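For reference, here is a minimal SolrJ sketch of the batched, multi-threaded
feeding described in the thread (no per-document commits, documents sent in
batches, several sender threads keeping Solr busy). It is not code from the
thread: the Solr URL and collection name, field names, batch size, and thread
count are all illustrative assumptions, the fetchNextRow() helper is a
hypothetical stand-in for whatever data-acquisition code you have (DB cursor,
MongoDB JSON dump, CSV reader), and ConcurrentUpdateSolrClient's builder
methods vary somewhat between SolrJ versions.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {

    public static void main(String[] args) throws Exception {
        // Hypothetical Solr URL and collection name; adjust for your setup.
        String solrUrl = "http://localhost:8983/solr/mycollection";

        // ConcurrentUpdateSolrClient buffers documents in a queue and sends
        // them to Solr from several background threads, so the client keeps
        // Solr busy without hand-rolling HTTP threading. Queue size and
        // thread count here are illustrative starting points, not tuned values.
        try (ConcurrentUpdateSolrClient client =
                 new ConcurrentUpdateSolrClient.Builder(solrUrl)
                     .withQueueSize(10_000)
                     .withThreadCount(4)
                     .build()) {

            List<SolrInputDocument> batch = new ArrayList<>();
            int batchSize = 1_000; // assumed batch size; tune for your documents

            // fetchNextRow() stands in for the data-acquisition side.
            for (String[] row; (row = fetchNextRow()) != null; ) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", row[0]);    // field names assumed; match your schema
                doc.addField("title", row[1]);
                batch.add(doc);

                if (batch.size() >= batchSize) {
                    client.add(batch);         // no per-batch commit; see note below
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }

            client.blockUntilFinished();       // drain the internal queue
            client.commit();                   // single commit at the end
        }
    }

    // Placeholder for the data-acquisition side; replace with your own source.
    private static String[] fetchNextRow() {
        return null;
    }
}

If per-document commits are removed as Jack suggests, an autoCommit (and
optionally autoSoftCommit) setting in solrconfig.xml keeps the transaction log
bounded between the explicit commits; the right interval is workload-dependent.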