On Mar 10, 2012, at 1:52 PM, neosky wrote:

> Hello, I have a big challenge here. I have a large file (1.2 GB) with more
> than 200 million records that need to be indexed. Later it may grow to a
> file of more than 9 GB with over 1 billion records.
> Each record contains 3 fields. I am quite new to Solr and Lucene, so I have
> some questions:
> 1. It seems that Solr only works with XML files, so must I transform the
> text file into XML?

There are other formats supported, including just using the SolrJ client and 
some of your own code that loops through the files.  I wouldn't bother 
converting to XML; just have your SolrJ program take in a record, convert it 
to a SolrInputDocument, and send it in.
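Roughly something like this (an untested sketch -- the Solr URL, the
tab-separated record format, and the field names "id", "field1", "field2"
are just placeholders for your three fields):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RecordIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    String line;
    while ((line = in.readLine()) != null) {
      String[] fields = line.split("\t");   // one record per line, 3 fields (assumed)
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", fields[0]);
      doc.addField("field1", fields[1]);
      doc.addField("field2", fields[2]);
      batch.add(doc);
      if (batch.size() == 1000) {           // send in batches rather than one doc at a time
        server.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);
    }
    server.commit();
    in.close();
  }
}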


> 2. Even if I transform the file into XML format, can Solr deal with such a
> big file?
> So, I have some ideas here. Maybe I should split the big file first.
> 1. One option is to split each record into its own file, but that would
> produce millions of files, which would still be hard to store and index.
> 2. Another option is to split the file into smaller files of about 10 MB
> each. But it also seems difficult to split based on file size without
> messing up the format.
> Do you have any experience indexing this kind of big file?  Any ideas or
> suggestions are helpful.

I would likely split into a set of smaller files (I would guess in the 
range of 10-30M records per file) and then process those files in parallel 
(multithreaded) using SolrJ, sending in batches of documents at once or 
using the StreamingUpdateSolrServer. 
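
Something along these lines for the parallel version (again just a sketch;
the chunk filenames passed as arguments, the queue size and thread counts,
and the field names are all assumptions to adjust for your setup):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
  public static void main(String[] args) throws Exception {
    // queue up to 20000 docs, 4 background threads feeding Solr
    final StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 20000, 4);
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (final String chunk : args) {          // one split file per task
      pool.submit(new Runnable() {
        public void run() {
          try {
            BufferedReader in = new BufferedReader(new FileReader(chunk));
            String line;
            while ((line = in.readLine()) != null) {
              String[] f = line.split("\t");
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", f[0]);
              doc.addField("field1", f[1]);
              doc.addField("field2", f[2]);
              server.add(doc);                 // queued and streamed in the background
            }
            in.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.HOURS);
    server.commit();
  }
}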

There are lots of good tutorials on using SolrJ available.


--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


