Hello, I have a big challenge here. I have a large file (1.2 GB) with more
than 200 million records that need to be indexed. Later it may grow to a
9 GB file with more than 1 billion records.
Each record contains 3 fields. I am quite new to Solr and Lucene, so I
have some questions:
1. It seems that Solr only works with XML files, so must I transform the
text file into XML?
2. Even if I transform the file into XML format, can Solr deal with such a
big file?
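If it turns out that I do not need XML at all, something like the following
SolrJ loop is what I would try: stream the raw file and add documents in
batches. This is only a rough sketch; HttpSolrServer, the Solr URL, the
batch size, and the field names id/description/text are all my own
assumptions, and I assume each record's header fits on a single ">" line.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RawFileIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        BufferedReader in = new BufferedReader(new FileReader(args[0]));

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        String line;
        String id = null, description = null;
        StringBuilder seq = new StringBuilder();

        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) {
                // A ">" header line starts a new record; flush the old one.
                if (id != null) {
                    batch.add(toDoc(id, description, seq.toString()));
                    seq.setLength(0);
                }
                // ">A0B531 A0B531_METTP^|^..." -> id is the first token.
                int space = line.indexOf(' ');
                id = (space > 0) ? line.substring(1, space) : line.substring(1);
                description = (space > 0) ? line.substring(space + 1) : "";
            } else {
                // Sequence lines belong to the current record.
                seq.append(line);
            }
            if (batch.size() >= 1000) {
                // Flush every 1000 docs so memory use stays bounded.
                solr.add(batch);
                batch.clear();
            }
        }
        if (id != null) {
            batch.add(toDoc(id, description, seq.toString()));
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }
        solr.commit();   // one commit at the very end
        in.close();
    }

    private static SolrInputDocument toDoc(String id, String description,
                                           String seq) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);   // field names assumed from my planned schema
        doc.addField("description", description);
        doc.addField("text", seq);
        return doc;
    }
}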
So I have some ideas. Maybe I should split the big file first.
1. One option is to split each record into its own file, but that would
produce millions of files, which would still be hard to store and index.
2. Another option is to split the file into smaller files of about 10 MB
each. But it seems difficult to split purely on file size without breaking
a record across two files; a sketch of what I mean follows this list.
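For option 2, what I have in mind is a splitter that only starts a new
chunk at a record boundary (a line beginning with ">"), so the format
never gets broken. A rough sketch (the 10 MB target and the output file
names are arbitrary choices of mine):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

public class RecordAwareSplitter {
    private static final long TARGET_BYTES = 10L * 1024 * 1024; // ~10 MB

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        int chunk = 0;
        long written = 0;
        PrintWriter out = new PrintWriter(new FileWriter("chunk-" + chunk + ".txt"));

        String line;
        while ((line = in.readLine()) != null) {
            // Roll over to a new chunk only when the size limit is reached
            // AND this line begins a new record, so no record is cut in half.
            if (written >= TARGET_BYTES && line.startsWith(">")) {
                out.close();
                chunk++;
                written = 0;
                out = new PrintWriter(new FileWriter("chunk-" + chunk + ".txt"));
            }
            out.println(line);
            written += line.length() + 1; // +1 for the newline
        }
        out.close();
        in.close();
    }
}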
Do you have any experience indexing this kind of big file? Any ideas or
suggestions would be helpful.
Thanks in advance!

A sample record is attached below.
Original raw data:
>A0B531 A0B531_METTP^|^^|^Putative uncharacterized
protein^|^^|^^|^Methanosaeta thermophila PT^|^349307^|^Arch/Euryar^|^28890
MLFALALSLLILTSGSRSIELNNATVIDLAEGKAVIEQPVSGKIFNITAIARIENISVIH
NSHTARCSVEESFWRGVYRYRITADSPVSGILRYEAPLRGQQFISPIVLNGTVVVAIPEG
YTTGARALGIPRPEPYEIFHENRTVVVWRLERESIVEVGFYRNDAPQILGYFFVLLLAAG
IFLAAGYYSSIKKLEAMRRGLK

I plan to format it as:
<ID>
>A0B531
</ID>
<description>
A0B531_METTP^|^^|^Putative uncharacterized protein^|^^|^^|^Methanosaeta
thermophila PT^|^349307^|^Arch/Euryar^|^28890
</description>
<text>
MLFALALSLLILTSGSRSIELNNATVIDLAEGKAVIEQPVSGKIFNITAIARIENISVIH
NSHTARCSVEESFWRGVYRYRITADSPVSGILRYEAPLRGQQFISPIVLNGTVVVAIPEG
YTTGARALGIPRPEPYEIFHENRTVVVWRLERESIVEVGFYRNDAPQILGYFFVLLLAAG
IFLAAGYYSSIKKLEAMRRGLK
</text>
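One thing I am unsure about: if I do go the XML route, my understanding is
that Solr's XML update format expects <add><doc><field name="..."> wrappers
rather than tag names of my own, so the same record would have to look more
like this (the field names are again my guesses for what I would declare in
schema.xml):

<add>
  <doc>
    <field name="id">A0B531</field>
    <field name="description">A0B531_METTP^|^^|^Putative uncharacterized protein^|^^|^^|^Methanosaeta thermophila PT^|^349307^|^Arch/Euryar^|^28890</field>
    <field name="text">MLFALALSLLILTSGSRSIELNNATVIDLAEGKAVIEQPVSGKIFNITAIARIENISVIH
NSHTARCSVEESFWRGVYRYRITADSPVSGILRYEAPLRGQQFISPIVLNGTVVVAIPEG
YTTGARALGIPRPEPYEIFHENRTVVVWRLERESIVEVGFYRNDAPQILGYFFVLLLAAG
IFLAAGYYSSIKKLEAMRRGLK</field>
  </doc>
</add>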




