> My data is in an enormous text file that is parsed in Python,

You mean it is in Python s-expressions? I don't think there is a
parser in DIH for that.

On Thu, Jul 19, 2012 at 9:27 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> First, turn off all your soft commit stuff; that won't help in your situation.
> If you do leave autocommit on, make the threshold a really high number
> (let's say 1,000,000 documents to start).
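
A minimal sketch of what that could look like in solrconfig.xml (the numbers
are placeholders to tune, and the whole block would be reverted after the
bulk load):

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- hard-commit only after a very large number of docs -->
      <autoCommit>
        <maxDocs>1000000</maxDocs>
        <maxTime>-1</maxTime>      <!-- no time-based hard commits -->
      </autoCommit>
      <!-- soft commits disabled during the load -->
      <autoSoftCommit>
        <maxTime>-1</maxTime>
      </autoSoftCommit>
    </updateHandler>
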
>
> You won't have to make 300M calls; you can batch, say, 1,000 docs
> into each request.
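
A rough Python sketch of that batching, assuming a single-core Solr at
http://localhost:8983/solr with the JSON update handler (the URL, batch size
and the parse_enormous_file() generator are placeholders, not part of any
real API):

    import json
    import requests

    SOLR_UPDATE_URL = "http://localhost:8983/solr/update/json"  # adjust to your setup
    BATCH_SIZE = 1000

    def send_batch(docs):
        # One POST per 1,000 docs instead of one POST per document.
        resp = requests.post(
            SOLR_UPDATE_URL,
            data=json.dumps(docs),
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()

    batch = []
    for doc in parse_enormous_file("data.txt"):  # hypothetical generator around the Python parser
        batch.append(doc)
        if len(batch) == BATCH_SIZE:
            send_batch(batch)
            batch = []
    if batch:
        send_batch(batch)

    # A single explicit commit at the very end of the load.
    requests.post(
        SOLR_UPDATE_URL,
        data=json.dumps({"commit": {}}),
        headers={"Content-Type": "application/json"},
    )
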
>
> DIH supports a bunch of different data sources; take a look at
> http://wiki.apache.org/solr/DataImportHandler, in particular the
> EntityProcessor, DataSource and the like.
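
A very rough, untested sketch of the kind of data-config.xml that page
describes, assuming a FileDataSource plus LineEntityProcessor (check the
wiki for the exact attribute names and the transformer needed to split each
rawLine into fields):

    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8"/>
      <document>
        <entity name="lines"
                processor="LineEntityProcessor"
                url="/path/to/enormous-file.txt"
                rootEntity="true"
                transformer="RegexTransformer">
          <!-- map pieces of each rawLine to index fields here -->
        </entity>
      </document>
    </dataConfig>
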
>
> There is also the CSV update processor; see
> http://wiki.apache.org/solr/UpdateCSV. It might be better to, say,
> break up your massive file into N CSV files and import those.
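
And a hedged sketch of streaming one of those CSV chunks at the CSV handler
from Python (the URL, file names and separator are assumptions; the
UpdateCSV page has the full parameter list):

    import requests

    CSV_UPDATE_URL = "http://localhost:8983/solr/update/csv"  # adjust to your setup

    def post_csv_chunk(path):
        # Stream one chunk; leave committing to the very end of the whole load.
        with open(path, "rb") as f:
            resp = requests.post(
                CSV_UPDATE_URL,
                params={"commit": "false", "separator": ","},
                headers={"Content-Type": "text/csv; charset=utf-8"},
                data=f,
            )
        resp.raise_for_status()

    for chunk in ["part-000.csv", "part-001.csv"]:  # placeholder file names
        post_csv_chunk(chunk)
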
>
> Best
> Erick
>
> On Thu, Jul 19, 2012 at 12:04 PM, Jonatan Fournier
> <jonatan.fourn...@gmail.com> wrote:
>> Hello,
>>
>> I was wondering if there are other ways to import data into Solr than
>> posting XML/JSON/CSV to the server URL (e.g. building the index
>> locally). Is the DataImporter only for databases?
>>
>> My data is in an enormous text file that is parsed in Python. I can
>> get clean JSON/XML out of it if I want, but the thing is that it
>> breaks down into about 300 million "documents", so I don't want to
>> execute 300 million HTTP POSTs in a for loop; even with relaxed soft
>> commits etc. it would take weeks, if not months, to populate the index.
>>
>> I need to do this only once on an offline server and never add data
>> to the index again (i.e. it becomes a read-only instance).
>>
>> Is there any temporary index configuration I could use to populate the
>> server with optimal add speed, and then switch the settings back to
>> ones optimized for a read-only instance?
>>
>> Thanks!
>>
>> --
>> jonatan



-- 
Lance Norskog
goks...@gmail.com
