First, turn off all your soft commit settings; they won't help in your
situation. If you do leave hard autocommit on, set it to a really high
number (let's say 1,000,000 docs to start).
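
Roughly what I mean in solrconfig.xml is something like this (untested,
the numbers are only a starting point, and the autoSoftCommit block only
applies if you're on a 4.x build):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- turn soft commits off entirely during the bulk load;
       a maxTime of -1 disables autoSoftCommit -->
  <autoSoftCommit>
    <maxTime>-1</maxTime>
  </autoSoftCommit>

  <!-- if you keep hard autocommit, make it very infrequent -->
  <autoCommit>
    <maxDocs>1000000</maxDocs>
  </autoCommit>
</updateHandler>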

You won't have to make 300M calls; you can batch, say, 1,000 docs into
each request.
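
Here's a rough, untested sketch of what that could look like in Python,
since you already have a parser there. The URL, the field names and the
parse_huge_file stand-in are just placeholders for your setup:

import json
import urllib2  # Python 2

SOLR_JSON_URL = 'http://localhost:8983/solr/update/json'  # adjust to your core
BATCH_SIZE = 1000

def post_json(payload):
    req = urllib2.Request(SOLR_JSON_URL, data=json.dumps(payload),
                          headers={'Content-Type': 'application/json'})
    urllib2.urlopen(req).read()

def parse_huge_file(path):
    # stand-in for your existing parser: yield one dict per document
    with open(path) as f:
        for i, line in enumerate(f):
            yield {'id': str(i), 'text': line.strip()}

batch = []
for doc in parse_huge_file('enormous.txt'):
    batch.append(doc)
    if len(batch) == BATCH_SIZE:
        post_json(batch)       # a JSON array of docs is one add request
        batch = []
if batch:
    post_json(batch)

post_json({'commit': {}})      # one hard commit at the very end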

DIH supports a bunch of different data sources besides databases; take
a look at http://wiki.apache.org/solr/DataImportHandler, particularly
the EntityProcessor and DataSource sections.
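
For a flat text file the usual combination is FileDataSource plus
LineEntityProcessor. A rough, untested data-config.xml sketch (the path
and field names are made up):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- LineEntityProcessor hands you each line as a "rawLine" field;
         add a transformer if you need to split it into real fields -->
    <entity name="line"
            processor="LineEntityProcessor"
            url="/path/to/your/enormous-file.txt">
      <field column="rawLine" name="raw_line" />
    </entity>
  </document>
</dataConfig>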

There is also the CSV update handler, see:
http://wiki.apache.org/solr/UpdateCSV. It might be better to, say,
break up your massive file into N CSV files and import those.
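
Same idea, sketched in Python (again untested; the endpoint and content
type are the ones the UpdateCSV wiki page describes): split the big
file into chunks and POST each one, repeating the header line so Solr
always sees the field names:

import urllib2  # Python 2

SOLR_CSV_URL = 'http://localhost:8983/solr/update/csv'  # adjust to your core
LINES_PER_CHUNK = 500000

def post_csv(body):
    req = urllib2.Request(SOLR_CSV_URL, data=body,
                          headers={'Content-Type': 'text/plain; charset=utf-8'})
    urllib2.urlopen(req).read()

with open('enormous.csv') as f:
    header = f.readline()
    chunk = []
    for line in f:
        chunk.append(line)
        if len(chunk) == LINES_PER_CHUNK:
            post_csv(header + ''.join(chunk))
            chunk = []
    if chunk:
        post_csv(header + ''.join(chunk))

# add commit=true to the URL of the last chunk, or issue a separate
# commit as in the JSON example above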

Best
Erick

On Thu, Jul 19, 2012 at 12:04 PM, Jonatan Fournier
<jonatan.fourn...@gmail.com> wrote:
> Hello,
>
> I was wondering if there are other ways to import data into Solr than
> posting xml/json/csv to the server URL (e.g. building the index
> locally). Is the DataImporter only for databases?
>
> My data is in an enormous text file that I parse in Python; I can get
> clean json/xml out of it if I want, but it breaks down into about 300
> million "documents", so I don't want to execute 300 million HTTP POSTs
> in a for loop. Even with relaxed soft commits etc., it would take
> weeks or months to populate the index.
>
> I need to do this only once, on an offline server, and never add data
> to the index again (i.e. it becomes a read-only instance).
>
> Is there a temporary index configuration I could use to populate the
> server at optimal add speed, and then switch the settings back to ones
> optimized for a read-only instance?
>
> Thanks!
>
> --
> jonatan
