I'm trying to add lots of documents at once (hundreds of thousands) in a loop. I don't need these docs to appear as results until I'm done, though.

For a simple test, I call the post.sh script in a loop with the same moderately sized XML file. Each call adds a single ~20K doc and then commits; repeat hundreds of thousands of times.
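Roughly, the test loop is just this (the count and file name here are placeholders for my real data):

    # drive post.sh in a loop; each call POSTs the <add> to /solr/update and then a <commit/>
    for i in $(seq 1 200000); do
        ./post.sh sample-20k.xml
    done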

This works fine for a while, but eventually (only about 10K docs in) the Solr instance starts taking longer and longer to respond to my <add>s (I print out the curl time; near the end an add takes 10s), and the web server (Resin 3.0) eventually falls over with "out of heap space" in its log (my max heap is 1GB on a 4GB machine).

I also see the "(Too many open files in system)" stack trace coming from Lucene's SegmentReader during this test. My fs.file-max was 361990, which I bumped up to 2 million, but I don't know how or why Solr/Lucene would open that many files.
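For reference, this is how I raised the limit (the new value was an arbitrary bump on my part); I also checked the per-process limit separately:

    # system-wide open-file limit (run as root)
    sysctl -w fs.file-max=2000000
    # per-process limit for the shell running Resin
    ulimit -n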


My question is about best practices for this sort of "bulk add." Since insert time is not a concern, I have some leeway. Should I commit after every add? Should I optimize every so many commits? Is there some reaper thread or timer that I should let breathe?
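For example, what I'm considering trying instead is skipping the per-document commit and issuing a single commit (and maybe an optimize) once everything is in, along these lines (the URL and file names are placeholders for my setup):

    URL=http://localhost:8983/solr/update

    # post each document without committing
    for f in docs/*.xml; do
        curl -s "$URL" -H 'Content-Type: text/xml; charset=utf-8' --data-binary @"$f"
    done

    # one commit at the end, then an optional optimize
    curl -s "$URL" -H 'Content-Type: text/xml; charset=utf-8' --data-binary '<commit/>'
    curl -s "$URL" -H 'Content-Type: text/xml; charset=utf-8' --data-binary '<optimize/>'

Is that the right direction, or does deferring the commit just push the memory problem around?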