I'm trying to add a large number of documents (hundreds of thousands)
in a loop. I don't need these docs to appear in search results until
I'm done, though.
For a simple test, I call the post.sh script in a loop with the same
moderately sized XML file. Each pass adds a ~20KB doc and then
commits; repeat hundreds of thousands of times.
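Roughly, the test loop is just this (only a sketch; mydoc.xml stands
in for my real file, and I'm assuming the stock exampledocs/post.sh
behavior of POSTing the file to /update and then sending a commit):

  # sketch of the test loop; mydoc.xml is a stand-in for my real file
  for i in $(seq 1 500000); do
    ./post.sh mydoc.xml    # POSTs the XML to /update, then commits
  done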
This works fine for a while, but eventually (only about 10K docs in)
the Solr instance starts taking longer and longer to respond to each
<add> (I print the curl timing; near the end an add takes 10s), and
the web server (Resin 3.0) eventually dumps an "out of heap space"
error to its log (my max heap is 1GB on a 4GB machine).
I also see the "(Too many open files in system)" stack trace coming
from Lucene's SegmentReader during this test. My fs.file-max was
361990, which I bumped up to 2M, but I don't know how or why
Solr/Lucene would open that many.
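For reference, here is roughly how I checked and raised the limit
(the values are from my machine); I assume running lsof against the
Solr pid is the way to see what the process actually has open (the
pid placeholder below is hypothetical):

  cat /proc/sys/fs/file-max        # was 361990
  sysctl -w fs.file-max=2000000    # bumped to 2M
  lsof -p <solr_pid> | wc -l       # count files the Solr JVM has open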
My question is about best practices for this sort of "bulk add."
Since insert time is not a concern, I have some leeway. Should I
commit after every add? Should I optimize every so many commits? Is
there some background reaper thread or timer that I should give a
chance to run?
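For example, is the right shape something more like the following:
no per-add commits, one commit (and maybe an optimize) at the very
end? This is only a sketch of what I have in mind; the URL assumes
the default example port, and batches/*.xml is a hypothetical set of
files each holding many <doc> elements per <add>:

  # post each batch file without committing
  for f in batches/*.xml; do
    curl http://localhost:8983/solr/update \
         -H 'Content-type: text/xml; charset=utf-8' --data-binary @"$f"
  done
  # single commit (and optional optimize) once everything is in
  curl http://localhost:8983/solr/update \
       -H 'Content-type: text/xml; charset=utf-8' --data-binary '<commit/>'
  curl http://localhost:8983/solr/update \
       -H 'Content-type: text/xml; charset=utf-8' --data-binary '<optimize/>'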
Brian Whitman