On 1-Jun-07, at 6:35 PM, Jordan Hayes wrote:
> New user here, and I ran into a problem trying to load a lot of
> documents (~900k). I tried to load them all at once, which seemed
> to run for a long time and then finally crap out with "Too many
> open files" ... so I read in an FAQ that "about 100" might be a
> good number. I split my documents up and added <commit/> to the
> end of each batch, and got about 10k into it before getting that
> error again.
I'm not clear exactly what you mean by batches. There are two types:
1. Batches of documents sent in a single <add> command to Solr. Good
values are between 10 and 100, depending on document size. This is
mostly about reducing HTTP overhead (which is small to begin with),
so increasing the batch size beyond that quickly stops paying off.
Try persistent HTTP connections instead; see the sketch after this
list.
2. Batches of docs sent between <commit/> calls. In theory unlimited,
but I once ran into a problem, which I could not reliably reproduce,
when <commit/>ing 4M docs. It occurred under fairly tight memory
conditions (for 4M docs), and I've since made a change to Solr's
deleted-docs algorithm that should optimize the I/O in such cases.
In any case, <commit/>ing every 300-400k docs would not hurt. I
would never commit as frequently as every 100 docs unless there
were query-timeliness requirements.
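
For concreteness, here is a minimal sketch of what one batch might
look like on the wire (the id and title fields are hypothetical;
substitute whatever your schema defines). The <add> message is
POSTed to Solr's /update handler, ideally over a persistent
(keep-alive) connection, and <commit/> goes out as its own POST
only every few hundred thousand documents:

  <!-- one batch: POST to /update, Content-type: text/xml -->
  <add>
    <doc>
      <field name="id">doc-000001</field>
      <field name="title">first document in this batch</field>
    </doc>
    <doc>
      <field name="id">doc-000002</field>
      <field name="title">second document in this batch</field>
    </doc>
    <!-- ... up to ~100 <doc> elements per <add> ... -->
  </add>

  <!-- sent as a separate POST, every few hundred thousand docs -->
  <commit/>

Reusing one HTTP connection for all of these requests removes the
per-request setup cost without changing the XML at all.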
> Am I just doing something wrong?
No. Lucene sometimes just requires many file descriptors (this will
be somewhat alleviated in Solr 1.2). I suggest raising the open-file
limit (I raised mine from 1024 to 45000 to handle huge indices). You
can also reduce the number of files Lucene keeps open by lowering
the mergeFactor, but that can slow down indexing.
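
(The open-file limit itself is an OS setting, e.g. ulimit -n on
Linux, and has to be raised for the account running the Solr JVM.)
If you do want to try a lower mergeFactor, it is set in
solrconfig.xml; a rough sketch, assuming the stock <indexDefaults>
layout, with 4 only as an illustrative value (the default is 10):

  <indexDefaults>
    <!-- fewer segments are kept open at once, at some cost to
         indexing speed; the stock value is 10 -->
    <mergeFactor>4</mergeFactor>
  </indexDefaults>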
> And: is there a way to just hand the XML file to Solr without
> having to POST it?
No, but POSTing shouldn't be a bottleneck.
-Mike