: That is still really small for 5MB documents. I think the default solr 
: document cache is 512 items, so you would need at least 3 GB of memory 
: if you didn't change that and the cache filled up.

that assumes the text Tika extracts from each document is the same size
as the original raw file *and* that he's configured that content field
to be "stored" ... in practice, if you only set stored=true on the
summary fields (title, author, short summary, etc...) the documentCache
isn't going to be nearly that big (and even if you do store the entire
content field, the plain text is usually *much* smaller than the binary
source file)
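
As a sketch of what I mean (the field names/types here are just
placeholders along the lines of the old example schema, adjust for your
own setup):

  <!-- schema.xml: only the small summary fields are stored; the big
       extracted body is indexed but not stored -->
  <field name="title"   type="text"   indexed="true" stored="true"/>
  <field name="author"  type="string" indexed="true" stored="true"/>
  <field name="content" type="text"   indexed="true" stored="false"/>

  <!-- solrconfig.xml: the documentCache only holds stored field values,
       so with a schema like the above it stays small even at 512 entries -->
  <documentCache class="solr.LRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>

With only the summary fields stored, 512 cache entries is more like a
few KB per doc than 5MB per doc.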

: -Xmx128M - my understanding is that this bumps heap size to 128M.

FWIW: depending on how many docs you are indexing, and whether you want
to support things like faceting that rely on building in-memory caches
to be fast, 128MB is really, really, really small for a typical Solr
instance.

Even on a box that is only doing indexing (no queries), I would imagine
Tika likes to have a lot of RAM when doing extraction (most doc types
are going to require that the raw binary data be entirely in the heap,
plus all the extracted Strings, plus all of the connecting objects to
build the DOM, etc....)  And that's before you even start thinking about
Solr & Lucene and the index itself.
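
A rough sketch of the kind of extraction call I mean, using just the
plain Tika facade (the class name is a placeholder, this isn't anything
Solr-specific like the ExtractingRequestHandler):

  import java.io.File;
  import org.apache.tika.Tika;

  public class ExtractDemo {
    public static void main(String[] args) throws Exception {
      Tika tika = new Tika();
      // for many formats the parser ends up holding the buffered raw
      // bytes, the full extracted String, and the intermediate parse
      // objects in the heap at the same time -- that's where the memory goes
      String text = tika.parseToString(new File(args[0]));
      System.out.println("extracted " + text.length() + " chars");
    }
  }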

-Hoss
