: That is still really small for 5MB documents. I think the default solr
: document cache is 512 items, so you would need at least 3 GB of memory
: if you didn't change that and the cache filled up.
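
For reference, the cache being described there is the documentCache in solrconfig.xml, and "stored" is a per-field setting in schema.xml. A rough sketch using the stock example values -- the "content" field name and its type are just guesses at his schema, not taken from his actual config:

    <!-- solrconfig.xml: stock example documentCache, 512 entries -->
    <documentCache class="solr.LRUCache"
                   size="512"
                   initialSize="512"
                   autowarmCount="0"/>

    <!-- schema.xml: index the extracted text but don't store it -->
    <field name="content" type="text" indexed="true" stored="false"/>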
That assumes that the extracted text Tika pulls out of each document is the same size as the original raw files *and* that he's configured that content field to be "stored" ... in practice, if you only set stored=true on the summary fields (title, author, short summary, etc...) the document cache isn't going to be nearly that big (and even if you do store the entire content field, the plain text is usually *much* smaller than the binary source file).

: -Xmx128M - my understanding is that this bumps heap size to 128M.

FWIW: depending on how many docs you are indexing, and whether you want to support things like faceting that rely on building in-memory caches to be fast, 128MB is really, really, really small for a typical Solr instance. Even on a box that is only doing indexing (no queries), I would imagine Tika likes to have a lot of RAM when doing extraction (most doc types are going to require the raw binary data be entirely in the heap, plus all the extracted Strings, plus all of the connecting objects to build the DOM, etc.) ... and that's before you even start thinking about Solr & Lucene and the index itself.

-Hoss