On 1/27/2013 10:28 PM, Rahul Bishnoi wrote:
> Thanks for your reply. After following your suggestions we were able to
> index 30k documents. I have some queries:
> 1) What is stored in the RAM while only indexing is going on? How to
> calculate the RAM/heap requirements for our documents?
> 2) The document cache, filter cache, etc. are populated while querying.
> Correct me if I am wrong. Are there any caches that are populated while
> indexing?

If anyone catches me making statements that are not true, please feel
free to correct me.
The caches are indeed only used during querying. If you are not making
queries at all, they aren't much of a factor.
I can't give you any definitive answers to your question about RAM usage
and how to calculate RAM/heap requirements. I can make some general
statements without looking at the code, just based on what I've learned
so far about Solr, and about Java in general.
You would have an exact copy of the input text for each field initially,
which would ultimately get used for the stored data (for those fields
that are stored). Each one is probably just a plain String, though I
can't say for certain because I haven't read the code. If the field is not being stored
or copied, then it would be possible to get rid of that data as soon as
it is no longer required for indexing. I don't have any idea whether
Solr/Lucene code actually gets rid of the exact copy in this way.
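For illustration, here is roughly what stored vs. non-stored fields look
like at the Lucene level. This is just a sketch with made-up field names
and a helper method, not Solr's actual indexing code:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;

    public class StoredVsUnstored {
        public static Document buildDoc(String titleText, String bodyText) {
            Document doc = new Document();
            // Stored: the original text has to be kept until the segment
            // is flushed, so it stays on the heap while indexing.
            doc.add(new TextField("title", titleText, Field.Store.YES));
            // Not stored: the text is only needed long enough to be analyzed.
            doc.add(new TextField("body", bodyText, Field.Store.NO));
            return doc;
        }
    }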
If you are storing termvectors, additional memory would be needed for
that. I don't know if that involves lots of objects or if it's one
object with index information. Based on my experience, termvectors can
be bigger than the stored data for the same field.
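To show where that extra data comes from, this is roughly how a field with
term vectors is declared in Lucene. Again a sketch with made-up names, not
what Solr does internally:

    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;

    public class TermVectorField {
        public static Field make(String text) {
            // Start from a stored, tokenized text field and add term vectors.
            FieldType t = new FieldType(TextField.TYPE_STORED);
            t.setStoreTermVectors(true);
            t.setStoreTermVectorPositions(true);
            t.setStoreTermVectorOffsets(true);
            t.freeze();
            return new Field("body", text, t);
        }
    }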
Tokenization and filtering are where I imagine most of the memory
would get used. If you're using a filter like EdgeNGram, that's a LOT
of tokens. Even if you're just tokenizing words, it can add up. There
is also space required for the inverted index, norms, and other
data/metadata. If each token is a separate Java object (which I do not
know), there would be a fair amount of memory overhead involved.
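A back-of-the-envelope example of why EdgeNGram multiplies token counts.
This is plain Java that just counts the grams, not the actual filter:

    public class EdgeNGramEstimate {
        // Number of edge n-grams produced for one word, given gram sizes.
        static int edgeNGrams(String word, int minGram, int maxGram) {
            int upper = Math.min(maxGram, word.length());
            return Math.max(0, upper - minGram + 1);
        }

        public static void main(String[] args) {
            // "elephant" with minGram=1 and maxGram=25 becomes 8 tokens
            // (e, el, ele, ...) instead of the single token that a plain
            // whitespace tokenizer would produce.
            System.out.println(edgeNGrams("elephant", 1, 25));
        }
    }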
A String object in Java has something like 40 bytes of overhead above
and beyond the space required for the data. Also, strings in Java are
internally represented in UTF-16, so each character actually takes two
bytes.
http://www.javamex.com/tutorials/memory/string_memory_usage.shtml
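Using the numbers from that article, a rough heap estimate for a single
String looks like this. It's a crude approximation that ignores alignment
and JVM differences:

    public class StringMemory {
        // Very rough: ~40 bytes of object/array overhead plus 2 bytes per
        // UTF-16 char. Real numbers vary by JVM and padding.
        static long estimateStringBytes(String s) {
            return 40 + 2L * s.length();
        }

        public static void main(String[] args) {
            // An 11-character value costs roughly 62 bytes on the heap,
            // not the 11 bytes you might expect from the raw text.
            System.out.println(estimateStringBytes("hello world"));
        }
    }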
The finished documents stack up in the ramBufferSizeMB space until it
gets full or a hard commit is issued, at which point they are flushed to
disk as a Lucene segment. One thing that I'm not sure about is whether
an additional ram buffer is allocated for further indexing while the
flush is happening, or if it flushes and then re-uses the buffer for
subsequent documents.
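At the Lucene level, that buffer corresponds roughly to
IndexWriterConfig.setRAMBufferSizeMB, and a hard commit forces a flush.
A minimal sketch in Lucene 4.x style, with a made-up index path:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class RamBufferExample {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig iwc =
                new IndexWriterConfig(Version.LUCENE_40,
                                      new StandardAnalyzer(Version.LUCENE_40));
            // Roughly what <ramBufferSizeMB>100</ramBufferSizeMB> in
            // solrconfig.xml maps to: buffered documents are flushed to a
            // new segment once they take up about this much RAM.
            iwc.setRAMBufferSizeMB(100.0);
            IndexWriter writer =
                new IndexWriter(FSDirectory.open(new File("/path/to/index")), iwc);
            // A hard commit also flushes whatever is currently buffered.
            writer.commit();
            writer.close();
        }
    }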
Another way that it can use memory is when merging index segments. I
don't know how much memory gets used for this process.
On Solr 4 with the default directory factory, part of a flushed segment
may remain in RAM until enough additional segment data is created. The
amount of memory used by this feature should be pretty small, unless you
have a lot of cores on a single JVM. That extra memory can be
eliminated by using MMapDirectoryFactory instead of
NRTCachingDirectoryFactory, at the expense of fast Near-RealTime index
updates.
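For reference, the NRT caching behavior looks roughly like this at the
Lucene level; Solr's factory wraps an on-disk directory in a similar way.
The path and the size thresholds here are assumptions for illustration,
not guaranteed Solr defaults:

    import java.io.File;

    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.store.NRTCachingDirectory;

    public class NrtCachingSketch {
        public static NRTCachingDirectory open() throws Exception {
            // Plain memory-mapped directory, as MMapDirectoryFactory uses.
            MMapDirectory onDisk = new MMapDirectory(new File("/path/to/index"));
            // NRTCachingDirectory keeps small, recently flushed segments in
            // a RAM cache (capped here at ~48MB) so near-realtime readers
            // can see them before they reach the underlying directory.
            return new NRTCachingDirectory(onDisk, 4.0, 48.0);
        }
    }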
Thanks,
Shawn