On 1/27/2013 10:28 PM, Rahul Bishnoi wrote:
Thanks for your reply. After following your suggestions we were able to
index 30k documents. I have some queries:

1) What is stored in RAM while only indexing is going on?  How do we
calculate the RAM/heap requirements for our documents?
2) The document cache, filter cache, etc. are populated while querying.
Correct me if I am wrong. Are there any caches that are populated while
indexing?

If anyone catches me making statements that are not true, please feel free to correct me.

The caches are indeed only used during querying. If you are not making queries at all, they aren't much of a factor.

I can't give you a definitive answer to your question about RAM usage and how to calculate RAM/heap requirements. I can make some general statements without looking at the code, based on what I've learned so far about Solr and about Java in general.

You would have an exact copy of the input text for each field initially, which would ultimately get used for the stored data (for those fields that are stored). Each one is probably just a plain String, though I can't say for sure, since I haven't read the code. If the field is not being stored or copied, that data could in principle be discarded as soon as it is no longer required for indexing, but I have no idea whether the Solr/Lucene code actually does that.
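Just to illustrate what I mean by "an exact copy of the input text", here's a little Lucene 4.x-style sketch (my own example, not actual Solr code). The Document holds a reference to every String you give it, so the whole field value stays on the heap at least until that document is flushed:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class StoredCopyExample {
        // Hypothetical helper: the Document keeps a reference to each input
        // String, so the full text of every field stays in memory at least
        // until the document has been indexed and flushed.
        static Document build(String id, String body) {
            Document doc = new Document();
            doc.add(new StringField("id", id, Store.YES));
            doc.add(new TextField("body", body, Store.YES)); // stored AND indexed
            return doc;
        }
    }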

If you are storing termvectors, additional memory would be needed for that. I don't know if that involves lots of objects or if it's one object with index information. Based on my experience, termvectors can be bigger than the stored data for the same field.
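For what it's worth, at the Lucene level term vectors get turned on per field through the FieldType. This is just a sketch of where that switch lives (in Solr you'd normally set termVectors="true" on the field in schema.xml instead); it doesn't tell you anything precise about the memory cost:

    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;

    public class TermVectorFieldExample {
        static Field bodyFieldWithVectors(String text) {
            // Copy the standard stored TextField type and turn on term vectors.
            FieldType ft = new FieldType(TextField.TYPE_STORED);
            ft.setStoreTermVectors(true);
            ft.setStoreTermVectorPositions(true);
            ft.setStoreTermVectorOffsets(true);
            ft.freeze();
            return new Field("body", text, ft);
        }
    }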

Tokenization and filtering are where I imagine most of the memory gets used. If you're using a filter like EdgeNGram, that's a LOT of tokens. Even if you're just tokenizing words, it can add up. There is also space required for the inverted index, norms, and other data/metadata. If each token is a separate Java object (which I do not know), there would be a fair amount of memory overhead involved.
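To get a feel for the multiplication factor, here's a trivial stand-alone calculation (my own, not Solr code). It compares how many tokens you'd get from plain whitespace tokenization versus an edge n-gram filter with minGramSize=1 and a large maxGramSize:

    public class TokenCountEstimate {
        public static void main(String[] args) {
            String text = "solr indexing memory usage example";
            String[] words = text.split("\\s+");
            int plainTokens = words.length;
            int edgeNGramTokens = 0;
            for (String w : words) {
                // An edge n-gram filter with minGramSize=1 and a large
                // maxGramSize emits one token per prefix: "s", "so", "sol", ...
                edgeNGramTokens += w.length();
            }
            System.out.println("plain tokens:       " + plainTokens);      // 5
            System.out.println("edge n-gram tokens: " + edgeNGramTokens);  // 30
        }
    }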

A String object in Java has something like 40 bytes of overhead above and beyond the space required for the data. Also, strings in Java are internally represented in UTF-16, so each character actually takes two bytes.

http://www.javamex.com/tutorials/memory/string_memory_usage.shtml
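If you want a back-of-the-envelope number, the figures from that page work out to roughly 40 bytes of fixed overhead plus two bytes per character. The exact numbers depend on the JVM, but as a rough estimate:

    public class StringMemoryEstimate {
        // Rough estimate only: object header, char[] header, and fields,
        // plus two bytes per character for the UTF-16 data.
        static long estimateBytes(String s) {
            return 40 + 2L * s.length();
        }

        public static void main(String[] args) {
            System.out.println(estimateBytes("solr"));     // ~48 bytes
            System.out.println(estimateBytes("indexing")); // ~56 bytes
        }
    }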

The finished documents stack up in the ramBufferSizeMB space until it gets full or a hard commit is issued, at which point they are flushed to disk as a Lucene segment. One thing I'm not sure about is whether an additional RAM buffer is allocated for further indexing while the flush is happening, or whether it flushes and then re-uses the same buffer for subsequent documents.
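In Solr, that buffer size comes from ramBufferSizeMB in solrconfig.xml. Underneath, it ends up as a setting on Lucene's IndexWriterConfig, something like this (the 100MB value is just an example, and the Version constant would depend on your release):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    public class RamBufferExample {
        static IndexWriterConfig configWithBuffer() {
            IndexWriterConfig iwc = new IndexWriterConfig(
                Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
            // Documents accumulate in this in-memory buffer until it fills up
            // (or a hard commit happens), then get flushed as a new segment.
            iwc.setRAMBufferSizeMB(100.0);
            return iwc;
        }
    }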

Another way that it can use memory is when merging index segments. I don't know how much memory gets used for this process.

On Solr 4 with the default directory factory, part of a flushed segment may remain in RAM until enough additional segment data is created. The amount of memory used by this feature should be pretty small, unless you have a lot of cores on a single JVM. That extra memory can be eliminated by using MMapDirectoryFactory instead of NRTCachingDirectoryFactory, at the expense of fast Near-RealTime index updates.
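For reference, NRTCachingDirectoryFactory is essentially a wrapper around a regular on-disk directory. At the Lucene level the idea looks roughly like this sketch; the size thresholds here are example values, not necessarily what Solr actually uses:

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.store.NRTCachingDirectory;

    public class NrtCachingExample {
        static Directory open(File path) throws IOException {
            // Small new segments are cached in RAM so Near-RealTime readers
            // can see them quickly; larger segments go straight to disk.
            Directory onDisk = new MMapDirectory(path);
            return new NRTCachingDirectory(onDisk,
                4.0 /* maxMergeSizeMB */, 48.0 /* maxCachedMB */);
        }
    }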

Thanks,
Shawn
