On 1/27/2013 10:28 PM, Rahul Bishnoi wrote:
> Thanks for your reply. After following your suggestions we were able to
> index 30k documents. I have some queries:
> 1) What is stored in the RAM while only indexing is going on? How to
> calculate the RAM/heap requirements for our documents?
> 2) The document cache, filter cache, etc. are populated while querying.
> Correct me if I am wrong. Are there any caches that are populated while
> indexing?

If anyone catches me making statements that are not true, please feel
free to correct me.
The caches are indeed only used during querying. If you are not making
queries at all, they aren't much of a factor.
I can't give you any definitive answers to your question about RAM usage
and how to calculate RAM/heap requirements. I can make some general
statements without looking at the code, just based on what I've learned
so far about Solr, and about Java in general.
You would have an exact copy of the input text for each field initially,
which would ultimately get used for the stored data (for those fields
that are stored). Each one is probably just a plain String, though I
can't say for certain because I haven't read the code. If the field is not being stored
or copied, then it would be possible to get rid of that data as soon as
it is no longer required for indexing. I don't have any idea whether
Solr/Lucene code actually gets rid of the exact copy in this way.
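For illustration, here is roughly what stored vs. non-stored fields look
like at the Lucene level. This is just a sketch with made-up field names
and a helper method, not Solr's actual indexing code:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;

    public class StoredVsUnstored {
        public static Document buildDoc(String titleText, String bodyText) {
            Document doc = new Document();
            // Stored: the original text has to be kept until the segment
            // is flushed, so it stays on the heap while indexing.
            doc.add(new TextField("title", titleText, Field.Store.YES));
            // Not stored: the text is only needed long enough to be analyzed.
            doc.add(new TextField("body", bodyText, Field.Store.NO));
            return doc;
        }
    }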
If you are storing termvectors, additional memory would be needed for
that. I don't know if that involves lots of objects or if it's one
object with index information. Based on my experience, termvectors can
be bigger than the stored data for the same field.
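To show where that extra data comes from, this is roughly how a field with
term vectors is declared in Lucene. Again a sketch with made-up names, not
what Solr does internally:

    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;

    public class TermVectorField {
        public static Field make(String text) {
            // Start from a stored, tokenized text field and add term vectors.
            FieldType t = new FieldType(TextField.TYPE_STORED);
            t.setStoreTermVectors(true);
            t.setStoreTermVectorPositions(true);
            t.setStoreTermVectorOffsets(true);
            t.freeze();
            return new Field("body", text, t);
        }
    }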
Tokenization and filtering are where I imagine most of the memory
would get used. If you're using a filter like EdgeNGram, that's a LOT
of tokens. Even if you're just tokenizing words, it can add up. There
is also space required for the inverted index, norms, and other
data/metadata. If each token is a separate Java object (which I do not
know), there would be a fair amount of memory overhead involved.
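A back-of-the-envelope example of why EdgeNGram multiplies token counts.
This is plain Java that just counts the grams, not the actual filter:

    public class EdgeNGramEstimate {
        // Number of edge n-grams produced for one word, given gram sizes.
        static int edgeNGrams(String word, int minGram, int maxGram) {
            int upper = Math.min(maxGram, word.length());
            return Math.max(0, upper - minGram + 1);
        }

        public static void main(String[] args) {
            // "elephant" with minGram=1 and maxGram=25 becomes 8 tokens
            // (e, el, ele, ...) instead of the single token that a plain
            // whitespace tokenizer would produce.
            System.out.println(edgeNGrams("elephant", 1, 25));
        }
    }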
A String object in Java has something like 40 bytes of overhead above
and beyond the space required for the data. Also, strings in Java are
internally represented in UTF-16, so each character actually takes two
bytes.
http://www.javamex.com/tutorials/memory/string_memory_usage.shtml
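Using the numbers from that article, a rough heap estimate for a single
String looks like this. It's a crude approximation that ignores alignment
and JVM differences:

    public class StringMemory {
        // Very rough: ~40 bytes of object/array overhead plus 2 bytes per
        // UTF-16 char. Real numbers vary by JVM and padding.
        static long estimateStringBytes(String s) {
            return 40 + 2L * s.length();
        }

        public static void main(String[] args) {
            // An 11-character value costs roughly 62 bytes on the heap,
            // not the 11 bytes you might expect from the raw text.
            System.out.println(estimateStringBytes("hello world"));
        }
    }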
The finished documents stack up in the ramBufferSizeMB space until it
gets full or a hard commit is issued, at which point they are flushed to
disk as a Lucene segment. One thing that I'm not sure about is whether
an additional ram buffer is allocated for further indexing while the
flush is happening, or if it flushes and then re-uses the buffer for
subsequent documents.
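At the Lucene level, that buffer corresponds roughly to
IndexWriterConfig.setRAMBufferSizeMB, and a hard commit forces a flush.
A minimal sketch in Lucene 4.x style, with a made-up index path:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class RamBufferExample {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig iwc =
                new IndexWriterConfig(Version.LUCENE_40,
                                      new StandardAnalyzer(Version.LUCENE_40));
            // Roughly what <ramBufferSizeMB>100</ramBufferSizeMB> in
            // solrconfig.xml maps to: buffered documents are flushed to a
            // new segment once they take up about this much RAM.
            iwc.setRAMBufferSizeMB(100.0);
            IndexWriter writer =
                new IndexWriter(FSDirectory.open(new File("/path/to/index")), iwc);
            // A hard commit also flushes whatever is currently buffered.
            writer.commit();
            writer.close();
        }
    }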
Another way that it can use memory is when merging index segments. I
don't know how much memory gets used for this process.
On Solr 4 with the default directory factory, part of a flushed segment
may remain in RAM until enough additional segment data is created. The
amount of memory used by this feature should be pretty small, unless you
have a lot of cores on a single JVM. That extra memory can be
eliminated by using MMapDirectoryFactory instead of
NRTCachingDirectoryFactory, at the expense of fast Near-RealTime index
updates.
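For reference, the NRT caching behavior looks roughly like this at the
Lucene level; Solr's factory wraps an on-disk directory in a similar way.
The path and the size thresholds here are assumptions for illustration,
not guaranteed Solr defaults:

    import java.io.File;

    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.store.NRTCachingDirectory;

    public class NrtCachingSketch {
        public static NRTCachingDirectory open() throws Exception {
            // Plain memory-mapped directory, as MMapDirectoryFactory uses.
            MMapDirectory onDisk = new MMapDirectory(new File("/path/to/index"));
            // NRTCachingDirectory keeps small, recently flushed segments in
            // a RAM cache (capped here at ~48MB) so near-realtime readers
            // can see them before they reach the underlying directory.
            return new NRTCachingDirectory(onDisk, 4.0, 48.0);
        }
    }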
Thanks,
Shawn