We have been having problems with SOLR on one project lately. Forgive me for writing a novel here, but it's really important that we identify the root cause of this issue. SOLR becomes unavailable at random intervals, and the problem appears to be memory-related. There are basically two ways it goes:

1) A straight-up OOM error, either from the JVM or sometimes from the kernel itself.

2) Instead of throwing an OOM, memory usage climbs very high and then drops precipitously (say, from 92% of the 20GB down to 60%). Once the memory usage is done dropping, SOLR seems to stop responding to requests altogether.

It started out mostly as scenario #1, but now we're mostly seeing scenario #2, and it's getting more and more frequent. In either case the servlet container (Jetty) needs to be restarted to restore service.

The number of documents in the index is always going up. They are relatively small (1K apiece at most - mostly short numeric strings, plus 5 text fields, one per language, that are rarely more than 50-100 characters), and there are about 5 million of them at the moment, with around 1,000 added every day. The machine has 20GB of RAM, Xmx is set to 18GB, and SOLR is the only thing this machine / servlet container does. There are a couple of other cores configured, but they are minuscule in comparison (one with 200,000 docs and two more with fewer than 10,000 docs apiece); eliminating these other cores does not seem to make any significant difference.

This is with the SOLR 1.4.1 release, using the SOLR-236 patch that was recently released to go with this version. The patch was slightly modified to ensure that paging continued to work properly - basically, an optimization that eliminated paging was removed per the instructions in this comment:

https://issues.apache.org/jira/browse/SOLR-236?focusedCommentId=12867680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12867680

I realize this is not ideal if you want to control memory usage, but the design requirements of the project preclude us from eliminating either collapsing or paging. It's also probably worth noting that these problems did not start with 1.4.1 or this version of the 236 patch - we actually upgraded from 1.4 because 1.4.1 was said to fix some memory leaks, hoping that would help with this problem.

We have some test machines set up and have been trying out various configuration changes. Watching the stats in the admin area, here is what we've been able to figure out:

1) The fieldValueCache stays constant at 23 entries (one for each faceted field) and takes up about 750MB in total.

2) Lowering or outright eliminating the filterCache and the queryResultCache does not seem to have any serious impact - perhaps a difference of a few percent at the start, but after prolonged usage memory still climbs, seemingly without bound (a sketch of the kind of settings we've been experimenting with is below, after this list). The queryResultCache does not appear to get much use anyway, and even though eviction rates in the filterCache are higher, this doesn't seem to hurt performance significantly.

3) Lowering or eliminating the documentCache also doesn't seem to have much impact on memory usage, although it does make searches much slower.

4) We followed the instructions for configuring the HashDocSet parameter, but this doesn't seem to have much impact either.

5) All the caches, with the exception of the documentCache, are FastLRUCaches. Switching between FastLRUCache and the normal LRUCache doesn't generally change the memory usage.

6) Glancing through the memory-usage data for the Lucene fieldCache indicates that this cache, too, is using well under 1GB of RAM.
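To make items 2-5 concrete, the settings we've been playing with live in the <query> section of solrconfig.xml and look roughly like this - the sizes here are illustrative rather than our exact values, and we've tried everything from numbers like these down to removing the entries entirely:

   <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
   <queryResultCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
   <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
   <!-- per item 4 above; maxSize here is illustrative -->
   <HashDocSet maxSize="3000" loadFactor="0.75"/>

None of these combinations changes the long-term picture much.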

Basically, when the servlet first starts, it uses very little RAM (<4%). We warm the searcher with a few standard queries that initialize everything in the fieldValueCache right off the bat, and query performance levels off at a reasonable speed, with memory usage around 10-12%. At this point, almost all queries execute within a few hundred milliseconds, if not faster. A very few queries that return large numbers of collapsed documents - generally 800K up to about 2 million (we have about 5 distinct queries that do this) - take up to 20 seconds to run the first time, and up to 10 seconds thereafter. Even after running all of these queries, memory usage stays around 20-30%. At this point, performance is optimal. We then simulate production usage, replaying queries taken from our production logs at a rate similar to production use.
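(Going back to the warming step for a second: whether the queries are fired by hand or via a firstSearcher listener, the equivalent solrconfig.xml wiring is roughly the following - the query and field name here are placeholders, not our exact ones:

   <listener event="firstSearcher" class="solr.QuerySenderListener">
     <arr name="queries">
       <!-- placeholder: one faceting query per faceted field so the fieldValueCache is populated up front -->
       <lst>
         <str name="q">*:*</str>
         <str name="facet">true</str>
         <str name="facet.field">some_faceted_field</str>
         <str name="rows">10</str>
       </lst>
     </arr>
   </listener>

The point is just that everything the fieldValueCache needs gets touched before real traffic arrives.)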

For the most part, memory usage stays level. Usage goes up as queries are run (this seems to correspond to when results are being collapsed), but then comes back down as the results are returned. Then, over the course of a few hours, at seemingly random intervals, memory usage will go up and stay up, plateauing at some new level. Performance doesn't really change at this point - queries are still the same speed they were before. SOLR is simply using more memory than before, without doing anything more than before. Looking at the stats, the caches do not appear to be any larger than they were right after warming. Eventually, RAM usage hits 40%, then 50% an hour or two later, until after about 8-12 hours it tops out around 90%.

One guess was that SOLR is starting more threads to handle more requests. However, this isn't borne out in the process list - when I check, the thread count holds steady at 33, and all the threads have the same start time. I'm not intimately familiar with how Jetty handles threading, but it also seems that all the threads share the same caches - otherwise, one would expect to see stats for the different caches on the statistics page (or at least see them change depending on which thread served the request).
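(For what it's worth, the check is nothing fancier than something along the lines of

   ps -Lf -p <jetty pid>

which lists each thread with its start time; the exact command matters less than the fact that the count never moves off 33.)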

If SOLR isn't using this RAM for caching, and it isn't using it for new documents (we've completely eliminated commits from the equation - in fact, this seems to happen more when there are fewer commits, which unfortunately means overnight), what is this RAM going towards? It doesn't make any sense to me that it's answering the same queries at the same speed, but 4 hours later it needs twice as much memory to do the same thing. If this is a problem with the collapse patch, what is it doing that requires keeping such high volumes of data in memory long after its work is done? And if it's not the collapse patch, then what could it be?

Unfortunately, it's hard to tell how much RAM most of the caches are using, because this information is not uniformly displayed on the statistics page - we know how many entries there are, but not how big the entries are or whether their size changes over time. In any event, the problem still happens after we turn all caching off, so at this point it seems safe to say the excess RAM is not going to cache.
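If it would help anyone narrow this down, I can grab a class histogram or a heap dump from the running process the next time it's in the bloated state - presumably something like

   jmap -histo:live <jetty pid>
   jmap -dump:live,format=b,file=solr-heap.hprof <jetty pid>

would at least show which classes are holding on to all that heap (assuming jmap cooperates with this JVM). Happy to post the top entries if someone can suggest what to look for.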

At the moment, I feel like we've tweaked everything we can think of in solrconfig.xml with little change in how it behaves. I'm going to go look now and see whether this might be an issue with the servlet container itself - this is Jetty 6.1.12, so we're a little behind. But if anyone has any ideas as to where else this memory could be going, and what practical steps we could take to at least keep the server from OOMing, any information would be helpful.
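Would it also make sense to turn on GC logging on the test machines, adding something along these lines to the Jetty start command (assuming the usual Sun JVM options)?

   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamPS -Xloggc:logs/gc.log

The idea would be to see whether the plateau steps, and the big drop in scenario #2, line up with full GC activity. If anyone has a better way to correlate the two, I'm all ears.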

Thanks!!

--
Steve
