We have been having problems with SOLR on one project lately. Forgive me for writing a novel here, but it's really important that we identify the root cause of this issue. SOLR becomes unavailable at random intervals, and the problem appears to be memory-related. There are basically two ways it goes:

1) A straight-up OOM error, either from the JVM or sometimes from the kernel itself.

2) Instead of throwing an OOM, memory usage climbs very high and then drops precipitously (say, from 92% of the 20GB down to 60%). Once the memory usage is done dropping, SOLR seems to stop responding to requests altogether.

It started out mostly as scenario #1, but now we're mostly seeing scenario #2, and it's getting more and more frequent. In either case the servlet container (Jetty) needs to be restarted to restore service.

The number of documents in the index is always going up. They are relatively small (1K apiece at most - mostly short numeric strings, plus 5 text fields, one per language, that are rarely more than 50-100 characters), and there are about 5 million of them at the moment, with around 1,000 added every day. The machine has 20GB of RAM, Xmx is set to 18GB, and SOLR is the only thing this machine / servlet container does. There are a couple of other cores configured, but they are minuscule in comparison (one with 200,000 docs and two more with fewer than 10,000 docs apiece); eliminating these other cores does not seem to make any significant difference.

This is with the SOLR 1.4.1 release, using the SOLR-236 patch that was recently released to go with this version. The patch was slightly modified to ensure that paging continued to work properly - basically, an optimization that eliminated paging was removed per the instructions in this comment:

https://issues.apache.org/jira/browse/SOLR-236?focusedCommentId=12867680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12867680

I realize this is not ideal if you want to control memory usage, but the design requirements of the project preclude us from eliminating either collapsing or paging. It's also probably worth noting that these problems did not start with 1.4.1 or this version of the 236 patch - we actually upgraded from 1.4 because 1.4.1 was said to fix some memory leaks, hoping that would help with this problem.

We have some test machines set up and have been trying out various configuration changes. Watching the stats in the admin area, here is what we've been able to figure out:

1) The fieldValueCache stays constant at 23 entries (one for each faceted field) and takes up about 750MB in total.

2) Lowering or outright eliminating the filterCache and the queryResultCache does not seem to have any serious impact - perhaps a difference of a few percent at the start, but after prolonged usage memory still climbs, seemingly without bound (a sketch of the kind of settings we've been experimenting with is below, after this list). The queryResultCache does not appear to get much use anyway, and even though eviction rates in the filterCache are higher, this doesn't seem to hurt performance significantly.

3) Lowering or eliminating the documentCache also doesn't seem to have much impact on memory usage, although it does make searches much slower.

4) We followed the instructions for configuring the HashDocSet parameter, but this doesn't seem to have much impact either.

5) All the caches, with the exception of the documentCache, are FastLRUCaches. Switching between FastLRUCache and the normal LRUCache doesn't generally change the memory usage.

6) Glancing through the memory-usage data for the Lucene fieldCache indicates that this cache, too, is using well under 1GB of RAM.
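To make items 2-5 concrete, the settings we've been playing with live in the <query> section of solrconfig.xml and look roughly like this - the sizes here are illustrative rather than our exact values, and we've tried everything from numbers like these down to removing the entries entirely:

   <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
   <queryResultCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
   <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
   <!-- per item 4 above; maxSize here is illustrative -->
   <HashDocSet maxSize="3000" loadFactor="0.75"/>

None of these combinations changes the long-term picture much.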

Basically, when the servlet first starts, it uses very little RAM (<4%). We warm the searcher with a few standard queries that initialize everything in the fieldValueCache right off the bat, and query performance levels off at a reasonable speed, with memory usage around 10-12%. At this point, almost all queries execute within a few hundred milliseconds, if not faster. A very few queries that return large numbers of collapsed documents - generally 800K up to about 2 million (we have about 5 distinct queries that do this) - take up to 20 seconds to run the first time, and up to 10 seconds thereafter. Even after running all of these queries, memory usage stays around 20-30%. At this point, performance is optimal. We then simulate production usage, replaying queries taken from our production logs at a rate similar to production use.
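(Going back to the warming step for a second: whether the queries are fired by hand or via a firstSearcher listener, the equivalent solrconfig.xml wiring is roughly the following - the query and field name here are placeholders, not our exact ones:

   <listener event="firstSearcher" class="solr.QuerySenderListener">
     <arr name="queries">
       <!-- placeholder: one faceting query per faceted field so the fieldValueCache is populated up front -->
       <lst>
         <str name="q">*:*</str>
         <str name="facet">true</str>
         <str name="facet.field">some_faceted_field</str>
         <str name="rows">10</str>
       </lst>
     </arr>
   </listener>

The point is just that everything the fieldValueCache needs gets touched before real traffic arrives.)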

For the most part, memory usage stays level. Usage goes up as queries are run (this seems to correspond to when results are being collapsed), but then comes back down as the results are returned. Then, over the course of a few hours, at seemingly random intervals, memory usage will go up and stay up, plateauing at some new level. Performance doesn't really change at this point - queries are still the same speed they were before. SOLR is simply using more memory than before, without doing anything more than before. Looking at the stats, the caches do not appear to be any larger than they were right after warming. Eventually, RAM usage hits 40%, then 50% an hour or two later, until after about 8-12 hours it tops out around 90%.

One guess was that SOLR is starting more threads to handle more requests. However, this isn't borne out in the process list - when I check, the thread count holds steady at 33, and all the threads have the same start time. I'm not intimately familiar with how Jetty handles threading, but it also seems that all the threads share the same caches - otherwise, one would expect to see stats for the different caches on the statistics page (or at least see them change depending on which thread served the request).
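(For what it's worth, the check is nothing fancier than something along the lines of

   ps -Lf -p <jetty pid>

which lists each thread with its start time; the exact command matters less than the fact that the count never moves off 33.)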

If SOLR isn't using this RAM for caching, and it isn't using it for new documents (we've completely eliminated commits from the equation - in fact, this seems to happen more when there are fewer commits, which unfortunately means overnight), what is this RAM going towards? It doesn't make any sense to me that it's answering the same queries at the same speed, but 4 hours later it needs twice as much memory to do the same thing. If this is a problem with the collapse patch, what is it doing that requires keeping such high volumes of data in memory long after its work is done? And if it's not the collapse patch, then what could it be?

Unfortunately, it's hard to tell how much RAM most of the caches are using, because this information is not uniformly displayed on the statistics page - we know how many entries there are, but not how big the entries are or whether their size changes over time. In any event, the problem still happens after we turn all caching off, so at this point it seems safe to say the excess RAM is not going to cache.
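If it would help anyone narrow this down, I can grab a class histogram or a heap dump from the running process the next time it's in the bloated state - presumably something like

   jmap -histo:live <jetty pid>
   jmap -dump:live,format=b,file=solr-heap.hprof <jetty pid>

would at least show which classes are holding on to all that heap (assuming jmap cooperates with this JVM). Happy to post the top entries if someone can suggest what to look for.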

At the moment, I feel like we've tweaked everything we can think of in solrconfig.xml with little change in how it behaves. I'm going to go look now and see whether this might be an issue with the servlet container itself - this is Jetty 6.1.12, so we're a little behind. But if anyone has any ideas as to where else this memory could be going, and what practical steps we could take to at least keep the server from OOMing, any information would be helpful.
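Would it also make sense to turn on GC logging on the test machines, adding something along these lines to the Jetty start command (assuming the usual Sun JVM options)?

   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamPS -Xloggc:logs/gc.log

The idea would be to see whether the plateau steps, and the big drop in scenario #2, line up with full GC activity. If anyone has a better way to correlate the two, I'm all ears.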

Thanks!!

--
Steve
