We have been having problems with SOLR on one project lately. Forgive
me for writing a novel here, but it's really important that we identify
the root cause of this issue. SOLR becomes unavailable at seemingly
random intervals, and the problem appears to be memory-related. There
are basically two ways it goes:
1) A straight-up OOM error, either a Java OutOfMemoryError or,
occasionally, the kernel's OOM killer taking the process down.
2) No OOM is thrown, but memory usage gets very high and then drops
precipitously (say, from 92% of 20GB down to 60%). Once the drop
finishes, SOLR seems to stop responding to requests altogether.
It started out mostly as scenario #1, but now we mostly see scenario
#2, and it is getting more and more frequent. In either case the
servlet container (Jetty) has to be restarted to restore service.
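To put some numbers behind scenario #2, a sketch of how one might log heap usage and GC counts alongside this behavior (to see whether the drop and the hang line up with garbage collection activity) is below. The class name and interval are made up, and it would have to run inside the Solr JVM (e.g. kicked off from a context listener) or be adapted to read the same MXBeans over a remote JMX connection:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: logs heap usage and GC counts every 10 seconds.
public class HeapGcLogger implements Runnable {

    public static void start() {
        Thread t = new Thread(new HeapGcLogger(), "heap-gc-logger");
        t.setDaemon(true);
        t.start();
    }

    public void run() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        while (true) {
            MemoryUsage heap = mem.getHeapMemoryUsage();
            StringBuilder gc = new StringBuilder();
            for (GarbageCollectorMXBean g : ManagementFactory.getGarbageCollectorMXBeans()) {
                gc.append(g.getName()).append("=")
                  .append(g.getCollectionCount()).append(" collections/")
                  .append(g.getCollectionTime()).append("ms ");
            }
            System.out.println("heap used=" + (heap.getUsed() >> 20) + "MB"
                + " committed=" + (heap.getCommitted() >> 20) + "MB"
                + " max=" + (heap.getMax() >> 20) + "MB"
                + " | " + gc);
            try {
                Thread.sleep(10000L);
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}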
The number of documents in the index is always going up. They are
relatively small (1K apiece at most - mostly short numeric strings,
plus 5 text fields (one for each of 5 languages) that are rarely more
than 50-100 characters), and there are about 5 million of them at the
moment, with around 1000 added every day. The machine has 20GB of
RAM, Xmx is set to 18GB, and SOLR is the only thing this machine /
servlet container does. There are a couple of other cores configured,
but they are minuscule in comparison (one with 200,000 docs and two
more with < 10,000 docs apiece), and eliminating them does not seem to
make any significant difference. This is the SOLR 1.4.1 release,
using the SOLR-236 (field collapsing) patch that was recently released
to go with this version. The patch was slightly modified to ensure
that paging continued to work properly - basically, an optimization
that eliminated paging was removed per the instructions in this comment:
https://issues.apache.org/jira/browse/SOLR-236?focusedCommentId=12867680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12867680
I realize this is not ideal if you want to control memory usage, but
the design requirements of the project preclude us from dropping
either collapsing or paging. It's also probably worth noting that
these problems did not start with 1.4.1 or this version of the 236
patch - we actually upgraded from 1.4 because 1.4.1 was reported to
fix some memory leaks, hoping that would help with this problem.
We have some test machines set up and have been trying out various
configuration changes. Watching the stats in the admin area (see the
stats-snapshot sketch after this list for what I mean by that), this
is what we've been able to figure out:
1) The fieldValueCache usage stays constant at 23 entries (one for
each faceted field) and takes up about 750MB altogether.
2) Lowering or even eliminating the filterCache and the
queryResultCache does not seem to have any serious impact - perhaps a
difference of a few percent at the start, but after prolonged use the
memory still climbs, seemingly without bound. The queryResultCache
does not appear to get much use anyway, and even though the
filterCache has higher eviction rates, this doesn't seem to hurt
performance significantly.
3) Lowering or eliminating the documentCache also doesn't seem to have
much impact on memory usage, although it does make searches much
slower.
4) We followed the instructions for configuring the HashDocSet
parameter, but this doesn't seem to be having much impact either.
5) All the caches, with the exception of the documentCache, are
FastLRUCaches. Switching between FastLRUCache and normal LRUCache in
general doesn't seem to change the memory usage.
6) Glancing through the memory-usage data for the Lucene fieldCache
indicates that it, too, is using well under 1GB of RAM.
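To be concrete about the "watching the stats" part mentioned above this list: since the admin page only gives point-in-time numbers, one way to track them is to snapshot the page periodically and diff the snapshots. A rough sketch is below - the host, port, core path and the strings it filters on are just examples, not our actual setup:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Rough sketch: dump the cache-related lines from Solr's admin stats page
// so snapshots can be diffed over time. Host, port and path are placeholders.
public class StatsSnapshot {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8983/solr/admin/stats.jsp");
        BufferedReader in = new BufferedReader(
            new InputStreamReader(url.openStream(), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String t = line.trim();
                // crude filter: keep lines mentioning caches, lookups or evictions
                if (t.indexOf("Cache") >= 0 || t.indexOf("lookups") >= 0
                        || t.indexOf("evictions") >= 0) {
                    System.out.println(t);
                }
            }
        } finally {
            in.close();
        }
    }
}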
Basically, when the servlet container first starts, it uses very
little RAM (<4%). We warm the searcher with a few standard queries
that initialize everything in the fieldValueCache right off the bat,
and query performance levels off at a reasonable speed, with memory
usage around 10-12%. At this point, almost all queries execute within
a few hundred milliseconds, if not faster. A very few queries that
return large numbers of collapsed documents - generally 800K up to
about 2 million (we have about 5 distinct queries that do this) - take
up to 20 seconds the first time and up to 10 seconds thereafter. Even
after running all of these queries, memory usage stays around 20-30%.
At this point, performance is optimal. We then simulate production
usage, replaying queries taken from the production logs at a rate
similar to production traffic.
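In case it matters, by "simulating production usage" I just mean replaying logged queries at a steady rate - conceptually along the lines of the sketch below (the file name, base URL and delay are made up, and the real traffic is burstier):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStream;
import java.net.URL;

// Sketch of a trivial query replayer: reads one URL-encoded query string per
// line (e.g. "q=foo&rows=10") from a file and fires it at the select handler
// at a fixed rate, printing the response time for each.
public class QueryReplayer {
    public static void main(String[] args) throws Exception {
        String baseUrl = "http://localhost:8983/solr/select?";
        BufferedReader queries = new BufferedReader(new FileReader("queries.txt"));
        try {
            String params;
            while ((params = queries.readLine()) != null) {
                long start = System.currentTimeMillis();
                InputStream response = new URL(baseUrl + params).openStream();
                // drain and discard the response body
                byte[] buf = new byte[8192];
                while (response.read(buf) != -1) { /* discard */ }
                response.close();
                System.out.println((System.currentTimeMillis() - start) + "ms  " + params);
                Thread.sleep(500L); // crude pacing
            }
        } finally {
            queries.close();
        }
    }
}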
For the most part, memory usage stays level. Usage goes up while
queries are running (this seems to coincide with collapsing), then
comes back down as the results are returned. Then, over the course of
a few hours, at seemingly random intervals, memory usage will go up
and stay up, plateauing at some new level. Performance doesn't really
change at this point - it's still the same speed as before. SOLR is
simply using more memory than it was, without doing anything more than
it was. If we look at the cache stats, the caches do not seem to be
any larger than they were right after warming. Eventually, RAM usage
hits 40%, then 50% an hour or two later, until after about 8-12 hours
it tops out around 90%.
One guess was that SOLR is starting more threads to handle more
requests, but this isn't borne out by the process list - when I check,
the thread count holds steady at 33, and all of the threads have the
same start time. I'm not intimately familiar with how Jetty handles
threading, but it also seems that all the threads share the same
caches - otherwise one would expect to see stats for the different
caches on the statistics page (or at least see them change depending
on which thread served the request).
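Along the same lines, the thread MXBean gives a way to double-check the process-list numbers from inside the JVM. This is again just a sketch (it has to run inside the Solr JVM or be adapted to use a remote JMX connection); the peak and total-started counts would also show whether threads are being created and torn down between checks:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: print live/peak/total-started thread counts plus one line per thread.
public class ThreadCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("live=" + threads.getThreadCount()
            + " peak=" + threads.getPeakThreadCount()
            + " totalStarted=" + threads.getTotalStartedThreadCount());
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info != null) {
                System.out.println(info.getThreadId() + "\t"
                    + info.getThreadState() + "\t" + info.getThreadName());
            }
        }
    }
}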
If SOLR is not using this RAM for caching, and it's not using it for
new documents (we've completely eliminated commits from the equation -
in fact, this seems to happen more when there are fewer commits, which
unfortunately means overnight), what is this RAM going towards? It
doesn't make any sense to me that it's answering the same queries at
the same speed, but 4 hours later it needs twice as much memory to do
the same thing. If this is a problem with the collapse patch, what is
it doing that requires leaving such high volumes of data in memory,
long after its work is done? If it's not the collapse patch, then
what could it be? Unfortunately, it's really hard to tell how much
RAM most of the caches are using, because that information is not
uniformly displayed on the statistics page - we know how many entries
there are, but not how big the entries are or whether their size
changes over time. In any event, it still happens with all caching
turned off, so at this point it seems safe to say that the excess RAM
is not going to the caches.
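One way to answer that question directly would be a heap dump taken once usage has plateaued, so the class histogram shows exactly which objects are holding the RAM - either jmap against the Jetty pid, or programmatically along the lines of the sketch below. The JMX port and output path are placeholders, and it assumes the Sun JVM (which exposes the HotSpotDiagnostic bean) with remote JMX enabled on the test box:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: trigger a heap dump in the Solr/Jetty JVM over remote JMX.
// Assumes Jetty was started with something like
// -Dcom.sun.management.jmxremote.port=9999 (auth/ssl off on a test box).
// The dump file is written on the server side.
public class RemoteHeapDump {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                conn, "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
            // live=true forces a full GC first, so only reachable objects end up
            // in the dump - which is exactly the "unexplained" memory here.
            diag.dumpHeap("/tmp/solr-heap.hprof", true);
        } finally {
            connector.close();
        }
    }
}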
At the moment, I feel like we've tweaked everything we can think of in
solrconfig.xml with little change in how it behaves. I'm going to go
look now and see whether this might be an issue with the servlet
container itself - this is Jetty 6.1.12, so we're a little behind.
But if anyone has any ideas as to where else this memory could be
going, and what practical steps we could take to at least keep the
server from OOMing, any information would be helpful.
Thanks!!
--
Steve