Hi Shawn,
a big thanks for the long and detailed answer. I am aware of how Linux
uses free RAM for caching and of the problems related to the JVM and GC.
It is nice to hear how this correlates to Solr. I'll take some time and
think it over. facet.method=enum, probably combined with docValues
fields, could be the solution needed in this case.
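For illustration, enabling docValues on a facet field in schema.xml would
look roughly like this (the field name is made up, and a re-index would be
needed for it to take effect):

   <field name="category" type="string" indexed="true" stored="true"
          docValues="true"/>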
Thanks again to both you and Toke for the feedback!
Cheers
Angel
On 05.03.2014 17:06, Shawn Heisey wrote:
On 3/5/2014 4:40 AM, Angel Tchorbadjiiski wrote:
Hi Shawn,
On 05.03.2014 10:05, Angel Tchorbadjiiski wrote:
Hi Shawn,
It may be your facets that are killing you here. As Toke mentioned, you
have not indicated what your max heap is. 20 separate facet fields with
millions of documents will use a lot of fieldcache memory if you use the
standard facet.method, fc.
Try adding facet.method=enum to all your facet queries, or you can put
it in the defaults section of each request handler definition.
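For illustration, in solrconfig.xml that might look roughly like this
(the handler name is just an example):

   <requestHandler name="/select" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="facet">true</str>
       <str name="facet.method">enum</str>
     </lst>
   </requestHandler>

or you can simply append &facet.method=enum to the query URL.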
Ok, that is easy to try out.
Changing the facet.method does not really help, as query performance is
still really bad. This is mostly due to the small cache sizes, but even
trying to tune them for the "enum" case didn't help much.
The number of documents and unique facet values seems to be too high.
Even with a cache size of 512 there are many misses, and Solr tries to
repopulate the cache all the time. This makes the performance even
worse.
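For reference, with facet.method=enum the cache being tuned here is
presumably the filterCache in solrconfig.xml; a rough sketch with example
values, using the size mentioned above:

   <filterCache class="solr.FastLRUCache"
                size="512"
                initialSize="512"
                autowarmCount="128"/>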
Good performance with Solr requires a fair amount of memory. You have
two choices when it comes to where that memory gets used: inside Solr
in the form of caches, or as free memory available to the operating
system for caching purposes.
Solr caches are really amazing things. Data gathered for one query can
significantly speed up another query, because part (or all) of that
query can be simply skipped, the results read right out of the cache.
There are two potential problems with relying exclusively on Solr
caches, though. One is that they require Java heap memory, which
requires garbage collection. A large heap causes GC issues, some of
which can be alleviated by GC tuning. The other problem is that you
must actually do a query in order to get the data into the cache. When
you do a commit and open a new searcher, that cache data goes away, so
you have to do the query over again.
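As a side note, cache autowarming and static warming queries in
solrconfig.xml can soften that effect; a rough sketch, with a placeholder
query:

   <listener event="newSearcher" class="solr.QuerySenderListener">
     <arr name="queries">
       <lst>
         <str name="q">*:*</str>
         <str name="facet">true</str>
         <str name="facet.field">some_facet_field</str>
       </lst>
     </arr>
   </listener>

The trade-off is slower commits, so it only helps if you don't commit too
often.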
The primary reason for slow uncached queries is disk access. Reading
index data off the disk is a glacial process, comparatively speaking.
This is where OS disk caching becomes a benefit. Most queries, even
complex ones, become lightning fast if all of the relevant index data is
already in RAM and no disk access is required. When queries are fast to
begin with, you can reduce the cache sizes in Solr, reducing the heap
requirements. With a smaller heap, more memory is available for the OS
disk cache.
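For illustration only, "reduce the cache sizes" could mean something like
this in solrconfig.xml (the numbers are made-up examples, not
recommendations):

   <queryResultCache class="solr.LRUCache"
                     size="256" initialSize="256" autowarmCount="32"/>
   <documentCache class="solr.LRUCache"
                  size="512" initialSize="512"/>

together with a correspondingly smaller -Xmx for the JVM.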
The facet.method=enum parameter shifts the RAM requirement from Solr to
the OS. It does not really reduce the amount of required system memory.
Because disk caching is a kernel level feature and does not utilize
garbage collection, it is far more efficient than Solr ever could be at
caching *raw* data. Solr's caches are designed for *processed* data.
What this all boils down to is that I suspect you'll simply need more
memory on the machine. With facets on so many fields, your queries are
probably touching nearly the entire index, so you'll want to put the
entire index into RAM.
Therefore, after Solr allocates its heap and any other programs on the
system allocate their required memory, you must have enough memory left
over to fit all (or most) of your 50GB of index data. Combine this with
facet.method=enum and everything should be good.
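As a back-of-the-envelope example (the heap size is only an assumed
number): with a 50GB index, an 8GB Solr heap and roughly 2GB for the OS
and other processes, a 64GB machine leaves about 54GB for the OS disk
cache, enough to hold the whole index. A 32GB machine would leave only
about 22GB, and a large part of every query would still hit the disk.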