Same here - I've seen some document caches that were huge and highly utilized. Check out the screenshot of the SPM for Solr dashboard that shows pretty high hit rates on all caches. I've circled the parts to look at. ML manager may strip the attachment, of course. :)
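(A side note on configuration: hit rates like the ones in that dashboard are governed by the cache declarations in solrconfig.xml. A minimal sketch with illustrative sizes, to be tuned against your own hit-rate measurements — note the documentCache takes no autowarmCount, since it is keyed on internal Lucene doc IDs, which change whenever a new searcher opens:)

```xml
<query>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
  <!-- Not autowarmed: entries are keyed on internal Lucene doc IDs. -->
  <documentCache class="solr.LRUCache" size="10240" initialSize="1024"/>
</query>
```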
In addition to multiple in-request lookups and hits in the document cache, document caches provide value when queries are frequently somewhat similar and thus return some of the same hits as previous queries.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Mon, Jan 21, 2013 at 1:39 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Hmm, interesting. I'll have to look closer...
>
> On Sun, Jan 20, 2013 at 3:50 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> > I routinely see hit rates over 75% on the document cache. Perhaps yours
> > is too small. Mine is set at 10240 entries.
> >
> > wunder
> >
> > On Jan 20, 2013, at 8:08 AM, Erick Erickson wrote:
> >
> >> About your question about the document cache: typically the document cache
> >> has a pretty low hit ratio. I've rarely, if ever, seen it get hit very
> >> often. And remember that this cache is only hit when assembling the
> >> response for a few documents (your page size).
> >>
> >> Bottom line: I wouldn't worry about this cache much. It's quite useful
> >> for processing a particular query faster, but not really intended for
> >> cross-query use.
> >>
> >> Really, I think you're getting the cart before the horse here. Run it
> >> up the flagpole and try it. Rely on the OS to do its job
> >> (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
> >> Find a bottleneck _then_ tune. Premature optimization and all
> >> that....
> >>
> >> Several tens of millions of docs isn't that large unless the text
> >> fields are enormous.
> >>
> >> Best
> >> Erick
> >>
> >> On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh <isaac.he...@gmail.com> wrote:
> >>> Ok. Thank you everyone for your helpful answers.
> >>> I understand that fieldValueCache is not used for resolving queries.
> >>> Is there any cache that can help this basic scenario (a lot of different
> >>> queries, on a small set of fields)?
> >>> Does Lucene's FieldCache help (implicitly)?
> >>> How can I use RAM to reduce I/O in this type of query?
> >>>
> >>>
> >>> On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe <
> >>> tomasflo...@gmail.com> wrote:
> >>>
> >>>> No, the fieldValueCache is not used for resolving queries. Only for
> >>>> multi-token faceting, and apparently for the stats component too. The
> >>>> document cache maintains in memory the stored content of the fields you
> >>>> are retrieving or highlighting on. It'll hit if the same document matches
> >>>> the query multiple times and the same fields are requested, but as Erick
> >>>> said, it is important for cases when multiple components in the same
> >>>> request need to access the same data.
> >>>>
> >>>> I think soft committing every 10 minutes is totally fine, but you should
> >>>> hard commit more often if you are going to be using the transaction log.
> >>>> openSearcher=false will essentially tell Solr not to open a new searcher
> >>>> after the (hard) commit, so you won't see the new indexed data and caches
> >>>> won't be flushed. openSearcher=false makes sense when you are using
> >>>> hard commits together with soft commits; as the soft commit is dealing
> >>>> with opening/closing searchers, you don't need hard commits to do it.
> >>>>
> >>>> Tomás
> >>>>
> >>>>
> >>>> On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh <isaac.he...@gmail.com> wrote:
> >>>>
> >>>>> Unfortunately, it seems (
> >>>>> http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that
> >>>>> these caches are not per-segment. In this case, I want to (soft) commit
> >>>>> less frequently. Am I right?
> >>>>>
> >>>>> Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I
> >>>>> guess it has a big contribution to standard (not only faceted) query
> >>>>> time. SolrWiki claims that it is primarily used by faceting. What does
> >>>>> that say about complex textual queries?
> >>>>>
> >>>>> documentCache:
> >>>>> Erick, after query processing is finished, don't some documents stay in
> >>>>> the documentCache? Can't I use it to accelerate queries that need to
> >>>>> retrieve stored fields of documents? In this case, a big documentCache
> >>>>> can hold more documents...
> >>>>>
> >>>>> About commit frequency:
> >>>>> HardCommit: "openSearcher=false" seems like a nice solution. Where can I
> >>>>> read about this? (I found nothing but one unexplained sentence in SolrWiki.)
> >>>>> SoftCommit: In my case, the required index freshness is 10 minutes. The
> >>>>> plan to soft commit every 10 minutes is similar to storing all of the
> >>>>> documents in a queue (outside Solr), and indexing a bulk every 10
> >>>>> minutes.
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>
> >>>>> On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe <
> >>>>> tomasflo...@gmail.com> wrote:
> >>>>>
> >>>>>> I think fieldValueCache is not per segment; only fieldCache is. However,
> >>>>>> unless I'm missing something, this cache is only used for faceting on
> >>>>>> multivalued fields.
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson <
> >>>>>> erickerick...@gmail.com> wrote:
> >>>>>>
> >>>>>>> filterCache: This is bounded by (maxDoc / 8) bytes * (num filters in
> >>>>>>> cache). Notice the /8. This reflects the fact that the filters are
> >>>>>>> represented by a bitset on the _internal_ Lucene ID. UniqueId has no
> >>>>>>> bearing here whatsoever. This is, in a nutshell, why warming is
> >>>>>>> required: the internal Lucene IDs may change. Note also that it's
> >>>>>>> maxDoc; the internal arrays have "holes" for deleted documents.
> >>>>>>>
> >>>>>>> Note this is an _upper_ bound; if there are only a few docs that
> >>>>>>> match, the size will be (num of matching docs) * sizeof(int).
> >>>>>>>
> >>>>>>> fieldValueCache: I don't think so, although I'm a bit fuzzy on this.
> >>>>>>> It depends on whether these are "per-segment" caches or not. Any
> >>>>>>> "per-segment" cache is still valid.
> >>>>>>>
> >>>>>>> Think of documentCache as intended to hold the stored fields while
> >>>>>>> various components operate on them, thus avoiding repeatedly fetching
> >>>>>>> the data from disk. It's _usually_ not too big a worry.
> >>>>>>>
> >>>>>>> About hard commits once a day: that's _extremely_ long. Think instead
> >>>>>>> of committing more frequently with openSearcher=false. If nothing
> >>>>>>> else, your transaction log will grow lots and lots and lots. I'm
> >>>>>>> thinking on the order of 15 minutes, or possibly even much less, with
> >>>>>>> softCommits happening more often, maybe every 15 seconds. In fact, I'd
> >>>>>>> start out with soft commits every 15 seconds and hard commits
> >>>>>>> (openSearcher=false) every 5 minutes. The problem with hard commits
> >>>>>>> happening once a day is that, if for any reason the server is
> >>>>>>> interrupted, on startup Solr will try to replay the entire transaction
> >>>>>>> log to assure index integrity. Not to mention that your tlog will be
> >>>>>>> huge, and that there is some memory usage for each document in the
> >>>>>>> tlog. Hard commits roll over the tlog, flush the in-memory tlog
> >>>>>>> pointers, close index segments, etc.
> >>>>>>>
> >>>>>>> Best
> >>>>>>> Erick
> >>>>>>>
> >>>>>>> On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh <isaac.he...@gmail.com> wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am going to build a big Solr (4.0?) index, which holds some dozens
> >>>>>>>> of millions of documents. Each document has some dozens of fields,
> >>>>>>>> and one big textual field.
> >>>>>>>> The queries on the index are non-trivial and a little bit long (might
> >>>>>>>> be hundreds of terms). No query is identical to another.
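(Stepping outside the quoted thread for a moment: the cadence Erick suggests — soft commits every 15 seconds, hard commits with openSearcher=false every 5 minutes — maps onto the updateHandler section of solrconfig.xml roughly as below. The intervals are his suggested starting points; treat this as a sketch, not a drop-in config.)

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: writes segments and rolls over the transaction log.
       openSearcher=false keeps the current searcher (and its caches). -->
  <autoCommit>
    <maxTime>300000</maxTime>            <!-- 5 minutes -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: opens a new searcher so fresh documents become visible. -->
  <autoSoftCommit>
    <maxTime>15000</maxTime>             <!-- 15 seconds -->
  </autoSoftCommit>
</updateHandler>
```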
> >>>>>>>>
> >>>>>>>> Now, I want to analyze the cache performance (before setting up the
> >>>>>>>> whole environment), in order to estimate how much RAM I will need.
> >>>>>>>>
> >>>>>>>> filterCache:
> >>>>>>>> In my scenario, every query has some filters. Let's say that each
> >>>>>>>> filter matches 1M documents, out of 10M. Should the estimated memory
> >>>>>>>> usage be 1M * sizeof(uniqueId) * num-of-filters-in-cache?
> >>>>>>>>
> >>>>>>>> fieldValueCache:
> >>>>>>>> Due to the difference between queries, I guess that fieldValueCache
> >>>>>>>> is the most important factor in query performance. Here comes a
> >>>>>>>> generic question: I'm indexing new documents to the index constantly.
> >>>>>>>> Soft commits will be performed every 10 mins. Does that mean the
> >>>>>>>> cache is meaningless after every 10 minutes?
> >>>>>>>>
> >>>>>>>> documentCache:
> >>>>>>>> enableLazyFieldLoading will be enabled, and "fl" contains a very
> >>>>>>>> small set of fields. BUT, I need to return highlighting on about
> >>>>>>>> (possibly) 20 fields. Does the highlighting component use the
> >>>>>>>> documentCache? I guess that highlighting requires the whole field to
> >>>>>>>> be loaded into the documentCache.
> >>>>>>>> Will it happen only for fields that matched a term from the query?
> >>>>>>>>
> >>>>>>>> And one more question: I'm planning to hard-commit once a day.
> >>>>>>>> Should I prepare for significant RAM usage growth between hard
> >>>>>>>> commits? (Consider a lot of new documents in this period...)
> >>>>>>>> Does this RAM come from the same pool as the caches? Can an
> >>>>>>>> OutOfMemory exception happen in this scenario?
> >>>>>>>>
> >>>>>>>> Thanks a lot.
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >
> > --
> > Walter Underwood
> > wun...@wunderwood.org
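Tying the thread together: Erick's filterCache bound can be checked against Isaac's 1M-out-of-10M example with quick arithmetic. This is a sketch; the 512-entry cache size is an assumed, illustrative setting, not something from the thread.

```python
# Quick arithmetic for Erick's filterCache bound, using Isaac's figures
# (10M docs in the index, each filter matching 1M of them).

max_doc = 10_000_000        # internal Lucene IDs, including deleted docs
num_filters = 512           # assumed filterCache size (entries)

# Dense entry: one bit per document in the index, regardless of how
# many docs the filter matches. sizeof(uniqueId) plays no role here.
dense_bytes = max_doc // 8                # 1_250_000 bytes (~1.2 MB) per filter

# Worst-case cache footprint if every entry is dense.
upper_bound = dense_bytes * num_filters   # 640_000_000 bytes (~610 MB)

# Sparse entry: sizeof(int) bytes per matching doc. Only smaller than
# the bitset when fewer than max_doc / 32 documents match.
sparse_bytes = 1_000_000 * 4              # 4_000_000 bytes -- bigger than dense!

print(dense_bytes, upper_bound, sparse_bytes)
```

So for filters matching 1M of 10M docs, each cached filter costs about 1.2 MB as a bitset, not 1M * sizeof(uniqueId); the sparse int-per-match form only wins for much more selective filters.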