Re: Solr cache considerations

Isaac Hebsh Sun, 20 Jan 2013 12:47:48 -0800

Wow Erick, the MMap article is a very fundamental one. It totally changed my
view. It should be mentioned in SolrPerformanceFactors in the SolrWiki...
I'm sorry I did not know about it before.
Thanks a lot.
I promise to share my results when my cart starts to fly :)



On Sun, Jan 20, 2013 at 6:08 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> About your question about document cache: Typically the document cache
> has a pretty low hit-ratio. I've rarely, if ever, seen it get hit very
> often. And remember that this cache is only hit when assembling the
> response for a few documents (your page size).
>
> Bottom line: I wouldn't worry about this cache much. It's quite useful
> for processing a particular query faster, but not really intended for
> cross-query use.
>
> Really, I think you're putting the cart before the horse here. Run it
> up the flagpole and try it. Rely on the OS to do its job
> (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
> Find a bottleneck _then_ tune. Premature optimization and all
> that....
>
> Several tens of millions of docs isn't that large unless the text
> fields are enormous.
>
> Best
> Erick
>
> On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh <isaac.he...@gmail.com>
> wrote:
> > Ok. Thank you everyone for your helpful answers.
> > I understand that fieldValueCache is not used for resolving queries.
> > Is there any cache that can help this basic scenario (a lot of different
> > queries, on a small set of fields)?
> > Does Lucene's FieldCache help (implicitly)?
> > How can I use RAM to reduce I/O for this type of query?
> >
> >
> > On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe <
> > tomasflo...@gmail.com> wrote:
> >
> >> No, the fieldValueCache is not used for resolving queries. Only for
> >> multi-token faceting and apparently for the stats component too. The
> >> document cache maintains in memory the stored content of the fields you
> >> are retrieving or highlighting on. It'll hit if the same document matches
> >> the query multiple times and the same fields are requested, but as Erick
> >> said, it is important for cases when multiple components in the same
> >> request need to access the same data.
> >>
> >> I think soft committing every 10 minutes is totally fine, but you should
> >> hard commit more often if you are going to be using the transaction log.
> >> openSearcher=false will essentially tell Solr not to open a new searcher
> >> after the (hard) commit, so you won't see the newly indexed data and
> >> caches won't be flushed. openSearcher=false makes sense when you are using
> >> hard commits together with soft commits: since the soft commit is dealing
> >> with opening/closing searchers, you don't need hard commits to do it.
> >>
> >> Tomás
> >>
> >>
> >> On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh <isaac.he...@gmail.com>
> >> wrote:
> >>
> >> > Unfortunately, it seems (
> >> > http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html)
> >> > that these caches are not per-segment. In this case, I want to (soft)
> >> > commit less frequently. Am I right?
> >> >
> >> > Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I
> >> > guess it contributes a lot to the time of standard (not only faceted)
> >> > queries. SolrWiki claims that it is primarily used by faceting. What does
> >> > that say about complex textual queries?
> >> >
> >> > documentCache:
> >> > Erick, after query processing is finished, don't some documents stay in
> >> > the documentCache? Can't I use it to accelerate queries that should
> >> > retrieve stored fields of documents? In this case, a big documentCache
> >> > can hold more documents...
> >> >
> >> > About commit frequency:
> >> > HardCommit: "openSearcher=false" seems like a nice solution. Where can I
> >> > read about this? (I found nothing but one unexplained sentence in
> >> > SolrWiki.)
> >> > SoftCommit: In my case, the required index freshness is 10 minutes. The
> >> > plan to soft commit every 10 minutes is similar to storing all of the
> >> > documents in a queue (outside of Solr), and indexing a bulk every 10
> >> > minutes.
> >> >
> >> > Thanks.
> >> >
> >> >
> >> > On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe <
> >> > tomasflo...@gmail.com> wrote:
> >> >
> >> > > I think fieldValueCache is not per segment, only fieldCache is.
> >> > > However, unless I'm missing something, this cache is only used for
> >> > > faceting on multivalued fields.
> >> > >
> >> > >
> >> > > On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson
> >> > > <erickerick...@gmail.com> wrote:
> >> > >
> >> > > > filterCache: This is bounded by (maxDoc / 8) bytes * (num filters in
> >> > > > cache). Notice the /8. This reflects the fact that the filters are
> >> > > > represented by a bitset on the _internal_ Lucene ID. UniqueId has no
> >> > > > bearing here whatsoever. This is, in a nutshell, why warming is
> >> > > > required: the internal Lucene IDs may change. Note also that it's
> >> > > > maxDoc; the internal arrays have "holes" for deleted documents.
> >> > > >
> >> > > > Note this is an _upper_ bound; if there are only a few docs that
> >> > > > match, the size will be (num of matching docs) * sizeof(int).
> >> > > >
> >> > > > fieldValueCache. I don't think so, although I'm a bit fuzzy on this.
> >> > > > It depends on whether these are "per-segment" caches or not. Any
> >> > > > "per-segment" cache is still valid.
> >> > > >
> >> > > > Think of documentCache as intended to hold the stored fields while
> >> > > > various components operate on it, thus avoiding repeatedly fetching
> >> > > > the data from disk. It's _usually_ not too big a worry.
> >> > > >
> >> > > > About hard-commits once a day. That's _extremely_ long. Think instead
> >> > > > of committing more frequently with openSearcher=false. If nothing
> >> > > > else, your transaction log will grow lots and lots and lots. I'm
> >> > > > thinking on the order of 15 minutes, or possibly even much less, with
> >> > > > softCommits happening more often, maybe every 15 seconds. In fact, I'd
> >> > > > start out with soft commits every 15 seconds and hard commits
> >> > > > (openSearcher=false) every 5 minutes. The problem with hard commits
> >> > > > being once a day is that, if for any reason the server is interrupted,
> >> > > > on startup Solr will try to replay the entire transaction log to
> >> > > > assure index integrity. Not to mention that your tlog will be huge.
> >> > > > Not to mention that there is some memory usage for each document in
> >> > > > the tlog. Hard commits roll over the tlog, flush the in-memory tlog
> >> > > > pointers, close index segments, etc.
> >> > > >
> >> > > > Best
> >> > > > Erick
> >> > > >
> >> > > > On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh <isaac.he...@gmail.com>
> >> > > > wrote:
> >> > > > > Hi,
> >> > > > >
> >> > > > > I am going to build a big Solr (4.0?) index, which holds some
> >> > > > > dozens of millions of documents. Each document has some dozens of
> >> > > > > fields, and one big textual field.
> >> > > > > The queries on the index are non-trivial, and a little bit long
> >> > > > > (might be hundreds of terms). No query is identical to another.
> >> > > > >
> >> > > > > Now, I want to analyze the cache performance (before setting up the
> >> > > > > whole environment), in order to estimate how much RAM I will need.
> >> > > > >
> >> > > > > filterCache:
> >> > > > > In my scenario, every query has some filters. Let's say that each
> >> > > > > filter matches 1M documents, out of 10M. Should the estimated
> >> > > > > memory usage be
> >> > > > > 1M * sizeof(uniqueId) * num-of-filters-in-cache?
> >> > > > >
> >> > > > > fieldValueCache:
> >> > > > > Due to the difference between queries, I guess that fieldValueCache
> >> > > > > is the most important factor in query performance. Here comes a
> >> > > > > generic question:
> >> > > > > I'm indexing new documents to the index constantly. Soft commits
> >> > > > > will be performed every 10 minutes. Does that mean the cache
> >> > > > > becomes meaningless every 10 minutes?
> >> > > > >
> >> > > > > documentCache:
> >> > > > > enableLazyFieldLoading will be enabled, and "fl" contains a very
> >> > > > > small set of fields. BUT, I need to return highlighting on about
> >> > > > > (possibly) 20 fields. Does the highlighting component use the
> >> > > > > documentCache? I guess that highlighting requires the whole field
> >> > > > > to be loaded into the documentCache.
> >> > > > > Will it happen only for fields that matched a term from the query?
> >> > > > >
> >> > > > > And one more question: I'm planning to hard-commit once a day.
> >> > > > > Should I prepare for significant RAM usage growth between
> >> > > > > hard-commits? (Consider a lot of new documents in this period...)
> >> > > > > Does this RAM come from the same pool as the caches? Can an
> >> > > > > OutOfMemory exception happen in this scenario?
> >> > > > >
> >> > > > > Thanks a lot.
> >> > > >
> >> > >
> >> >
> >>
>
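
For anyone sizing the filterCache from the discussion quoted above, here is a
rough back-of-the-envelope sketch in Python. It reads the bound as maxDoc / 8
bytes per cached filter (one bit per internal Lucene doc ID), and about
4 bytes per matching doc when only a few docs match. Taking the smaller of the
two sizes is my simplifying assumption, not Solr's actual switching rule, and
per-entry object overhead is ignored. The cache size of 512 entries is a
hypothetical value; the 10M / 1M figures come from the original question.

```python
# Rough filterCache sizing, following the bound described in the thread.
# Assumptions (not taken from Solr's source): a bitset entry costs one bit per
# doc up to maxDoc, a small entry costs 4 bytes per matching doc, and the
# cheaper of the two representations is used. Object overhead is ignored.

def bitset_bytes(max_doc: int) -> int:
    """Upper bound for one cached filter: one bit per internal doc ID."""
    return (max_doc + 7) // 8


def int_list_bytes(num_matches: int, int_size: int = 4) -> int:
    """Size of a filter stored as a list of matching doc IDs."""
    return num_matches * int_size


def filter_entry_bytes(max_doc: int, num_matches: int) -> int:
    """Estimated size of one cached filter (simplified: cheaper form wins)."""
    return min(bitset_bytes(max_doc), int_list_bytes(num_matches))


def filter_cache_bytes(max_doc: int, matches_per_filter: list[int]) -> int:
    """Estimated total size of the cache for the given filters."""
    return sum(filter_entry_bytes(max_doc, m) for m in matches_per_filter)


if __name__ == "__main__":
    max_doc = 10_000_000          # from the question: 10M docs in the index
    cached_filters = 512          # hypothetical number of cached filters
    matches = [1_000_000] * cached_filters  # each filter matches 1M docs
    total = filter_cache_bytes(max_doc, matches)
    print(f"per filter: {filter_entry_bytes(max_doc, 1_000_000)} bytes")
    print(f"whole cache: {total / 1e6:.0f} MB")
```

Note that the uniqueKey size never enters the estimate: only maxDoc and the
number of cached filters matter for the upper bound.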
