Re: Solr cache considerations

Otis Gospodnetic Wed, 23 Jan 2013 11:20:20 -0800

I think the attachment got stripped.  Here it is:
http://www.flickr.com/photos/otis/8409088080/in/photostream


Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Tue, Jan 22, 2013 at 12:36 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Same here - I've seen some document caches that were huge and highly
> utilized.  Check out the screenshot of the SPM for Solr dashboard that
> shows pretty high hit rates on all caches.  I've circled the parts to look
> at.  ML manager may strip the attachment, of course. :)
>
> In addition to multiple in-request lookups and hits in document cache,
> document caches provide value when queries are frequently somewhat similar
> and thus return some of the same hits as previous queries.
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Mon, Jan 21, 2013 at 1:39 PM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
>> Hmm, interesting. I'll have to look closer...
>>
>> On Sun, Jan 20, 2013 at 3:50 PM, Walter Underwood <wun...@wunderwood.org>
>> wrote:
>> > I routinely see hit rates over 75% on the document cache. Perhaps yours
>> > is too small. Mine is set at 10240 entries.
>> >
>> > wunder
>> >
>> > On Jan 20, 2013, at 8:08 AM, Erick Erickson wrote:
>> >
>> >> About your question about document cache: Typically the document cache
>> >> has a pretty low hit-ratio. I've rarely, if ever, seen it get hit very
>> >> often. And remember that this cache is only hit when assembling the
>> >> response for a few documents (your page size).
>> >>
>> >> Bottom line: I wouldn't worry about this cache much. It's quite useful
>> >> for processing a particular query faster, but not really intended for
>> >> cross-query use.
>> >>
>> >> Really, I think you're putting the cart before the horse here. Run it
>> >> up the flagpole and try it. Rely on the OS to do its job
>> >> (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
>> >> Find a bottleneck _then_ tune. Premature optimization and all
>> >> that....
>> >>
>> >> Several tens of millions of docs isn't that large unless the text
>> >> fields are enormous.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh <isaac.he...@gmail.com>
>> >> wrote:
>> >>> Ok. Thank you everyone for your helpful answers.
>> >>> I understand that fieldValueCache is not used for resolving queries.
>> >>> Is there any cache that can help in this basic scenario (a lot of
>> >>> different queries on a small set of fields)?
>> >>> Does Lucene's FieldCache help (implicitly)?
>> >>> How can I use RAM to reduce I/O for this type of query?
>> >>>
>> >>>
>> >>> On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe <
>> >>> tomasflo...@gmail.com> wrote:
>> >>>
>> >>>> No, the fieldValueCache is not used for resolving queries. Only for
>> >>>> multi-token faceting and apparently for the stats component too. The
>> >>>> document cache keeps in memory the stored content of the fields you
>> >>>> are retrieving or highlighting on. It'll hit if the same document
>> >>>> matches the query multiple times and the same fields are requested,
>> >>>> but as Erick said, it is important for cases when multiple components
>> >>>> in the same request need to access the same data.
>> >>>>
>> >>>> I think soft committing every 10 minutes is totally fine, but you
>> >>>> should hard commit more often if you are going to be using the
>> >>>> transaction log. openSearcher=false essentially tells Solr not to open
>> >>>> a new searcher after the (hard) commit, so you won't see the newly
>> >>>> indexed data and caches won't be flushed. openSearcher=false makes
>> >>>> sense when you are using hard commits together with soft commits: since
>> >>>> the soft commit deals with opening/closing searchers, you don't need
>> >>>> hard commits to do it.
>> >>>>
>> >>>> Tomás
>> >>>>
>> >>>>
>> >>>> On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh <isaac.he...@gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>>> Unfortunately, it seems
>> >>>>> (http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html)
>> >>>>> that these caches are not per-segment. In this case, I want to (soft)
>> >>>>> commit less frequently. Am I right?
>> >>>>>
>> >>>>> Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I
>> >>>>> guess it contributes a lot to the time of standard (not only faceted)
>> >>>>> queries. SolrWiki claims that it is primarily used by faceting. What
>> >>>>> does that say about complex textual queries?
>> >>>>>
>> >>>>> documentCache:
>> >>>>> Erick, after query processing is finished, don't some documents stay in
>> >>>>> the documentCache? Can't I use it to accelerate queries that need to
>> >>>>> retrieve stored fields of documents? In this case, a big documentCache
>> >>>>> can hold more documents.
>> >>>>>
>> >>>>> About commit frequency:
>> >>>>> HardCommit: "openSearcher=false" seems like a nice solution. Where can
>> >>>>> I read about this? (I found nothing but one unexplained sentence in
>> >>>>> SolrWiki.)
>> >>>>> SoftCommit: In my case, the required index freshness is 10 minutes.
>> >>>>> The plan to soft commit every 10 minutes is similar to storing all of
>> >>>>> the documents in a queue (outside of Solr) and indexing a batch every
>> >>>>> 10 minutes.
>> >>>>>
>> >>>>> Thanks.
>> >>>>>
>> >>>>>
>> >>>>> On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe <
>> >>>>> tomasflo...@gmail.com> wrote:
>> >>>>>
>> >>>>>> I think fieldValueCache is not per segment, only fieldCache is.
>> >>>>>> However, unless I'm missing something, this cache is only used for
>> >>>>>> faceting on multivalued fields.
>> >>>>>>
>> >>>>>>
>> >>>>>> On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson
>> >>>>>> <erickerick...@gmail.com> wrote:
>> >>>>>>
>> >>>>>>> filterCache: This is bounded by (maxDoc / 8) bytes * (num filters in
>> >>>>>>> cache). Notice the /8. This reflects the fact that the filters are
>> >>>>>>> represented by a bitset on the _internal_ Lucene ID. UniqueId has no
>> >>>>>>> bearing here whatsoever. This is, in a nutshell, why warming is
>> >>>>>>> required: the internal Lucene IDs may change. Note also that it's
>> >>>>>>> maxDoc; the internal arrays have "holes" for deleted documents.
>> >>>>>>>
>> >>>>>>> Note this is an _upper_ bound. If there are only a few docs that
>> >>>>>>> match, the size will be (num of matching docs) * sizeof(int).
>> >>>>>>>
>> >>>>>>> fieldValueCache. I don't think so, although I'm a bit fuzzy on this.
>> >>>>>>> It depends on whether these are "per-segment" caches or not. Any
>> >>>>>>> "per segment" cache is still valid.
>> >>>>>>>
>> >>>>>>> Think of documentCache as intended to hold the stored fields while
>> >>>>>>> various components operate on it, thus avoiding repeatedly fetching
>> >>>>>>> the data from disk. It's _usually_ not too big a worry.
>> >>>>>>>
>> >>>>>>> About hard-commits once a day. That's _extremely_ long. Think instead
>> >>>>>>> of committing more frequently with openSearcher=false. If nothing
>> >>>>>>> else, your transaction log will grow lots and lots and lots. I'm
>> >>>>>>> thinking on the order of 15 minutes, or possibly even much less, with
>> >>>>>>> softCommits happening more often, maybe every 15 seconds. In fact,
>> >>>>>>> I'd start out with soft commits every 15 seconds and hard commits
>> >>>>>>> (openSearcher=false) every 5 minutes. The problem with hard commits
>> >>>>>>> being once a day is that, if for any reason the server is
>> >>>>>>> interrupted, on startup Solr will try to replay the entire
>> >>>>>>> transaction log to assure index integrity. Not to mention that your
>> >>>>>>> tlog will be huge. Not to mention that there is some memory usage for
>> >>>>>>> each document in the tlog. Hard commits roll over the tlog, flush the
>> >>>>>>> in-memory tlog pointers, close index segments, etc.
>> >>>>>>>
>> >>>>>>> Best
>> >>>>>>> Erick
>> >>>>>>>
>> >>>>>>> On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh
>> >>>>>>> <isaac.he...@gmail.com> wrote:
>> >>>>>>>> Hi,
>> >>>>>>>>
>> >>>>>>>> I am going to build a big Solr (4.0?) index, which holds some tens of
>> >>>>>>>> millions of documents. Each document has some dozens of fields, and
>> >>>>>>>> one big textual field.
>> >>>>>>>> The queries on the index are non-trivial, and a little bit long
>> >>>>>>>> (might be hundreds of terms). No query is identical to another.
>> >>>>>>>>
>> >>>>>>>> Now, I want to analyze the cache performance (before setting up the
>> >>>>>>>> whole environment), in order to estimate how much RAM I will need.
>> >>>>>>>>
>> >>>>>>>> filterCache:
>> >>>>>>>> In my scenario, every query has some filters. Let's say that each
>> >>>>>>>> filter matches 1M documents, out of 10M. Should the estimated memory
>> >>>>>>>> usage be 1M * sizeof(uniqueId) * num-of-filters-in-cache?
>> >>>>>>>>
>> >>>>>>>> fieldValueCache:
>> >>>>>>>> Due to the difference between queries, I guess that fieldValueCache
>> >>>>>>>> is the most important factor in query performance. Here comes a
>> >>>>>>>> generic question: I'm indexing new documents to the index constantly.
>> >>>>>>>> Soft commits will be performed every 10 mins. Does this mean that the
>> >>>>>>>> cache becomes meaningless every 10 minutes?
>> >>>>>>>>
>> >>>>>>>> documentCache:
>> >>>>>>>> enableLazyFieldLoading will be enabled, and "fl" contains a very
>> >>>>>>>> small set of fields. BUT, I need to return highlighting on about
>> >>>>>>>> (possibly) 20 fields. Does the highlighting component use the
>> >>>>>>>> documentCache? I guess that highlighting requires the whole field to
>> >>>>>>>> be loaded into the documentCache. Will it happen only for fields that
>> >>>>>>>> matched a term from the query?
>> >>>>>>>>
>> >>>>>>>> And one more question: I'm planning to hard-commit once a day. Should
>> >>>>>>>> I prepare for significant RAM usage growth between hard-commits?
>> >>>>>>>> (Consider a lot of new documents in this period...)
>> >>>>>>>> Does this RAM come from the same pool as the caches? Can an
>> >>>>>>>> OutOfMemory exception happen in this scenario?
>> >>>>>>>>
>> >>>>>>>> Thanks a lot.
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >
>> > --
>> > Walter Underwood
>> > wun...@wunderwood.org
>> >
>> >
>> >
>>
>
>
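
As a rough sketch of the filterCache sizing discussed in the quoted thread:
the worst case is one bitset per cached filter, with one bit per internal
Lucene doc ID up to maxDoc, while a filter that matches only a few documents
is assumed to cost about one int per matching doc. The function names and
the 512-entry cache size below are illustrative assumptions, not Solr APIs
or settings taken from the thread.

```python
def filter_cache_upper_bound_bytes(max_doc: int, num_filters: int) -> int:
    """Worst case: every cached filter is a bitset with one bit per doc ID."""
    bytes_per_bitset = (max_doc + 7) // 8  # round up to whole bytes
    return bytes_per_bitset * num_filters


def sparse_filter_bytes(num_matching_docs: int, int_size: int = 4) -> int:
    """Small result sets: roughly one int (assumed 4 bytes) per matching doc."""
    return num_matching_docs * int_size


if __name__ == "__main__":
    max_doc = 10_000_000   # index size from the original question
    num_filters = 512      # hypothetical filterCache size
    total = filter_cache_upper_bound_bytes(max_doc, num_filters)
    print(f"per filter: {total // num_filters} bytes, cache total: {total} bytes")
```

For the 10M-document index in the original question this works out to about
1.25 MB per cached filter in the worst case, regardless of the uniqueId type.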

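
The point made in the quoted thread, that the document cache pays off across
queries when different queries return some of the same documents, can be
illustrated with a toy LRU cache. This is a simplified model for intuition
only, not Solr's documentCache implementation; ToyDocumentCache, replay, the
doc IDs, and the capacities are all hypothetical.

```python
from collections import OrderedDict


class ToyDocumentCache:
    """Minimal LRU cache keyed by doc ID that counts lookups and hits."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[int, str]" = OrderedDict()
        self.lookups = 0
        self.hits = 0

    def fetch(self, doc_id: int) -> str:
        self.lookups += 1
        if doc_id in self.entries:
            self.hits += 1
            self.entries.move_to_end(doc_id)  # mark as most recently used
            return self.entries[doc_id]
        value = f"stored fields of doc {doc_id}"  # stands in for a disk read
        self.entries[doc_id] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

    def hit_ratio(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0


def replay(result_pages, capacity: int) -> ToyDocumentCache:
    """Fetch every doc of every result page through a fresh cache."""
    cache = ToyDocumentCache(capacity)
    for page in result_pages:
        for doc_id in page:
            cache.fetch(doc_id)
    return cache


if __name__ == "__main__":
    # Three different queries whose result pages partly overlap.
    pages = [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 1, 2]]
    for capacity in (10, 2):
        print(capacity, replay(pages, capacity).hit_ratio())
```

With enough capacity, the overlapping pages produce hits even though no query
repeats; shrinking the cache lowers the hit ratio, which is consistent with
the suggestion above that a low hit rate may mean the cache is too small.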