Hmm, interesting. I'll have to look closer...
On Sun, Jan 20, 2013 at 3:50 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> I routinely see hit rates over 75% on the document cache. Perhaps yours is
> too small. Mine is set at 10240 entries.
>
> wunder
>
> On Jan 20, 2013, at 8:08 AM, Erick Erickson wrote:
>
>> About your question on the document cache: typically the document cache
>> has a pretty low hit ratio. I've rarely, if ever, seen it get hit very
>> often. And remember that this cache is only hit when assembling the
>> response for a few documents (your page size).
>>
>> Bottom line: I wouldn't worry about this cache much. It's quite useful
>> for processing a particular query faster, but it isn't really intended
>> for cross-query use.
>>
>> Really, I think you're putting the cart before the horse here. Run it up
>> the flagpole and try it. Rely on the OS to do its job
>> (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
>> Find a bottleneck _then_ tune. Premature optimization and all that....
>>
>> Several tens of millions of docs isn't that large unless the text fields
>> are enormous.
>>
>> Best
>> Erick
>>
>> On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh <isaac.he...@gmail.com> wrote:
>>> Ok. Thank you everyone for your helpful answers.
>>> I understand that the fieldValueCache is not used for resolving queries.
>>> Is there any cache that can help this basic scenario (a lot of different
>>> queries over a small set of fields)?
>>> Does Lucene's FieldCache help (implicitly)?
>>> How can I use RAM to reduce I/O for this type of query?
>>>
>>> On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe
>>> <tomasflo...@gmail.com> wrote:
>>>
>>>> No, the fieldValueCache is not used for resolving queries, only for
>>>> multi-token faceting and apparently for the stats component too. The
>>>> document cache keeps in memory the stored content of the fields you are
>>>> retrieving or highlighting on. It will hit if the same document matches
>>>> the query multiple times and the same fields are requested, but as
>>>> Erick said, it is mainly important when multiple components in the same
>>>> request need to access the same data.
>>>>
>>>> I think soft committing every 10 minutes is totally fine, but you
>>>> should hard commit more often if you are going to be using the
>>>> transaction log. openSearcher=false essentially tells Solr not to open
>>>> a new searcher after the (hard) commit, so you won't see the newly
>>>> indexed data and the caches won't be flushed. openSearcher=false makes
>>>> sense when you are using hard commits together with soft commits: since
>>>> the soft commit is already dealing with opening and closing searchers,
>>>> you don't need hard commits to do it.
>>>>
>>>> Tomás
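For concreteness, a minimal solrconfig.xml sketch of the hard/soft-commit split Tomás describes above: frequent hard commits with openSearcher=false so the transaction log stays small and caches are not flushed, plus soft commits on the 10-minute freshness interval discussed in this thread. The 5-minute hard-commit interval and the updateLog directory are illustrative assumptions, not settings taken from the thread.

<!-- Sketch only: intervals are illustrative assumptions, tune them for your
     own tlog size and freshness needs. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

  <!-- Hard commit every 5 minutes: rolls over the tlog and closes segments,
       but openSearcher=false means no new searcher is opened, so caches
       stay warm and the newly indexed data is not yet visible. -->
  <autoCommit>
    <maxTime>300000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit every 10 minutes: opens a new searcher, so this interval
       alone controls index freshness (and cache invalidation). -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>
  </autoSoftCommit>
</updateHandler>

With this split, the soft commit is the only event that opens a new searcher, which is exactly why openSearcher=false on the hard commit is safe.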
>>>> On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh <isaac.he...@gmail.com>
>>>> wrote:
>>>>
>>>>> Unfortunately, it seems
>>>>> (http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html)
>>>>> that these caches are not per-segment. In this case, I want to (soft)
>>>>> commit less frequently. Am I right?
>>>>>
>>>>> Tomás, since the fieldValueCache is very similar to Lucene's
>>>>> FieldCache, I guess it contributes a lot to the time of standard (not
>>>>> only faceted) queries. The SolrWiki claims that it is primarily used
>>>>> by faceting. What does that say about complex textual queries?
>>>>>
>>>>> documentCache:
>>>>> Erick, after query processing finishes, don't some documents stay in
>>>>> the documentCache? Can't I use it to accelerate queries that need to
>>>>> retrieve stored fields of documents? In that case, a bigger
>>>>> documentCache could hold more documents.
>>>>>
>>>>> About commit frequency:
>>>>> Hard commit: "openSearcher=false" seems like a nice solution. Where
>>>>> can I read about this? (I found nothing but one unexplained sentence
>>>>> in the SolrWiki.)
>>>>> Soft commit: in my case, the required index freshness is 10 minutes.
>>>>> Soft committing every 10 minutes is similar to storing all of the
>>>>> documents in a queue (outside of Solr) and indexing them in bulk every
>>>>> 10 minutes.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe
>>>>> <tomasflo...@gmail.com> wrote:
>>>>>
>>>>>> I think the fieldValueCache is not per segment; only the fieldCache
>>>>>> is. However, unless I'm missing something, this cache is only used
>>>>>> for faceting on multivalued fields.
>>>>>>
>>>>>> On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson
>>>>>> <erickerick...@gmail.com> wrote:
>>>>>>
>>>>>>> filterCache: this is bounded by (maxDoc / 8) bytes * (num filters in
>>>>>>> cache). Notice the /8: it reflects the fact that each filter is
>>>>>>> represented by a bitset over the _internal_ Lucene IDs. The uniqueId
>>>>>>> has no bearing here whatsoever. This is, in a nutshell, why warming
>>>>>>> is required: the internal Lucene IDs may change. Note also that it's
>>>>>>> maxDoc; the internal arrays have "holes" for deleted documents.
>>>>>>>
>>>>>>> Note this is an _upper_ bound. If only a few docs match, the size
>>>>>>> will be (num of matching docs) * sizeof(int).
>>>>>>>
>>>>>>> fieldValueCache: I don't think so, although I'm a bit fuzzy on this.
>>>>>>> It depends on whether these are "per-segment" caches or not. Any
>>>>>>> per-segment cache is still valid.
>>>>>>>
>>>>>>> Think of the documentCache as intended to hold the stored fields
>>>>>>> while various components operate on them, thus avoiding repeated
>>>>>>> fetches of the data from disk. It's _usually_ not too big a worry.
>>>>>>>
>>>>>>> About hard commits once a day: that's _extremely_ long. Think
>>>>>>> instead of committing more frequently with openSearcher=false. If
>>>>>>> nothing else, your transaction log will grow lots and lots and lots.
>>>>>>> I'm thinking on the order of 15 minutes, or possibly even much less,
>>>>>>> with softCommits happening more often, maybe every 15 seconds. In
>>>>>>> fact, I'd start out with soft commits every 15 seconds and hard
>>>>>>> commits (openSearcher=false) every 5 minutes. The problem with hard
>>>>>>> commits happening once a day is that, if for any reason the server
>>>>>>> is interrupted, on startup Solr will try to replay the entire
>>>>>>> transaction log to ensure index integrity. Not to mention that your
>>>>>>> tlog will be huge, and that there is some memory usage for each
>>>>>>> document in the tlog. Hard commits roll over the tlog, flush the
>>>>>>> in-memory tlog pointers, close index segments, etc.
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
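To put Erick's bound in configuration terms, here is a sketch of the filterCache and fieldValueCache declarations that live in the <query> section of solrconfig.xml. The size and autowarmCount values are illustrative assumptions, not recommendations from this thread; the comment works the maxDoc/8 estimate through for the 10M-document index described in the original question quoted below.

<!-- Sketch only: sizes and autowarm counts are illustrative assumptions.
     These elements belong inside the <query> section of solrconfig.xml. -->

<!-- Each cached filter is a bitset over maxDoc internal Lucene IDs, i.e.
     roughly maxDoc/8 bytes in the worst case. For a 10M-document index that
     is about 1.25 MB per entry, so size="512" bounds this cache at roughly
     640 MB. -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="128"/>

<!-- Used for multi-token faceting (and the stats component), not for
     resolving ordinary queries. -->
<fieldValueCache class="solr.FastLRUCache"
                 size="512"
                 autowarmCount="0"
                 showItems="32"/>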
>>>>>>> On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh <isaac.he...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am going to build a big Solr (4.0?) index, which will hold some
>>>>>>>> dozens of millions of documents. Each document has some dozens of
>>>>>>>> fields and one big textual field.
>>>>>>>> The queries on the index are non-trivial and somewhat long (they
>>>>>>>> might contain hundreds of terms). No query is identical to another.
>>>>>>>>
>>>>>>>> Now, I want to analyze the cache performance (before setting up the
>>>>>>>> whole environment) in order to estimate how much RAM I will need.
>>>>>>>>
>>>>>>>> filterCache:
>>>>>>>> In my scenario, every query has some filters. Let's say that each
>>>>>>>> filter matches 1M documents out of 10M. Should the estimated memory
>>>>>>>> usage be 1M * sizeof(uniqueId) * num-of-filters-in-cache?
>>>>>>>>
>>>>>>>> fieldValueCache:
>>>>>>>> Because no two queries are alike, I guess that the fieldValueCache
>>>>>>>> is the most important factor in query performance. Here comes a
>>>>>>>> more general question: I'm constantly indexing new documents, and
>>>>>>>> soft commits will be performed every 10 minutes. Does that mean the
>>>>>>>> cache is meaningless after every 10 minutes?
>>>>>>>>
>>>>>>>> documentCache:
>>>>>>>> enableLazyFieldLoading will be enabled, and "fl" contains a very
>>>>>>>> small set of fields. BUT I need to return highlighting on about
>>>>>>>> (possibly) 20 fields. Does the highlighting component use the
>>>>>>>> documentCache? I guess that highlighting requires the whole field
>>>>>>>> to be loaded into the documentCache. Will that happen only for
>>>>>>>> fields that matched a term from the query?
>>>>>>>>
>>>>>>>> And one more question: I'm planning to hard commit once a day.
>>>>>>>> Should I prepare for significant RAM usage growth between hard
>>>>>>>> commits? (Consider a lot of new documents in this period...)
>>>>>>>> Does this RAM come from the same pool as the caches? Can an
>>>>>>>> OutOfMemory exception happen in this scenario?
>>>>>>>>
>>>>>>>> Thanks a lot.
>
> --
> Walter Underwood
> wun...@wunderwood.org
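Finally, tying Walter's documentCache report back to the original question about lazy field loading and highlighting, here is a sketch of the related settings, also from the <query> section of solrconfig.xml. Walter's 10240 entries is the only number taken from the thread; the initialSize is an assumption, and whether a documentCache this large pays off still depends on the hit ratio you actually measure.

<!-- Sketch only: size follows Walter's report, everything else is illustrative. -->
<documentCache class="solr.LRUCache"
               size="10240"
               initialSize="1024"
               autowarmCount="0"/>  <!-- the document cache cannot be autowarmed across searchers -->

<!-- With lazy loading enabled, only the stored fields named in "fl" (plus
     any fields a component such as highlighting actually touches) are
     loaded from the stored document. -->
<enableLazyFieldLoading>true</enableLazyFieldLoading>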