Hmm, interesting. I'll have to look closer...
On Sun, Jan 20, 2013 at 3:50 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> I routinely see hit rates over 75% on the document cache. Perhaps yours is
> too small. Mine is set at 10240 entries.
>
> wunder
>
> On Jan 20, 2013, at 8:08 AM, Erick Erickson wrote:
>
>> About your question on the document cache: typically the document cache
>> has a pretty low hit ratio. I've rarely, if ever, seen it get hit very
>> often. And remember that this cache is only hit when assembling the
>> response for a few documents (your page size).
>>
>> Bottom line: I wouldn't worry about this cache much. It's quite useful
>> for processing a particular query faster, but it isn't really intended
>> for cross-query use.
>>
>> Really, I think you're putting the cart before the horse here. Run it up
>> the flagpole and try it. Rely on the OS to do its job
>> (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
>> Find a bottleneck _then_ tune. Premature optimization and all that....
>>
>> Several tens of millions of docs isn't that large unless the text fields
>> are enormous.
>>
>> Best
>> Erick
>>
>> On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh <isaac.he...@gmail.com> wrote:
>>> Ok. Thank you everyone for your helpful answers.
>>> I understand that the fieldValueCache is not used for resolving queries.
>>> Is there any cache that can help this basic scenario (a lot of different
>>> queries over a small set of fields)?
>>> Does Lucene's FieldCache help (implicitly)?
>>> How can I use RAM to reduce I/O for this type of query?
>>>
>>> On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe
>>> <tomasflo...@gmail.com> wrote:
>>>
>>>> No, the fieldValueCache is not used for resolving queries, only for
>>>> multi-token faceting and apparently for the stats component too. The
>>>> document cache keeps in memory the stored content of the fields you are
>>>> retrieving or highlighting on. It will hit if the same document matches
>>>> the query multiple times and the same fields are requested, but as
>>>> Erick said, it is mainly important when multiple components in the same
>>>> request need to access the same data.
>>>>
>>>> I think soft committing every 10 minutes is totally fine, but you
>>>> should hard commit more often if you are going to be using the
>>>> transaction log. openSearcher=false essentially tells Solr not to open
>>>> a new searcher after the (hard) commit, so you won't see the newly
>>>> indexed data and the caches won't be flushed. openSearcher=false makes
>>>> sense when you are using hard commits together with soft commits: since
>>>> the soft commit is already dealing with opening and closing searchers,
>>>> you don't need hard commits to do it.
>>>>
>>>> Tomás
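For concreteness, a minimal solrconfig.xml sketch of the hard/soft-commit split Tomás describes above: frequent hard commits with openSearcher=false so the transaction log stays small and caches are not flushed, plus soft commits on the 10-minute freshness interval discussed in this thread. The 5-minute hard-commit interval and the updateLog directory are illustrative assumptions, not settings taken from the thread.

<!-- Sketch only: intervals are illustrative assumptions, tune them for your
     own tlog size and freshness needs. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

  <!-- Hard commit every 5 minutes: rolls over the tlog and closes segments,
       but openSearcher=false means no new searcher is opened, so caches
       stay warm and the newly indexed data is not yet visible. -->
  <autoCommit>
    <maxTime>300000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit every 10 minutes: opens a new searcher, so this interval
       alone controls index freshness (and cache invalidation). -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>
  </autoSoftCommit>
</updateHandler>

With this split, the soft commit is the only event that opens a new searcher, which is exactly why openSearcher=false on the hard commit is safe.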
>>>> On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh <isaac.he...@gmail.com>
>>>> wrote:
>>>>
>>>>> Unfortunately, it seems
>>>>> (http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html)
>>>>> that these caches are not per-segment. In this case, I want to (soft)
>>>>> commit less frequently. Am I right?
>>>>>
>>>>> Tomás, since the fieldValueCache is very similar to Lucene's
>>>>> FieldCache, I guess it contributes a lot to the time of standard (not
>>>>> only faceted) queries. The SolrWiki claims that it is primarily used
>>>>> by faceting. What does that say about complex textual queries?
>>>>>
>>>>> documentCache:
>>>>> Erick, after query processing finishes, don't some documents stay in
>>>>> the documentCache? Can't I use it to accelerate queries that need to
>>>>> retrieve stored fields of documents? In that case, a bigger
>>>>> documentCache could hold more documents.
>>>>>
>>>>> About commit frequency:
>>>>> Hard commit: "openSearcher=false" seems like a nice solution. Where
>>>>> can I read about this? (I found nothing but one unexplained sentence
>>>>> in the SolrWiki.)
>>>>> Soft commit: in my case, the required index freshness is 10 minutes.
>>>>> Soft committing every 10 minutes is similar to storing all of the
>>>>> documents in a queue (outside of Solr) and indexing them in bulk every
>>>>> 10 minutes.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe
>>>>> <tomasflo...@gmail.com> wrote:
>>>>>
>>>>>> I think the fieldValueCache is not per segment; only the fieldCache
>>>>>> is. However, unless I'm missing something, this cache is only used
>>>>>> for faceting on multivalued fields.
>>>>>>
>>>>>> On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson
>>>>>> <erickerick...@gmail.com> wrote:
>>>>>>
>>>>>>> filterCache: this is bounded by (maxDoc / 8) bytes * (num filters in
>>>>>>> cache). Notice the /8: it reflects the fact that each filter is
>>>>>>> represented by a bitset over the _internal_ Lucene IDs. The uniqueId
>>>>>>> has no bearing here whatsoever. This is, in a nutshell, why warming
>>>>>>> is required: the internal Lucene IDs may change. Note also that it's
>>>>>>> maxDoc; the internal arrays have "holes" for deleted documents.
>>>>>>>
>>>>>>> Note this is an _upper_ bound. If only a few docs match, the size
>>>>>>> will be (num of matching docs) * sizeof(int).
>>>>>>>
>>>>>>> fieldValueCache: I don't think so, although I'm a bit fuzzy on this.
>>>>>>> It depends on whether these are "per-segment" caches or not. Any
>>>>>>> per-segment cache is still valid.
>>>>>>>
>>>>>>> Think of the documentCache as intended to hold the stored fields
>>>>>>> while various components operate on them, thus avoiding repeated
>>>>>>> fetches of the data from disk. It's _usually_ not too big a worry.
>>>>>>>
>>>>>>> About hard commits once a day: that's _extremely_ long. Think
>>>>>>> instead of committing more frequently with openSearcher=false. If
>>>>>>> nothing else, your transaction log will grow lots and lots and lots.
>>>>>>> I'm thinking on the order of 15 minutes, or possibly even much less,
>>>>>>> with softCommits happening more often, maybe every 15 seconds. In
>>>>>>> fact, I'd start out with soft commits every 15 seconds and hard
>>>>>>> commits (openSearcher=false) every 5 minutes. The problem with hard
>>>>>>> commits happening once a day is that, if for any reason the server
>>>>>>> is interrupted, on startup Solr will try to replay the entire
>>>>>>> transaction log to ensure index integrity. Not to mention that your
>>>>>>> tlog will be huge, and that there is some memory usage for each
>>>>>>> document in the tlog. Hard commits roll over the tlog, flush the
>>>>>>> in-memory tlog pointers, close index segments, etc.
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
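To put Erick's bound in configuration terms, here is a sketch of the filterCache and fieldValueCache declarations that live in the <query> section of solrconfig.xml. The size and autowarmCount values are illustrative assumptions, not recommendations from this thread; the comment works the maxDoc/8 estimate through for the 10M-document index described in the original question quoted below.

<!-- Sketch only: sizes and autowarm counts are illustrative assumptions.
     These elements belong inside the <query> section of solrconfig.xml. -->

<!-- Each cached filter is a bitset over maxDoc internal Lucene IDs, i.e.
     roughly maxDoc/8 bytes in the worst case. For a 10M-document index that
     is about 1.25 MB per entry, so size="512" bounds this cache at roughly
     640 MB. -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="128"/>

<!-- Used for multi-token faceting (and the stats component), not for
     resolving ordinary queries. -->
<fieldValueCache class="solr.FastLRUCache"
                 size="512"
                 autowarmCount="0"
                 showItems="32"/>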
>>>>>>> On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh <isaac.he...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am going to build a big Solr (4.0?) index, which will hold some
>>>>>>>> dozens of millions of documents. Each document has some dozens of
>>>>>>>> fields and one big textual field.
>>>>>>>> The queries on the index are non-trivial and somewhat long (they
>>>>>>>> might contain hundreds of terms). No query is identical to another.
>>>>>>>>
>>>>>>>> Now, I want to analyze the cache performance (before setting up the
>>>>>>>> whole environment) in order to estimate how much RAM I will need.
>>>>>>>>
>>>>>>>> filterCache:
>>>>>>>> In my scenario, every query has some filters. Let's say that each
>>>>>>>> filter matches 1M documents out of 10M. Should the estimated memory
>>>>>>>> usage be 1M * sizeof(uniqueId) * num-of-filters-in-cache?
>>>>>>>>
>>>>>>>> fieldValueCache:
>>>>>>>> Because no two queries are alike, I guess that the fieldValueCache
>>>>>>>> is the most important factor in query performance. Here comes a
>>>>>>>> more general question: I'm constantly indexing new documents, and
>>>>>>>> soft commits will be performed every 10 minutes. Does that mean the
>>>>>>>> cache is meaningless after every 10 minutes?
>>>>>>>>
>>>>>>>> documentCache:
>>>>>>>> enableLazyFieldLoading will be enabled, and "fl" contains a very
>>>>>>>> small set of fields. BUT I need to return highlighting on about
>>>>>>>> (possibly) 20 fields. Does the highlighting component use the
>>>>>>>> documentCache? I guess that highlighting requires the whole field
>>>>>>>> to be loaded into the documentCache. Will that happen only for
>>>>>>>> fields that matched a term from the query?
>>>>>>>>
>>>>>>>> And one more question: I'm planning to hard commit once a day.
>>>>>>>> Should I prepare for significant RAM usage growth between hard
>>>>>>>> commits? (Consider a lot of new documents in this period...)
>>>>>>>> Does this RAM come from the same pool as the caches? Can an
>>>>>>>> OutOfMemory exception happen in this scenario?
>>>>>>>>
>>>>>>>> Thanks a lot.
>
> --
> Walter Underwood
> wun...@wunderwood.org
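Finally, tying Walter's documentCache report back to the original question about lazy field loading and highlighting, here is a sketch of the related settings, also from the <query> section of solrconfig.xml. Walter's 10240 entries is the only number taken from the thread; the initialSize is an assumption, and whether a documentCache this large pays off still depends on the hit ratio you actually measure.

<!-- Sketch only: size follows Walter's report, everything else is illustrative. -->
<documentCache class="solr.LRUCache"
               size="10240"
               initialSize="1024"
               autowarmCount="0"/>  <!-- the document cache cannot be autowarmed across searchers -->

<!-- With lazy loading enabled, only the stored fields named in "fl" (plus
     any fields a component such as highlighting actually touches) are
     loaded from the stored document. -->
<enableLazyFieldLoading>true</enableLazyFieldLoading>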