Wow Erick, the MMap article is a very fundamental one. It totally changed my view. It should be mentioned in SolrPerformanceFactors in the SolrWiki... I'm sorry I did not know about it before. Thanks a lot. I promise to share my results when my cart starts to fly :)

On Sun, Jan 20, 2013 at 6:08 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> About your question about document cache: Typically the document cache has a pretty low hit-ratio. I've rarely, if ever, seen it get hit very often. And remember that this cache is only hit when assembling the response for a few documents (your page size).
>
> Bottom line: I wouldn't worry about this cache much. It's quite useful for processing a particular query faster, but not really intended for cross-query use.
>
> Really, I think you're getting the cart before the horse here. Run it up the flagpole and try it. Rely on the OS to do its job (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html). Find a bottleneck _then_ tune. Premature optimization and all that....
>
> Several tens of millions of docs isn't that large unless the text fields are enormous.
>
> Best
> Erick
>
> On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh <isaac.he...@gmail.com> wrote:
>
> > Ok. Thank you everyone for your helpful answers.
> > I understand that fieldValueCache is not used for resolving queries.
> > Is there any cache that can help this basic scenario (a lot of different queries, on a small set of fields)?
> > Does Lucene's FieldCache help (implicitly)?
> > How can I use RAM to reduce I/O in this type of queries?
> >
> > On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
> >
> >> No, the fieldValueCache is not used for resolving queries. Only for multi-token faceting and apparently for the stats component too. The document cache maintains in memory the stored content of the fields you are retrieving or highlighting on. It'll hit if the same document matches the query multiple times and the same fields are requested, but as Erick said, it is important for cases when multiple components in the same request need to access the same data.
> >>
> >> I think soft committing every 10 minutes is totally fine, but you should hard commit more often if you are going to be using the transaction log. openSearcher=false will essentially tell Solr not to open a new searcher after the (hard) commit, so you won't see the newly indexed data and caches won't be flushed. openSearcher=false makes sense when you are using hard-commits together with soft-commits: as the "soft-commit" is dealing with opening/closing searchers, you don't need hard commits to do it.
> >>
> >> Tomás
> >>
> >> On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh <isaac.he...@gmail.com> wrote:
> >>
> >> > Unfortunately, it seems (http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that these caches are not per-segment. In this case, I want to (soft) commit less frequently. Am I right?
> >> >
> >> > Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I guess it has a big contribution to standard (not only faceted) query time. SolrWiki claims that it is primarily used by faceting. What does that say about complex textual queries?
> >> >
> >> > documentCache:
> >> > Erick, after query processing is finished, don't some documents stay in the documentCache? Can't I use it to accelerate queries that retrieve stored fields of documents? In this case, a big documentCache can hold more documents.
> >> >
> >> > About commit frequency:
> >> > HardCommit: "openSearcher=false" seems like a nice solution. Where can I read about this? (I found nothing but one unexplained sentence in SolrWiki.)
> >> > SoftCommit: In my case, the required index freshness is 10 minutes. The plan to soft commit every 10 minutes is similar to storing all of the documents in a queue (outside of Solr), and indexing a bulk every 10 minutes.
> >> >
> >> > Thanks.
> >> >
> >> > On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
> >> >
> >> > > I think fieldValueCache is not per segment, only fieldCache is. However, unless I'm missing something, this cache is only used for faceting on multivalued fields
> >> > >
> >> > > On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> > >
> >> > > > filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters in cache). Notice the /8. This reflects the fact that the filters are represented by a bitset on the _internal_ Lucene ID. UniqueId has no bearing here whatsoever. This is, in a nutshell, why warming is required, the internal Lucene IDs may change. Note also that it's maxDoc, the internal arrays have "holes" for deleted documents.
> >> > > >
> >> > > > Note this is an _upper_ bound, if there are only a few docs that match, the size will be (num of matching docs) * sizeof(int).
> >> > > >
> >> > > > fieldValueCache. I don't think so, although I'm a bit fuzzy on this. It depends on whether these are "per-segment" caches or not. Any "per segment" cache is still valid.
> >> > > >
> >> > > > Think of documentCache as intended to hold the stored fields while various components operate on it, thus avoiding repeatedly fetching the data from disk. It's _usually_ not too big a worry.
> >> > > >
> >> > > > About hard-commits once a day. That's _extremely_ long. Think instead of committing more frequently with openSearcher=false. If nothing else, your transaction log will grow lots and lots and lots. I'm thinking on the order of 15 minutes, or possibly even much less. With softCommits happening more often, maybe every 15 seconds.
> >> > > > In fact, I'd start out with soft commits every 15 seconds and hard commits (openSearcher=false) every 5 minutes. The problem with hard commits being once a day is that, if for any reason the server is interrupted, on startup Solr will try to replay the entire transaction log to assure index integrity. Not to mention that your tlog will be huge. Not to mention that there is some memory usage for each document in the tlog. Hard commits roll over the tlog, flush the in-memory tlog pointers, close index segments, etc.
> >> > > >
> >> > > > Best
> >> > > > Erick
> >> > > >
> >> > > > On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh <isaac.he...@gmail.com> wrote:
> >> > > >
> >> > > > > Hi,
> >> > > > >
> >> > > > > I am going to build a big Solr (4.0?) index, which holds some dozens of millions of documents. Each document has some dozens of fields, and one big textual field.
> >> > > > > The queries on the index are non-trivial, and a little bit long (might be hundreds of terms). No query is identical to another.
> >> > > > >
> >> > > > > Now, I want to analyze the cache performance (before setting up the whole environment), in order to estimate how much RAM I will need.
> >> > > > >
> >> > > > > filterCache:
> >> > > > > In my scenario, every query has some filters. Let's say that each filter matches 1M documents, out of 10M. Should the estimated memory usage be 1M * sizeof(uniqueId) * num-of-filters-in-cache?
> >> > > > >
> >> > > > > fieldValueCache:
> >> > > > > Due to the difference between queries, I guess that fieldValueCache is the most important factor on query performance. Here comes a generic
Here comes a generic > >> > > > question: > >> > > > > I'm indexing new documents to the index constantly. Soft commits > >> will > >> > > be > >> > > > > performed every 10 mins. Does it say that the cache is > meaningless, > >> > > after > >> > > > > every 10 minutes? > >> > > > > > >> > > > > documentCache: > >> > > > > enableLazyFieldLoading will be enabled, and "fl" contains a very > >> > small > >> > > > set > >> > > > > of fields. BUT, I need to return highlighting on about > (possibly) > >> 20 > >> > > > > fields. Does the highlighting component use the documentCache? I > >> > guess > >> > > > that > >> > > > > highlighting requires the whole field to be loaded into the > >> > > > documentCache. > >> > > > > Will it happen only for fields that matched a term from the > query? > >> > > > > > >> > > > > And one more question: I'm planning to hard-commit once a day. > >> > Should I > >> > > > > prepare to a significant RAM usage growth between hard-commits? > >> > > > (consider a > >> > > > > lot of new documents in this period...) > >> > > > > Does this RAM comes from the same pool as the caches? An > >> OutOfMemory > >> > > > > exception can happen is this scenario? > >> > > > > > >> > > > > Thanks a lot. > >> > > > > >> > > > >> > > >> >