> On 6-Feb-08, at 4:32 PM, Fuad Efendi wrote:
>
> >> Indeed the field cache method works much better when the values are
> >> single-valued. Unfortunately, there is no way for solr to know that
> >> the analyzer is only outputting a single token per document, else we
> >> could apply this optimization automatically.
> >
> > Thanks Mike,
> >
> > Some clarification:
> > *single-valued* in my previous Email means *field-with-single-only-value*
> > (in SOLR terms, multiValued="false"), and not a *single-token*. This
> > *single-valued* field is analyzed/tokenized and it is *multi-valued-token*
> > so that fieldCache can't work. And I have extremely good performance
> > improvements, *without* Lucene's FieldCache optimization!
>
> That seems extremely odd. Sure you aren't just sending fewer unique
> tokens?
>
> -Mike

Yes, that is true: I probably have about 1,000,000 unique tokens (at least, the filterCache size is 1,000,000). The tokens include different forms of words, such as Telescope and Telescoping; I am not using EnglishPorterFilter yet.

Each single-valued field contains about 3-7 tokens. The database holds 6,000,000 documents, and so far I have reindexed 30% of it in SOLR, changing the multi-valued field to a single-valued one (with some filters...). I did this reindexing hoping to reduce the total number of distinct tokens. I'll finish reindexing in a few (maybe 24) hours :)

If you browse the website, you may notice some long product names that even contain the product price as a separate field value. The same goes for Category, where I use the product name(s) but with a different tokenizer. I am filtering product names now, including category.

A sample of multi-valued product (and category) 'bad' data that has not been reindexed yet:
http://www.tokenizer.org/large/price.htm

I can't even say that the index became smaller after reindexing; it is 1.6 GB, almost the same as before.

-Fuad
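
P.S. A minimal sketch of the unique-token point, in Python rather than anything SOLR-specific. It counts how many documents contain each distinct token of a tokenized field, then repeats the count after collapsing word forms. The whitespace tokenizer, the `crude_stem` helper, and the three sample product names are all assumptions made up for illustration; `crude_stem` is only a toy stand-in and is not the algorithm behind EnglishPorterFilter.

```python
from collections import Counter


def tokenize(value):
    # Hypothetical analyzer: lowercase, then split on whitespace.
    return value.lower().split()


def crude_stem(token):
    # Toy stand-in for a real stemmer (NOT the Porter algorithm):
    # strip a trailing "ing" or "e" from sufficiently long tokens.
    for suffix in ("ing", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def facet_term_counts(values, normalize=lambda t: t):
    # For each distinct term, count the documents that contain it.
    counts = Counter()
    for value in values:
        counts.update({normalize(t) for t in tokenize(value)})
    return counts


# Made-up sample data: one single-valued (but tokenized) field per document.
product_names = ["Telescope tripod", "Telescoping pole", "Telescope lens cap"]

raw = facet_term_counts(product_names)
stemmed = facet_term_counts(product_names, normalize=crude_stem)

print(len(raw), "distinct raw tokens")
print(len(stemmed), "distinct tokens after crude stemming")
```

On this toy data, Telescope and Telescoping collapse into one term, so the number of distinct terms drops. Whether the effect is large on the real index would have to be measured.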