> On 6-Feb-08, at 4:32 PM, Fuad Efendi wrote:
> 
> >> Indeed the field cache method works much better when the values are
> >> single-valued.  Unfortunately, there is no way for solr to know that
> >> the analyzer is only outputting a single token per document, else we
> >> could apply this optimization automatically.
> >
> > Thanks Mike,
> >
> > Some clarification:
> > *single-valued* in my previous Email means *field-with-single-only-value*
> > (in SOLR terms, multiValued="false"), and not a *single-token*. This
> > *single-valued* field is analyzed/tokenized and it is *multi-valued-token*
> > so that fieldCache can't work. And I have extremely good performance
> > improvements, *without* Lucene's FieldCache optimization!
> 
> That seems extremely odd.  Sure you aren't just sending fewer unique  
> tokens?
> 
> -Mike
> 


Yes, that is true: I probably have about 1,000,000 unique tokens (at least,
the filterCache size is 1,000,000). The tokens include different forms of
words, such as Telescope and Telescoping; I am not using EnglishPorterFilter
yet...
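
To illustrate how word forms inflate the unique-token count, here is a
minimal sketch. It assumes a toy suffix stripper (crude_stem, a hypothetical
helper that is NOT the Porter algorithm) and made-up sample values; it only
shows the effect that stemming would be expected to have:

```python
def crude_stem(token: str) -> str:
    """Toy suffix stripper for illustration only; not the Porter algorithm."""
    for suffix in ("ing", "es", "s", "e"):
        # Strip the first matching suffix, keeping at least 3 characters.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def unique_terms(values, stem=False):
    """Collect distinct lowercase whitespace-separated tokens from field values."""
    terms = set()
    for value in values:
        for token in value.lower().split():
            terms.add(crude_stem(token) if stem else token)
    return terms


# Hypothetical field values, not taken from the real index.
sample = ["Telescope", "Telescoping", "Telescopes mount"]
raw = unique_terms(sample)                 # every word form is a separate term
stemmed = unique_terms(sample, stem=True)  # word forms collapse to one term
print(len(raw), len(stemmed))
```

Without stemming, each word form counts as its own term (and potentially its
own filterCache entry); with stemming, the three Telescope forms collapse
into one.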

Each single-valued field contains about 3-7 tokens; the database holds
6,000,000 documents, and I have reindexed 30% of it (in SOLR), changing the
multi-valued field to a single-valued one (plus some filters...)
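
For a sense of scale, a rough back-of-the-envelope sketch. It assumes, as a
simplification, that a cached filter is stored as one bit per document; the
actual in-memory representation may be much more compact for filters that
match few documents, so treat this as an upper bound, not a measurement:

```python
def bitset_bytes(max_doc: int) -> int:
    """Bytes needed for one bit per document (assumed worst-case layout)."""
    return (max_doc + 7) // 8


docs = 6_000_000     # documents in the index (from this thread)
entries = 1_000_000  # filterCache entries (from this thread)

per_entry = bitset_bytes(docs)              # bytes per cached filter
worst_case_gb = per_entry * entries / 1e9   # if every entry were a full bitset
print(per_entry, worst_case_gb)
```

Under that assumption each entry costs 750,000 bytes, which is why the number
of distinct tokens matters so much here.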

I did this reindexing hoping to reduce the total number of distinct tokens.
I'll finish reindexing in a few (maybe 24) hours :)

If you browse the website, you may notice some large product names that even
contain the product price as a separate field value. The same goes for
Category, where I use the product name(s) but with a different tokenizer. I
am now filtering product names, including the category.
As a sample of 'bad' multi-valued product (and category) data that has not
been reindexed yet: http://www.tokenizer.org/large/price.htm

I can't even say that the index became smaller after reindexing; it is
1.6 GB, almost the same as before.

-Fuad
