Hi,
I'm trying to debug a faceting performance problem. I've pretty much given up, 
but I was hoping someone could shed some light on it.

My index has 80 million documents, all of which are small: one 1000-character 
text field and a bunch of 30-50 character fields. I've got 24 GB of RAM 
allocated to the JVM on a brand-new server.

I have one field in my schema that represents a city name. It is a 
non-standardized free-text field, so you get variants like the following:

HOUSTON
HOUSTON TX
HOUSTON, TX
HOUSTON (TX)

I would like to facet on this field, and thought I could apply some 
tokenizers/filters to modify the indexed value and strip out stopwords. To tie 
it all together I created a filter that concatenates all of the tokens back 
into a single token at the end. Here's my field definition from schema.xml:

        
        <fieldType name="portCity" class="solr.TextField">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StandardFilterFactory"/>
                <!-- stopwords common across all fields -->
                <filter class="solr.StopFilterFactory" words="stopwords.txt"
                        enablePositionIncrements="true"/>
                <!-- stopwords specific to port cities -->
                <filter class="solr.StopFilterFactory" words="portCityStopwords.txt"
                        enablePositionIncrements="true"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <!-- pull all tokens back together again -->
                <filter class="com.tradebytes.solr.ConcatenateFilterFactory"/>
            </analyzer>
        </fieldType>
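
In case it helps, the concatenate filter is essentially the following (a 
trimmed-down sketch of the idea rather than the exact class; it uses the 
CharTermAttribute API, so adjust for your Lucene version). It just drains the 
upstream token stream and emits one combined token:

        import java.io.IOException;

        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        /** Joins every token from the underlying stream into one token. */
        public final class ConcatenateFilter extends TokenFilter {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            private boolean exhausted = false;

            public ConcatenateFilter(TokenStream input) {
                super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
                if (exhausted) {
                    return false;
                }
                exhausted = true;
                StringBuilder joined = new StringBuilder();
                // Drain the upstream tokens (already stop-filtered and de-duped).
                while (input.incrementToken()) {
                    if (joined.length() > 0) {
                        joined.append(' ');
                    }
                    joined.append(termAtt.buffer(), 0, termAtt.length());
                }
                if (joined.length() == 0) {
                    return false; // nothing survived the stopword filters
                }
                clearAttributes();
                termAtt.setEmpty().append(joined);
                return true;
            }

            @Override
            public void reset() throws IOException {
                super.reset();
                exhausted = false;
            }
        }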

The analysis seems to be working as I expected, and the index contains the 
values I want. However, when I facet on this field the query typically returns 
in around 30s, versus sub-second when I just use a solr.StrField. I understand 
from the lists that the method Solr uses to compute the facet counts differs 
depending on whether or not the field is tokenized, but I thought I could 
mitigate that somewhat by making sure that each field value produced only one 
token.
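
For reference, the facet request itself is nothing unusual; it's along these 
lines (host and core are placeholders):

        http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=portCity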

Is there anything else I can do here? Can someone shed some light on why a 
tokenized field takes longer to facet, even when there is only one token per 
field value? I suspect I am going to be stuck implementing custom field 
translation before loading, but I was hoping I could leverage some of the great 
filters that are built into Solr/Lucene. I've played around with the 
FieldCache, but so far no luck.
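
(By "FieldCache" I mean the FieldCache-based facet counting which, if I 
understand correctly, can be requested explicitly with the facet.method 
parameter in Solr 1.4+, e.g.

        &facet=true&facet.field=portCity&facet.method=fc

but that made no difference for me.)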

BTW, love Solr/Lucene - great job!

Thanks,
Simon
