Hi,
I'm trying to debug a faceting performance problem. I've pretty much given up, but
I was hoping someone could shed some light on it.
My index has 80 million documents, all of which are small: one 1,000-character text
field and a handful of 30-50 character fields. The JVM has 24 GB of RAM allocated
on a brand-new server.
I have one field in my schema which represents a city name. It's a
non-standardized free-text field, so you get values like the following:
HOUSTON
HOUSTON TX
HOUSTON, TX
HOUSTON (TX)
I would like to facet on this field, and I thought I could apply some tokenizers
and filters to modify the indexed value and strip out stopwords. To tie it all
together, I created a filter that concatenates all of the tokens back into
a single token at the end. Here's my field definition from schema.xml:
<fieldType name="portCity" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <!-- stopwords common across all fields -->
    <filter class="solr.StopFilterFactory"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <!-- stopwords specific to port cities -->
    <filter class="solr.StopFilterFactory"
            words="portCityStopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <!-- pull all the tokens back together again -->
    <filter class="com.tradebytes.solr.ConcatenateFilterFactory"/>
  </analyzer>
</fieldType>
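To make the intent concrete, here's a rough Python approximation of what I expect the chain to do. The stopword set and the regex tokenization are stand-ins for illustration, not the real filter behaviour (in particular, RemoveDuplicatesTokenFilter only deduplicates within a position; this sketch dedupes across the whole stream):

```python
import re

# Stand-ins for stopwords.txt / portCityStopwords.txt
STOPWORDS = {"tx"}

def normalize_city(raw):
    # StandardTokenizer (approximated): split on non-alphanumerics, lowercase
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", raw)]
    # StopFilter: drop stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # RemoveDuplicatesTokenFilter (approximated): drop repeats, keep order
    seen, deduped = set(), []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            deduped.append(t)
    # ConcatenateFilter: glue everything back into one token
    return " ".join(deduped)
```

With this, all four variants above collapse to the single facet value "houston", which is what I see in the index.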
The analysis seems to be working as I expected, and the index contains the
values I want. However, when I facet on this field the query typically returns in
around 30s, versus sub-second when I just use a solr.StrField. I
understand from the lists that the method Solr uses to create the facet
counts differs depending on whether the field is tokenized or not,
but I thought I could mitigate that somewhat by making sure that
each field value produced only one token.
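For what it's worth, the kind of request I'm timing looks roughly like this (host, core, and parameter values are illustrative):

```
http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=portCity&facet.limit=20
```

The only thing I change between the fast and slow runs is whether portCity is the TextField above or a plain solr.StrField.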
Is there anything else I can do here? Can someone shed some light on why a
tokenized field takes longer to facet on, even when there is only one token per
field? I suspect I'm going to be stuck implementing custom field translation
before loading, but I was hoping I could leverage some of the great filters that
are built into Solr / Lucene. I've played around with the FieldCache, but so far
no luck.
BTW, love Solr / Lucene. Great job!
Thanks,
Simon