Hi,

I'm trying to debug a faceting performance problem. I've pretty much given up, but I was hoping someone could shed some light on it.
My index has 80 million documents, all of which are small: one 1000-character text field and a bunch of 30-50 character fields. The JVM has 24G of RAM allocated, on a brand new server.

One field in my schema represents a city name. It is a non-standardized free-text field, so it contains variants like:

    HOUSTON
    HOUSTON TX
    HOUSTON, TX
    HOUSTON (TX)

I would like to facet on this field, and thought I could apply some tokenizers / filters to strip stopwords out of the indexed value. To tie it all together I wrote a filter that concatenates all of the remaining tokens back into a single token at the end (a rough sketch of that filter is at the end of this mail). Here's my field definition from schema.xml:

    <fieldType name="portCity" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <!-- stopwords common across all fields -->
        <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
        <!-- stopwords specific to port cities -->
        <filter class="solr.StopFilterFactory" words="portCityStopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <!-- pull tokens all together again -->
        <filter class="com.tradebytes.solr.ConcatenateFilterFactory"/>
      </analyzer>
    </fieldType>

The analysis seems to be working as I expected, and the index contains the values I want. However, when I facet on this field the query typically returns in around 30 seconds, versus sub-second when I just use a solr.StrField. I understand from the lists that the method Solr uses to compute facet counts differs depending on whether the field is tokenized or not, but I thought I could mitigate that by making sure each field value produced only one token.

Is there anything else I can do here? Can someone shed some light on why a tokenized field takes longer to facet on, even when there is only one token per field value? I suspect I am going to be stuck with implementing custom field translation before loading (sketched at the very end of this mail), but I was hoping I could leverage some of the great filters that are built into Solr / Lucene. I've played around with the field cache but so far no luck.

BTW, love Solr / Lucene - great job!

Thanks,
Simon
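
The filter sketch mentioned above: the concatenating filter does essentially the following. This is only an illustration written against Lucene's attribute-based TokenFilter API, not my exact code (the class and package names are placeholders); ConcatenateFilterFactory is just a thin factory that wraps it.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /** Joins every token from the underlying stream into one single token.
     *  Illustrative sketch only, not the actual com.tradebytes.solr code. */
    public final class ConcatenateFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private boolean done = false;

      public ConcatenateFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (done) {
          return false;
        }
        done = true;

        // Drain the upstream tokens (after stopword removal and deduping)
        // and join them with single spaces.
        StringBuilder joined = new StringBuilder();
        while (input.incrementToken()) {
          if (joined.length() > 0) {
            joined.append(' ');
          }
          joined.append(termAtt.buffer(), 0, termAtt.length());
        }
        if (joined.length() == 0) {
          return false; // nothing survived the stop filters
        }

        // Emit the concatenation as the one and only output token.
        clearAttributes();
        termAtt.setEmpty().append(joined);
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        done = false;
      }
    }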
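
And the pre-load field translation I suspect I'll end up with would look roughly like this, so the cleaned value can go straight into a plain solr.StrField. Again this is only an illustrative sketch, with a made-up stopword list and class name, not something I have actually built:

    import java.util.Arrays;
    import java.util.LinkedHashSet;
    import java.util.Set;

    /** Illustrative sketch: normalize a free-text city value before it is
     *  sent to Solr, so the facet field can stay a plain solr.StrField.
     *  The stopword list here is made up for the example. */
    public final class PortCityNormalizer {
      private static final Set<String> STOPWORDS =
          new LinkedHashSet<String>(Arrays.asList("TX", "USA"));

      public static String normalize(String raw) {
        // LinkedHashSet keeps token order while dropping duplicates,
        // mirroring RemoveDuplicatesTokenFilterFactory.
        Set<String> kept = new LinkedHashSet<String>();
        for (String token : raw.toUpperCase().split("[^A-Z0-9]+")) {
          if (token.length() > 0 && !STOPWORDS.contains(token)) {
            kept.add(token);
          }
        }
        // "HOUSTON", "HOUSTON TX", "HOUSTON, TX" and "HOUSTON (TX)"
        // all come out as "HOUSTON".
        StringBuilder out = new StringBuilder();
        for (String token : kept) {
          if (out.length() > 0) {
            out.append(' ');
          }
          out.append(token);
        }
        return out.toString();
      }
    }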