Hello Simon,

I'll assume you are using Solr 1.3.  Grab the latest Solr nightly and try with 
that - your multi-token facets should be faster (and are you sure, really sure, 
that you are ending up with a single token?).
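
If you want a guaranteed single token without writing a custom concatenating 
filter, KeywordTokenizerFactory plus a couple of normalizing filters usually 
does the job. A rough sketch - the field type name, the patterns, and the idea 
of stripping a trailing state code are illustrative, not your actual setup:

    <fieldType name="cityFacet" class="solr.TextField">
        <analyzer>
            <!-- KeywordTokenizer emits the entire field value as one token -->
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <!-- illustrative: turn punctuation into spaces -->
            <filter class="solr.PatternReplaceFilterFactory"
                    pattern="[,()]" replacement=" " replace="all"/>
            <!-- illustrative: drop a trailing two-letter state code -->
            <filter class="solr.PatternReplaceFilterFactory"
                    pattern="\s+[a-z]{2}\s*$" replacement="" replace="all"/>
            <filter class="solr.TrimFilterFactory"/>
        </analyzer>
    </fieldType>

With that chain "HOUSTON", "HOUSTON TX", "HOUSTON, TX", and "HOUSTON (TX)" all 
come out as the single token "houston", which you can verify on the analysis 
page (/admin/analysis.jsp).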

Also, most likely unrelated to this, the JVM heap looks suspiciously large.  My 
guess is it's too big.  Solr will be happier if you leave some RAM for the OS 
to cache the index itself.
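
For example, if the machine has 32G of RAM (an illustrative number - adjust to 
your hardware), something like this leaves the bulk of it to the OS page cache 
instead of the heap (assuming the example Jetty setup):

    # hypothetical sizing: modest heap, the rest left to the OS for caching index files
    java -Xmx8g -Xms8g -jar start.jar

A 24G heap also means long garbage collection pauses, which can easily show up 
as multi-second query times.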

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Simon Stanlake <sim...@tradebytes.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Wednesday, April 29, 2009 9:55:03 PM
> Subject: understanding facets and tokens
> 
> Hi,
> I'm trying to debug a faceting performance problem. I've pretty much given up, 
> but I was hoping someone could shed some light on it.
> 
> My index has 80 million documents, all of which are small - one 1000-char text 
> field and a bunch of 30-50 char fields. I've got 24G of RAM allocated to the 
> JVM on a brand new server.
> 
> I have one field in my schema which represents a city name. It is a 
> non-standardized free-text field, so you have problems like the following:
> 
> HOUSTON
> HOUSTON TX
> HOUSTON, TX
> HOUSTON (TX)
> 
> I would like to facet on this field and thought I could apply some tokenizers 
> and filters to modify the indexed value and strip out stopwords. To tie it all 
> together, I created a filter that concatenates all of the tokens back into a 
> single token at the end. Here's my field definition from schema.xml:
> 
> [the schema.xml markup was stripped by the list archive; the chain was a 
> tokenizer followed by several filters, including solr.StopFilterFactory with 
> enablePositionIncrements="true", ending in the custom concatenating filter]
> 
> The analysis seems to be working as I expected, and the index contains the 
> values I want. However, when I facet on this field the query typically returns 
> in around 30s, versus sub-second when I just use a solr.StrField. I understand 
> from the lists that the method Solr uses to create the facet counts differs 
> depending on whether or not the field is tokenized, but I thought I could 
> mitigate that by making sure each field value had only one token.
> 
> Is there anything else I can do here? Can someone shed some light on why a 
> tokenized field takes longer, even if there is only one token per field? I 
> suspect I am going to be stuck implementing custom field translation before 
> loading, but I was hoping I could leverage some of the great filters that are 
> built into Solr / Lucene. I've played around with fieldcache but so far no 
> luck.
> 
> BTW, love Solr / Lucene - great job!
> 
> Thanks,
> Simon
