Re: query with stemming, prefix and fuzzy?

Mark Miller Thu, 29 Jan 2009 09:40:24 -0800

Truncation queries and stemming are difficult partners. You likely have to accept compromise. You can try using multiple fields like you are, you can try indexing the full term at the same position as the stemmed term, or you can accept the weirdness that comes from matching on a stemmed form (potentially very confusing for a user).
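To illustrate the second option, here is a minimal Python sketch of stacking the full term and its stem at the same position. It is not Lucene or Solr code: `toy_stem` and `analyze` are hypothetical names, and the suffix-stripping rule is a made-up stand-in for a real stemmer.

```python
def toy_stem(token):
    # Hypothetical toy stemmer: strips a few English suffixes.
    # A real analyzer would use a proper stemming algorithm instead.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Return (term, position) pairs with each stem stacked on its full term."""
    postings = []
    for position, token in enumerate(text.lower().split()):
        # The full term is always indexed.
        postings.append((token, position))
        stem = toy_stem(token)
        if stem != token:
            # Same position as the full term, so phrase queries still line up.
            postings.append((stem, position))
    return postings


postings = analyze("Houses burning")
# A prefix query like house* can match the full term "houses",
# even though the stem in this toy example is "hous".
prefix_hits = [term for term, _ in postings if term.startswith("house")]
```

With both forms in the index, a stemmed query and a prefix query can each find something to match in the same field, at the cost of a larger index.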

In any case though, a query parser that supports FuzzyQuery should not be analyzing it. What parser are you using? If it is analyzing the fuzzy syntax, it likely doesn't support it.

Fuzzy queries are slow - especially if they match a lot of terms. An edit distance is calculated for each enumerated term to filter out what doesn't match, and a BooleanQuery is created with a clause for each term that does.

The prefix length determines how many terms are enumerated - with the default of 0, every term is enumerated, I think, and an edit distance is calculated for each one to filter them out. That's really slow - a longer prefix will significantly cut down the number of terms that need to be enumerated.

Think of mark~0.6 - with a prefix length of 0, I will enumerate every term and check the edit distance. With a prefix length of 2, I will only enumerate the terms that start with "ma" and calculate an edit distance for those. One might be just a bit faster.
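The effect of the prefix length can be sketched in a few lines of Python. This is an illustration only, not Lucene's implementation: the function names are hypothetical, and the similarity formula (1 minus the edit distance divided by the length of the shorter string) is an assumption about how a threshold like 0.6 gets applied.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def similarity(query, term):
    # Assumed scoring: 1 - distance / length of the shorter string.
    shortest = min(len(query), len(term))
    if shortest == 0:
        return 0.0
    return 1.0 - edit_distance(query, term) / shortest


def fuzzy_candidates(terms, query, min_similarity, prefix_length):
    """Return (matching terms, number of edit distances computed)."""
    prefix = query[:prefix_length]
    checked = 0
    matches = []
    for term in terms:
        # A sorted term dictionary could seek straight to the prefix;
        # this linear scan only skips the expensive distance computation.
        if not term.startswith(prefix):
            continue
        checked += 1
        if similarity(query, term) >= min_similarity:
            matches.append(term)
    return matches, checked


terms = ["make", "mark", "marks", "park", "market", "lark", "zebra"]
all_matches, all_checked = fuzzy_candidates(terms, "mark", 0.6, 0)
ma_matches, ma_checked = fuzzy_candidates(terms, "mark", 0.6, 2)
```

On this toy term list, a prefix length of 0 computes an edit distance for all seven terms, while a prefix length of 2 computes it for only the four terms starting with "ma". The trade-off is that near matches with a different first letter, such as "park" and "lark", are no longer found.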

The latest trunk build of Lucene will let us switch fuzzy query to use a constant score mode - this will eliminate the BooleanQuery and should perform much better on a large index. Solr already uses a constant score mode for Prefix and Wildcard queries.

How big is your index? If it's not that big, it may be odd that you're seeing things that slow (the number of unique terms in the index will play a large role).


- Mark

Gert Brinkmann wrote:

Hello,

I am trying to get Solr to properly work. I have set up a Solr test
server (using jetty as mentioned in the tutorial). Also I had to modify
the schema.xml so that I have different fields for different languages
(with their own stemmers) that occur in the content management system
that I am indexing. So far everything does work fine including snippet
highlighting.

But now I am having some problems with two things:

A) fuzzy search

When trying to do a fuzzy search the analyzers seem to break up a search
string like "house~0.6" into "house", "0" and "6" so that e.g. a single
"6" is highlighted, too. So I tried to use an additional raw-field
without any stemming and just a lower case and white space analyzer.
This seems to work fine. But fuzzy query is very slow and takes 100% CPU
for several seconds with only one query at a time.

What can I do to speed up the fuzzy query? I e.g. have found a Lucene
parameter prefixLength but no corresponding Solr option. Does this exist?
Are there some other options to pay attention to?


B) combine stemming, prefix and fuzzy search

Is there a way to combine all three of these query types in one query?
Especially stemming and prefixing? I think it would be problematic as a
"house*" would be analyzed to "house" with the usual analyzers that are
required for stemming?

Do I need different query type fields and combine them with a boolean
OR in the query? Something like

  data:house OR data_fuzzy:house~0.6 OR data_prefix:house*

This feels a little bit circuitous. Is there a way to use
"house*~.6" including correct stemming?

Thank you,
Gert
