Truncation queries and stemming are difficult partners. You will likely
have to accept a compromise. You can use multiple fields like you are
doing now, you can index the full term at the same position as the
stemmed term, or you can accept the weirdness that comes from matching
on a stemmed form (potentially very confusing for a user).
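To make the second option concrete, here is a rough Python sketch of what "the full term at the same position as the stemmed term" means. It is only an illustration: toy_stem and tokens_with_original are made-up names, and the suffix stripping is a stand-in for a real stemmer, not what any Solr filter actually does. In Lucene terms, the stemmed token would be emitted with a position increment of 0 so that it stacks on the original.

```python
def toy_stem(token):
    """Hypothetical, deliberately crude stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def tokens_with_original(text):
    """Emit (position, term) pairs, keeping the unstemmed term alongside
    the stemmed one at the same position."""
    out = []
    for position, raw in enumerate(text.lower().split()):
        out.append((position, raw))
        stemmed = toy_stem(raw)
        if stemmed != raw:
            # Same position as the original, so phrase queries still line up.
            out.append((position, stemmed))
    return out


if __name__ == "__main__":
    for position, term in tokens_with_original("Houses near parks"):
        print(position, term)
```

With both forms in the index, a truncation query can match the unstemmed term while a plain term query still matches the stem, at the cost of a larger index.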
In any case though, a query parser that supports fuzzy queries should
not be analyzing the fuzzy syntax. What parser are you using? If it is
analyzing the fuzzy syntax, it likely doesn't support it.
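As a rough illustration of the difference, the Python sketch below contrasts running the whole clause through an analyzer with pulling the fuzzy suffix off first. It is not the Lucene or Solr query parser: naive_analyze, parse_fuzzy and the FUZZY_CLAUSE pattern are hypothetical names, and the "analyzer" just lowercases and splits on non-alphanumeric characters.

```python
import re


def naive_analyze(text):
    """Stand-in analyzer: split on anything that is not a letter or digit,
    then lowercase. This mimics what happens when the fuzzy syntax is analyzed."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


# Hypothetical pattern for a single clause such as "house~0.6" or "house~".
FUZZY_CLAUSE = re.compile(r"^(?P<term>[^~\s]+)~(?P<sim>\d*\.?\d+)?$")


def parse_fuzzy(clause):
    """Split a fuzzy clause into (term, similarity) before any analysis.
    Returns None if the clause has no fuzzy operator; similarity is None
    when no value follows the '~'."""
    match = FUZZY_CLAUSE.match(clause)
    if match is None:
        return None
    sim = float(match.group("sim")) if match.group("sim") else None
    # Only the term part gets normalized, never the "~0.6" suffix.
    return match.group("term").lower(), sim


if __name__ == "__main__":
    print(naive_analyze("house~0.6"))
    print(parse_fuzzy("house~0.6"))
```

The first call shows the symptom from the original question (stray "0" and "6" terms), while the second keeps the similarity out of the analyzed text.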
Fuzzy queries are slow - especially if they match a lot of terms. The
candidate terms are enumerated, an edit distance is calculated for each
one to filter out what doesn't match, and a BooleanQuery is created with
a clause for each term that is left.
The prefix length determines how many terms are enumerated - with the
default of 0, every term is enumerated, I think, and an edit distance is
calculated for each one to filter them out. That's really slow - a
longer prefix will significantly cut down the number of terms that need
to be enumerated.
Think of mark~0.6 - with a 0 prefix I will enumerate every term and
check the edit distance. With a 2 prefix I will only enumerate the terms
that start with ma, and calculate an edit distance for those. One might
be just a bit faster.
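The mark~0.6 example above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Lucene's implementation: the term list is made up, fuzzy_candidates is a hypothetical helper, and the 0.6 similarity is replaced by a plain maximum edit count to keep the arithmetic simple.

```python
def edit_distance(a, b):
    """Levenshtein distance using the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_candidates(terms, query, prefix_length, max_edits):
    """Return (enumerated, matches): the terms that had to be visited and
    the subset that survives the edit-distance filter."""
    prefix = query[:prefix_length]
    enumerated = [t for t in terms if t.startswith(prefix)]
    matches = [t for t in enumerated if edit_distance(t, query) <= max_edits]
    return enumerated, matches


# Made-up stand-in for the unique terms of one field.
TERMS = ["house", "mark", "marc", "mask", "park", "dark", "market", "zebra"]

if __name__ == "__main__":
    for prefix_length in (0, 2):
        enumerated, matches = fuzzy_candidates(TERMS, "mark", prefix_length, 1)
        print(prefix_length, len(enumerated), matches)
```

With a prefix of 0 all eight terms are visited; with a prefix of 2 only the four that start with "ma" are. Note the trade-off: terms such as "park" and "dark" match with a 0 prefix but can no longer match once the first two characters are required to be identical.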
The latest trunk build of Lucene will let us switch FuzzyQuery to use a
constant score mode - this will eliminate the BooleanQuery and should
perform much better on a large index. Solr already uses a constant score
mode for Prefix and Wildcard queries.
How big is your index? If it's not that big, it may be odd that you're
seeing things that slow (the number of unique terms in the index will
play a large role).
- Mark
Gert Brinkmann wrote:
Hello,
I am trying to get Solr to properly work. I have set up a Solr test
server (using jetty as mentioned in the tutorial). Also I had to modify
the schema.xml so that I have different fields for different languages
(with their own stemmers) that occur in the content management system
that I am indexing. So far everything does work fine including snippet
highlighting.
But now I am having some problems with two things:
A) fuzzy search
When trying to do a fuzzy search the analyzers seem to break up a search
string like "house~0.6" into "house", "0" and "6" so that e.g. a single
"6" is highlighted, too. So I tried to use an additional raw-field
without any stemming and just a lower case and white space analyzer.
This seems to work fine. But fuzzy query is very slow and takes 100% CPU
for several seconds with only one query at a time.
What can I do to speed up the fuzzy query? I have e.g. found a Lucene
parameter prefixLength but no corresponding Solr option. Does this
exist? Are there some other options to pay attention to?
B) combine stemming, prefix and fuzzy search
Is there a way to combine all three of these query types in one query?
Especially stemming and prefixing? I think it would be problematic as a
"house*" would be analyzed to "house" with the usual analyzers that are
required for stemming?
Do I need different query type fields and combine them with a boolean
OR in the query? Something like
data:house OR data_fuzzy:house~0.6 OR data_prefix:house*
This feels to be a little bit circuitous. Is there a way to use
"house*~.6" including correct stemming?
Thank you,
Gert