Re: query with stemming, prefix and fuzzy?

Mark Miller Fri, 30 Jan 2009 10:07:42 -0800

Gert Brinkmann wrote:

Thanks, Mark, for your answer,


Mark Miller wrote:

Truncation queries and stemming are difficult partners. You likely have
to accept compromise. You can try using multiple fields like you are,


I already have multiple fields, one per language, to be able to use
different stemmers. Wouldn't this become too much?

Possibly. Especially if you are using norms with all of those fields. Depends on your index though.

you can try indexing the full term at the same position as the stemmed
term,


What does "at the same position" mean, and how could I do this?

Write a custom filter. Normally, for every term, its position is incremented by 1 as the terms are broken out in tokenization. You can change this and index terms at the same position using your own filter. There are ramifications, because you are adding more terms to your index, but it allows you to index multiple forms of a term at the same position (so that phrase queries still work as expected).
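
The idea can be illustrated outside Lucene with a small sketch. This is not Lucene's TokenFilter API; the `toy_stem` function and the (term, increment) tuples are assumptions made purely for illustration. Each token carries a position increment: 1 moves to the next position, 0 stacks the token on top of the previous one.

```python
def toy_stem(term):
    """Hypothetical stand-in for a real stemmer: strips a trailing 's'."""
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


def stacked_tokens(text):
    """Yield (term, position_increment) pairs.

    The original term advances the position by 1; its stemmed form, when
    different, is emitted with increment 0 so both share one position.
    """
    for word in text.lower().split():
        yield word, 1
        stem = toy_stem(word)
        if stem != word:
            yield stem, 0


def positions(tokens):
    """Resolve increments into absolute positions: {position: set of terms}."""
    index = {}
    pos = -1
    for term, inc in tokens:
        pos += inc
        index.setdefault(pos, set()).add(term)
    return index


def phrase_matches(index, phrase):
    """True if the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    return any(
        all(t in index.get(start + i, ()) for i, t in enumerate(terms))
        for start in index
    )


index = positions(stacked_tokens("big houses burn"))
print(index)
print(phrase_matches(index, "big house burn"))   # phrase using the stemmed form
print(phrase_matches(index, "big houses burn"))  # phrase using the full form
```

Because the full and stemmed forms share a position, a phrase query matches with either form, at the cost of extra terms in the index.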

or you can accept the weirdness that comes from matching on a
stemmed form (potentially very confusing for a user).


Currently I am thinking about dropping the stemming and using only
prefix search. But highlighting does not work with a prefix query such
as "house*", which is a problem for me. The hint to use "house?*"
instead does not work here.

That's because wildcard queries are also not highlightable now. I actually have somewhat of a solution to this that I'll work on soon (I've gotten the groundwork for it in or ready to be in Lucene). No guarantee on when or if it will be accepted in Solr though.

In any case though, a query parser that supports fuzzy queries should not
be analyzing them. What parser are you using? If it is analyzing the fuzzy
syntax, it likely doesn't support it.


I am using the following definitions (testing it with and without stemming):

    <fieldType name="text_de_de" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
             enablePositionIncrements=true ensures that a 'gap' is left to
             allow for accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords_de_de.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
        <!-- <filter class="solr.ISOLatin1AccentFilterFactory"/> -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de_de.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de_de.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
        <!-- <filter class="solr.ISOLatin1AccentFilterFactory"/> -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


and, well, the parser? Where is the parser specified? Do you mean the
request handler "qt" (that will be "standard", as I do not set it yet)?

That's odd. I'll have to look at this more closely to be of help.

The prefix length determines how many terms are enumerated - with the


Can the prefix length be set in Solr? I could not find such an option.

I don't think there is an option in Solr. Patches welcome of course. It would be a nice one - using the default of 0 is *very* not scalable.
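
To see why the prefix length matters, here is a rough sketch. It is not Lucene's FuzzyQuery implementation: it assumes a fuzzy match is decided by plain Levenshtein distance with a hypothetical `max_edits` cutoff, and that only terms sharing the first `prefix_length` characters of the query are compared at all. The term list is made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_candidates(terms, query, prefix_length):
    """Terms that must be enumerated: those sharing the required prefix."""
    prefix = query[:prefix_length]
    return [t for t in terms if t.startswith(prefix)]


def fuzzy_search(terms, query, prefix_length=0, max_edits=1):
    """Return (matching terms, number of terms that had to be compared)."""
    candidates = fuzzy_candidates(terms, query, prefix_length)
    matches = [t for t in candidates if edit_distance(t, query) <= max_edits]
    return matches, len(candidates)


# Hypothetical set of unique terms in an index.
terms = ["haus", "hause", "house", "houses", "mouse", "maus", "baum", "auto"]
for n in (0, 2):
    matches, compared = fuzzy_search(terms, "house", prefix_length=n)
    print(f"prefix_length={n}: compared {compared} terms, matches {matches}")
```

With a prefix length of 0 every unique term is a candidate, so the work grows with the size of the term dictionary. A longer prefix cuts that down, but it also drops matches that differ within the prefix.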

The latest trunk build on Lucene will let us switch fuzzy query to use a
constant score mode - this will eliminate the booleanquery and should
perform much better on a large index. Solr already uses a constant score
mode for Prefix and Wildcard queries.


much better performance is always good. When will this feature be
available in Solr?

Soon I hope. Since wildcard and prefix are already constant score, it only makes sense to make fuzzy query that way as well.

How big is your index? If it's not that big, it may be odd that you're
seeing things that slow (number of unique terms in the index will play a
large role).


Well, the index currently contains about 5000 documents. These are
HTML-pages, some of them are concatenated with PDF/DOCs (Downloads
linked from the HTML-page) converted to text. The index data is about
11MB (optimized). So I think this is just a small index.

Yeah, sounds small. It's odd you would see such slow performance. It depends though. You may still have a *lot* of unique terms in there.
