Re: query with stemming, prefix and fuzzy?

Gert Brinkmann Fri, 30 Jan 2009 07:08:16 -0800

Thanks, Mark, for your answer,

Mark Miller wrote:
> Truncation queries and stemming are difficult partners. You likely have
> to accept compromise. You can try using multiple fields like you are,


I already have multiple fields, one per language, so that I can use
different stemmers. Wouldn't this become too much?

> you can try indexing the full term at the same position as the stemmed
> term,

What does "at the same position" mean, and how could I do this?
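
If I understand the suggestion correctly, it would mean emitting the stemmed token with a position increment of 0, so that the full term and its stem share one position. A minimal Python sketch of my understanding (this is not Lucene or Solr code; `toy_stem` and `analyze` are made-up names, and the stemmer is a crude stand-in for a real one):

```python
from typing import List, Tuple


def toy_stem(word: str) -> str:
    """Crude stand-in for a real stemmer (assumption): strip one trailing 's'."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def analyze(text: str) -> List[Tuple[str, int]]:
    """Return (term, position) pairs, stacking each stem on its full term."""
    tokens = []
    position = -1
    for word in text.lower().split():
        position += 1                        # increment 1: a new position
        tokens.append((word, position))      # the full, unstemmed term
        stem = toy_stem(word)
        if stem != word:
            tokens.append((stem, position))  # increment 0: same position
    return tokens


tokens = analyze("Two houses")
print(tokens)
```

With both forms in the index, a prefix query could still match the unstemmed term, while a stemmed query matches the stem at the same position, so phrase queries would keep working. Is that the idea?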

> or you can accept the weirdness that comes from matching on a
> stemmed form (potentially very confusing for a user).

I am currently thinking about dropping stemming and using only
prefix search. But highlighting does not work with a prefix query such
as "house*", which is a problem for me. The hint to use "house?*"
instead does not work here.
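
For reference, the two patterns differ only in whether the bare word itself matches. A small Python illustration, using `fnmatch` as a stand-in because it shares the `?` (exactly one character) and `*` (zero or more characters) semantics. The term list is made up, and the sketch says nothing about how Solr highlights:

```python
from fnmatch import fnmatchcase

# Made-up index terms, purely for illustration.
terms = ["house", "houses", "household", "mouse"]

# "house*" also matches the bare term "house".
plain_prefix = [t for t in terms if fnmatchcase(t, "house*")]

# "house?*" requires at least one more character after "house".
at_least_one_more = [t for t in terms if fnmatchcase(t, "house?*")]

print(plain_prefix)
print(at_least_one_more)
```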

> In any case though, a queryparser that supports fuzzyquery should not be
> analyzing it. What parser are you using? If it is analyzing the fuzzy
> syntax, it likely doesn't support it.

I am using the following definitions (testing it with and without stemming):
>     <fieldType name="text_de_de" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <!-- in this example, we will only use synonyms at query time
>         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>         -->
>         <!-- Case insensitive stop word removal.
>              enablePositionIncrements=true ensures that a 'gap' is left to
>              allow for accurate phrase queries.
>         -->
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="stopwords_de_de.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <!-- <filter class="solr.SnowballPorterFilterFactory" language="German"/> -->
>         <!-- <filter class="solr.ISOLatin1AccentFilterFactory"/> -->
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de_de.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de_de.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <!-- <filter class="solr.SnowballPorterFilterFactory" language="German"/> -->
>         <!-- <filter class="solr.ISOLatin1AccentFilterFactory"/> -->
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>

As for the parser: where is it specified? Do you mean the request
handler "qt" (which would be "standard", since I have not set it yet)?


> The prefix length determines how many terms are enumerated - with the

Can the prefix length be set in Solr? I could not find such an option.
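
As far as I understand it, the prefix length means that the first N characters must match exactly, so only terms sharing that prefix are compared at all. A rough Python sketch of that idea (not Lucene code; the function names, the term list and the `max_edits` cutoff are my own simplifications, since Lucene's fuzzy query uses a similarity threshold instead):

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_candidates(terms: List[str], query: str, prefix_length: int) -> List[str]:
    """Terms that must be enumerated: those sharing the exact prefix."""
    prefix = query[:prefix_length]
    return [t for t in terms if t.startswith(prefix)]


def fuzzy_match(terms: List[str], query: str, prefix_length: int,
                max_edits: int = 1) -> List[str]:
    """Candidates within max_edits of the query (simplified cutoff)."""
    return [t for t in fuzzy_candidates(terms, query, prefix_length)
            if edit_distance(t, query) <= max_edits]


# Made-up index terms, purely for illustration.
terms = ["haus", "hause", "house", "mouse", "horse", "hose"]

print(len(fuzzy_candidates(terms, "house", 0)))  # every term is compared
print(fuzzy_candidates(terms, "house", 2))       # only terms starting with "ho"
print(fuzzy_match(terms, "house", 0))
```

A longer prefix means fewer terms to compare, at the price of missing matches that differ in the first characters ("mouse" in this sketch).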

> The latest trunk build on Lucene will let us switch fuzzy query to use a
> constant score mode - this will eliminate the booleanquery and should
> perform much better on a large index. Solr already uses a constant score
> mode for Prefix and Wildcard queries.

Much better performance is always good. When will this feature be
available in Solr?

> How big is your index? If it's not that big, it may be odd that you're
> seeing things that slow (number of unique terms in the index will play a
> large role).

Well, the index currently contains about 5000 documents. These are
HTML pages, some of them concatenated with PDFs/DOCs (downloads
linked from the HTML page) converted to text. The index data is about
11 MB (optimized). So I think this is a rather small index.

Greetings,
Gert
