Thanks, Mark, for your answer.

Mark Miller wrote:
> Truncation queries and stemming are difficult partners. You likely have
> to accept compromise. You can try using multiple fields like you are,

I already have multiple fields, one per language, so that I can use different stemmers. Wouldn't this become too much?

> you can try indexing the full term at the same position as the stemmed
> term,

What does "at the same position" mean, and how could I do this?

> or you can accept the weirdness that comes from matching on a
> stemmed form (potentially very confusing for a user).

Currently I am thinking about dropping the stemming and using only prefix search. But highlighting does not work with a prefix query such as "house*", which is a problem for me. The hint to use "house?*" instead does not work here.

> In any case though, a queryparser that supports fuzzyquery should not be
> analyzing it. What parser are you using? If it is analyzing the fuzzy
> syntax, it likely doesn't support it.

I am using the following definitions (testing with and without stemming):

> <fieldType name="text_de_de" class="solr.TextField"
>   positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!-- in this example, we will only use synonyms at query time
>     <filter class="solr.SynonymFilterFactory"
>       synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>     -->
>     <!-- Case insensitive stop word removal.
>       enablePositionIncrements=true ensures that a 'gap' is left to
>       allow for accurate phrase queries.
>     -->
>     <filter class="solr.StopFilterFactory"
>       ignoreCase="true"
>       words="stopwords_de_de.txt"
>       enablePositionIncrements="true"
>     />
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>       generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
>       splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" />
>     -->
>     <!-- <filter class="solr.ISOLatin1AccentFilterFactory"/> -->
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory"
>       synonyms="synonyms_de_de.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>       words="stopwords_de_de.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>       generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
>       splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" />
>     -->
>     <!-- <filter class="solr.ISOLatin1AccentFilterFactory"/> -->
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>

And, well, the parser? Where is the parser specified? Do you mean the request handler "qt" (that would be "standard", as I have not set it yet)?

> The prefix length determines how many terms are enumerated - with the

Can the prefix length be set in Solr? I could not find such an option.

> The latest trunk build on Lucene will let us switch fuzzy query to use a
> constant score mode - this will eliminate the booleanquery and should
> perform much better on a large index. Solr already uses a constant score
> mode for Prefix and Wildcard queries.

Much better performance is always good. When will this feature be available in Solr?

> How big is your index?
> If it's not that big, it may be odd that you're
> seeing things that slow (number of unique terms in the index will play a
> large role).

Well, the index currently contains about 5000 documents. These are HTML pages, some of them concatenated with PDFs/DOCs (downloads linked from the HTML page) converted to text. The index data is about 11 MB (optimized). So I think this is just a small index.

Greetings,
Gert