Hello,
I am having a problem with Solr 1.4. I am indexing and querying data
using the following fieldType:
<fieldType name="text_de_de" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_de_de.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.LengthFilterFactory" min="2" max="200"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de_de.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_de_de.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.LengthFilterFactory" min="2" max="200"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
The application that uses Solr preprocesses the search string and
strips characters such as brackets and wildcards that could otherwise
produce invalid query syntax.
Each word is then searched for both as a whole word and as a prefix.
For example, the application converts "für solr" to
(für OR für*) AND (solr OR solr*)
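
To make the conversion concrete, here is a minimal sketch of what the
application does. It is an illustration only: the function name
build_query and the exact set of stripped characters are assumptions,
since the real list is longer ("etc.") and is not reproduced here.

```python
import re

# Assumed set of characters the application strips; the real list
# is not given in this post, so this is only a placeholder.
UNSAFE = re.compile(r'[()\[\]{}*?"]')


def build_query(user_input):
    """Turn raw user input into a 'word OR word*' query string.

    Hypothetical helper: every whitespace-separated word becomes
    "(word OR word*)" and the groups are joined with AND.
    """
    cleaned = UNSAFE.sub(" ", user_input)  # drop characters that break the syntax
    words = cleaned.split()
    return " AND ".join("({0} OR {0}*)".format(w) for w in words)


print(build_query("für solr"))
# prints: (für OR für*) AND (solr OR solr*)
```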
This works fine for normal words. But if the input contains a stopword,
like "für" in this example, Solr's stopword filter removes the plain
term from the query and leaves something like this:
(für*) AND (solr OR solr*)
The problem, I think, is that nothing in the index matches "für*" any
more, because "für" was removed by the stopword filter at index time as
well. So if someone copies and pastes a sentence from an indexed
document and that sentence contains a stopword, Solr will not find the
document.
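
The effect can be reproduced with a small toy model. This is not Solr
code, only a sketch under stated assumptions: analysis is reduced to
lowercasing plus stopword removal, the stopword list contains "für",
and the prefix clause bypasses the stopword filter (which is what the
filtered query above suggests). All names are made up for illustration.

```python
# Toy model of the stopword problem; not Solr code.
STOPWORDS = {"für"}  # assumed content of stopwords_de_de.txt


def index_terms(text):
    """Index-time analysis, reduced to lowercasing and stopword removal."""
    return {t for t in text.lower().split() if t not in STOPWORDS}


def analyze_query(words):
    """Build one OR group per word.

    The plain term is kept only if it survives the stopword filter;
    the prefix clause is assumed to bypass analysis and always stays.
    """
    groups = []
    for w in words:
        w = w.lower()
        clauses = [("prefix", w)]
        if w not in STOPWORDS:
            clauses.insert(0, ("term", w))
        groups.append(clauses)
    return groups


def matches(groups, terms):
    """AND across groups, OR within a group."""
    def clause_hits(kind, value):
        if kind == "term":
            return value in terms
        return any(t.startswith(value) for t in terms)

    return all(any(clause_hits(k, v) for k, v in g) for g in groups)


doc = index_terms("Suche für Solr")  # "für" never reaches the index
print(matches(analyze_query(["für", "solr"]), doc))    # False
print(matches(analyze_query(["suche", "solr"]), doc))  # True
```

The first query fails only because its "für" group has nothing left to
match against, even though the document contains both words.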
As far as I understand, enablePositionIncrements="true" only matters
for phrase queries, not for my case of "word OR word*" queries.
So, what should I do? Is there a better filter combination that I could
try? Or am I doing something conceptually wrong? The only solution I
have found that works is to not use stopword filtering at all.
Greetings,
Gert