Hello,

I am having some problems with Solr 1.4. I index and query data using the following fieldType:

    <fieldType name="text_de_de" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords_de_de.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LengthFilterFactory" min="2" max="200"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de_de.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords_de_de.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LengthFilterFactory" min="2" max="200"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

The application that uses Solr sanitizes the search string first, removing characters such as brackets and wildcards that could otherwise produce invalid query syntax.

Each word is searched both as a normal term and as a prefix. For example, "für solr" is converted by the application to
  (für OR für*) AND (solr OR solr*)
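
To make the conversion concrete, here is a minimal sketch in Python. It is not our real application code: the function name build_query and the list of stripped characters are made up for illustration, and the real sanitizing step may remove a different set of characters.

```python
import re

# Hypothetical set of characters to strip; the real application's list may differ.
UNSAFE_CHARS = re.compile(r'[()\[\]{}*?:"~^\\]')

def build_query(user_input: str) -> str:
    """Turn free text into (word OR word*) clauses joined by AND."""
    # Replace query-syntax characters with spaces, then split on whitespace.
    words = UNSAFE_CHARS.sub(" ", user_input).split()
    return " AND ".join(f"({w} OR {w}*)" for w in words)

print(build_query("für solr"))  # (für OR für*) AND (solr OR solr*)
```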

This conversion works fine for normal words. But if the query contains a stopword, like "für" in this example, Solr's stopword filter reduces it to something like this:
  (für*) AND (solr OR solr*)

The problem now (I think) is that "für*" no longer matches anything in the indexed data, because the stopword was filtered out at index time, too. So if someone copies and pastes a sentence containing a stopword from an indexed document, Solr will not find that document.
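
The effect I suspect can be reproduced with a small Python model. This is only a sketch under strong assumptions: the analysis is reduced to whitespace splitting, lowercasing and stopword removal (no stemming, synonyms or word delimiting), the stopword set is a stand-in for stopwords_de_de.txt, and the prefix term is assumed to bypass the analyzer, which is my understanding and not something I have verified in the Solr source. The function names are made up.

```python
STOPWORDS = {"für"}  # hypothetical stand-in for stopwords_de_de.txt

def index_tokens(text):
    """Simplified index-time analysis: whitespace split, lowercase, drop stopwords."""
    return {t for t in text.lower().split() if t not in STOPWORDS}

def matches(doc_text, query_words):
    """Evaluate (word OR word*) AND (word OR word*) ... against one document."""
    tokens = index_tokens(doc_text)
    for word in query_words:
        word = word.lower()
        # The plain term is analyzed, so a stopword clause simply disappears.
        term_hit = word not in STOPWORDS and word in tokens
        # Assumption: the prefix term is not analyzed and is looked up as typed.
        prefix_hit = any(t.startswith(word) for t in tokens)
        if not (term_hit or prefix_hit):
            return False
    return True

doc = "Suche für Solr"
print(matches(doc, ["solr"]))         # True
print(matches(doc, ["für", "solr"]))  # False: "für" was never indexed
```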

As far as I understand, enablePositionIncrements="true" only helps with phrase queries, not with my case of "word OR word*" queries.

So, what should I do? Is there a better filter combination that I could try? Or am I doing something wrong conceptually? The only working solution I have found is to not use stopword filtering at all.

Greetings,
Gert
