Hmmm, I don't really see the problem here. I'll have to use English
examples...

Searching on the* (assuming "the" is a stopword) will search on
(them OR theory OR thespian), assuming those three words are in
your index. It will NOT search on "the" itself. So I think you're OK,
or are you seeing anomalous results?

Conceptually, the underlying Lucene looks through your *existing* list of
terms for the field and assembles a clause containing the OR of all the
terms that match the wildcard. Since "the" isn't in the index, it doesn't
get included.
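
To make that concrete, here is a toy sketch of the expansion in Python. It
is not Lucene's actual code, just the idea under stated assumptions: the
term list and the names expand_prefix and rewrite_prefix_query are made up
for illustration.

```python
from bisect import bisect_left

# Hypothetical term dictionary for one field, after index-time analysis.
# "the" is absent because the stop filter dropped it before indexing.
INDEXED_TERMS = sorted(["solr", "them", "theory", "thespian", "zebra"])


def expand_prefix(prefix, terms=INDEXED_TERMS):
    """Return every indexed term that starts with prefix (terms must be sorted)."""
    start = bisect_left(terms, prefix)  # first term >= prefix
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches


def rewrite_prefix_query(prefix, terms=INDEXED_TERMS):
    """Render the expanded prefix query as an OR clause, for illustration only."""
    return "(" + " OR ".join(expand_prefix(prefix, terms)) + ")"


print(rewrite_prefix_query("the"))  # (them OR theory OR thespian)
```

The point is that the expansion only ever sees terms that made it into the
index, so a stopword can never show up in the rewritten clause.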

HTH
Erick

On Fri, May 28, 2010 at 11:25 AM, Gert Brinkmann <g...@netcologne.de> wrote:

>
> Hello,
>
> I am having some problems with solr 1.4. I am indexing and querying data
> using the following fieldType:
>
>     <fieldType name="text_de_de" class="solr.TextField"
>                positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="1" generateNumberParts="1" catenateWords="1"
>                 catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="stopwords_de_de.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LengthFilterFactory" min="2" max="200"/>
>         <filter class="solr.SnowballPorterFilterFactory" language="German"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory"
>                 synonyms="synonyms_de_de.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="1" generateNumberParts="1" catenateWords="0"
>                 catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="stopwords_de_de.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LengthFilterFactory" min="2" max="200"/>
>         <filter class="solr.SnowballPorterFilterFactory" language="German"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
> The application that uses Solr prepares the search string by filtering
> out some dangerous characters (brackets, wildcards, etc.) that
> otherwise might lead to invalid query syntax.
>
> All words are searched for as a normal word as well as a prefix. E.g.: "für
> solr" is converted by the application to
>  (für OR für*) AND (solr OR solr*)
>
> This works fine for normal words. But if you have a stopword like "für" in
> this example, the query will be stopword filtered by solr to something like
> this:
>  (für*) AND (solr OR solr*)
>
> The problem now is (as I think) that there is no "für*" anymore in the
> indexed data, because it was stopword filtered, too. If now someone
> copy&pastes a sentence from an indexed document that contains a stopword,
> this document will not be found by solr.
>
> The enablePositionIncrements="true" setting is (AFAIU) only for querying
> phrases, not for my case of "word OR word*" queries.
>
> So, what should I do? Is there a better filter combination that I could
> try? Or am I doing something wrong conceptually? The only solution that I
> have found working is to not use stopword filtering at all.
>
> Greetings,
> Gert
>
>
