Re: Searching for the '+' character

Erick Erickson Mon, 14 Sep 2009 09:45:29 -0700

Before you go too much further with this, I've just got to ask whether the
use case for searching "product+" really serves your customers.
If you mess around with analyzers to make things include the "+",
what does that mean for "&"? "*"? "."? any other weird character
you can think of?


Would it be a bad thing for "product" to match "product+" and vice
versa? Would it be more or less confusing for your users to have "product"
FAIL to match "product+"?

Of course only you really know your problem space, but think carefully
about this issue before you take on the work of making "product+" work
because it'll inevitably be waaaay more work than you think. Imagine the
bug reports when "product&" fails to match "product+", both of which
fail to match "product"....

I'd also get a copy of Luke and look at the index to be sure what you *think*
is in there is *actually* there. It'll also help you better understand what
analyzers do.

Don't forget that using different analyzers when indexing and querying will
lead to...er..."interesting" results.
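
To make that concrete, here is a toy sketch in Python. It is not Solr code:
index_analyzer and query_analyzer are made-up stand-ins for two analysis
chains that disagree about punctuation, and matches() is a hypothetical helper
that only checks whether every query term was indexed.

```python
import re


def index_analyzer(text):
    """Hypothetical index-time chain: lowercase, split on whitespace only."""
    return text.lower().split()


def query_analyzer(text):
    """Hypothetical query-time chain: lowercase, split on any non-alphanumeric run."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def matches(doc, query):
    """True if every analyzed query term appears among the indexed tokens."""
    indexed = set(index_analyzer(doc))
    return all(term in indexed for term in query_analyzer(query))


# The document that literally contains "product+" is indexed as the token
# "product+", but the query is analyzed to "product", so it is NOT found.
print(matches("New Product+ released", "product+"))

# The document without the "+" is found by the very same query.
print(matches("New Product released", "product+"))
```

The first call prints False and the second prints True, which is exactly the
kind of surprise you get when the two chains are not kept in step.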

Best
Erick

On Mon, Sep 14, 2009 at 11:38 AM, Paul Forsyth <p...@ez.no> wrote:

> Thanks Ahmet,
>
> That's excellent, thanks :) I may have to increase the gram size to take into
> account other possible uses, but I can now read around these filters to make
> the adjustments.
>
> With regard to WordDelimiterFilterFactory: is there a way to place a
> delimiter on this filter to still get most of its functionality without it
> absorbing the + signs? Will I lose a lot of 'good' functionality by
> removing it? 'preserveOriginal' sounds promising and seems to work, but is it
> a good idea to use this?
>
>
> On 14 Sep 2009, at 16:16, AHMET ARSLAN wrote:
>
>
>>
>> --- On Mon, 9/14/09, Paul Forsyth <p...@ez.no> wrote:
>>
>>  From: Paul Forsyth <p...@ez.no>
>>> Subject: Re: Searching for the '+' character
>>> To: solr-user@lucene.apache.org
>>> Date: Monday, September 14, 2009, 5:55 PM
>>> With words like 'product+' I'd expect
>>> a search for '+' to return results like any other character
>>> or word, so '+' would be found within 'product+' or similar
>>> text.
>>>
>>> I've tried removing the worddelimiter from the query
>>> analyzer, restarting and reindexing, but I get the same
>>> result. Nothing is found. I assume one of the filters could
>>> be adjusted to keep the '+'.
>>>
>>> Weird thing is that I tried to remove all filters from the
>>> analyzer and I get the same result.
>>>
>>> Paul
>>>
>>
>> When you remove all filters, '+' is kept, but '+' still won't match
>> 'product+', because you want to search inside a token.
>>
>> If the + sign is always at the end of your text, and you want to search
>> only on the last character of your text, EdgeNGramFilterFactory can do that
>> with the settings side="back" maxGramSize="1" minGramSize="1".
>>
>> The fieldType below will match '+' to 'product+':
>>
>> <fieldType name="textx" class="solr.TextField" positionIncrementGap="100">
>>     <analyzer type="index">
>>       <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.ISOLatin1AccentFilterFactory"/>
>>       <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>>       <filter class="solr.EdgeNGramFilterFactory" side="back" maxGramSize="1" minGramSize="1"/>
>>     </analyzer>
>>     <analyzer type="query">
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.ISOLatin1AccentFilterFactory"/>
>>       <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>>     </analyzer>
>>   </fieldType>
>>
>>
>> But this time 'product+' will be reduced to only '+'. You won't be able to
>> search for it in other ways, for example with product*. If you want to keep
>> the original word itself along with the last character, you can set
>> maxGramSize to 512. With that setting the token 'product+' will produce 8
>> tokens (and the query product* or product+ will return it):
>>
>> + word
>> t+ word
>> ct+ word
>> uct+ word
>> duct+ word
>> oduct+ word
>> roduct+ word
>> product+ word
>>
>> If the + sign can be anywhere inside the text, you can use NGramTokenFilter
>> (solr.NGramFilterFactory in schema.xml).
>> Hope this helps.
>>
>>
>>
>>
> Best regards,
>
> Paul Forsyth
>
> mail: p...@ez.no
> skype: paulforsyth
>
>

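For anyone reading this later, the gram behaviour described in the quoted
reply can be sketched in a few lines of Python. This is only an illustration
of the idea, not what Solr runs internally: back_edge_ngrams and ngrams are
hypothetical helper names, and the real filters also track token positions and
types, which are ignored here.

```python
def back_edge_ngrams(token, min_gram, max_gram):
    """Suffixes of `token`, from min_gram characters up to max_gram
    (capped at the token length), mimicking side="back" edge n-grams."""
    longest = min(max_gram, len(token))
    return [token[-n:] for n in range(min_gram, longest + 1)]


def ngrams(token, min_gram, max_gram):
    """All substrings of `token` with lengths min_gram..max_gram,
    mimicking a plain n-gram filter."""
    grams = []
    for n in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


# minGramSize=1, maxGramSize=1: only the last character survives.
print(back_edge_ngrams("product+", 1, 1))

# minGramSize=1, maxGramSize=512: every suffix up to the whole token,
# which is the 8-token list shown in the quoted reply.
print(back_edge_ngrams("product+", 1, 512))

# A '+' in the middle of a token is only reachable with plain n-grams.
print("+" in back_edge_ngrams("pro+duct", 1, 1))
print("+" in ngrams("pro+duct", 1, 2))
```

The trade-off is index size: with a large maximum gram size, every token of
length n adds n suffixes under the edge variant and far more under the plain
n-gram variant.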