Re: Searching for the '+' character

Paul Forsyth Mon, 14 Sep 2009 09:46:33 -0700

Hi Erick,

In this specific case my client does have a new product with a '+' at the end. It's just one of those odd ones!

Customers are expected to put + into the search box, so I have to have results to show.

I hear your concerns though. Originally I thought I would need to transform the + into something else, and do this back and forth to get a match!

Hopefully this will be a standard Solr install, but with this tweak for escaped chars....


Paul

On 14 Sep 2009, at 17:01, Erick Erickson wrote:

Before you go too much further with this, I've just got to ask whether the
use case for searching "product+" really serves your customers.
If you mess around with analyzers to make things include the "+",
what does that mean for "&"? "*"? "."? any other weird character
you can think of?

Would it be a bad thing for "product" to match "product+" and vice
versa? Would it be more or less confusing for your users to have "product"
FAIL to match "product+"?

Of course only you really know your problem space, but think carefully
about this issue before you take on the work of making "product+" work
because it'll inevitably be waaaay more work than you think. Imagine the
bug reports when "product&" fails to match "product+", both of which
fail to match "product"....

I'd also get a copy of Luke and look at the index to be sure what you
*think* is in there is *actually* there. It'll also help you understand
what analyzers do better.

Don't forget that using different analyzers when indexing and querying will
lead to...er..."interesting" results.

Best
Erick

On Mon, Sep 14, 2009 at 11:38 AM, Paul Forsyth <p...@ez.no> wrote:

Thanks Ahmet,

That's excellent, thanks :) I may have to increase the gram size to take into
account other possible uses, but I can now read around these filters to make
the adjustments.

With regard to WordDelimiterFilterFactory: is there a way to place a
delimiter on this filter to still get most of its functionality without it
absorbing the + signs? Will I lose a lot of 'good' functionality by
removing it? 'preserveOriginal' sounds promising and seems to work, but is it
a good idea to use this?
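
For reference, a hedged sketch of what enabling 'preserveOriginal' might look like on the filter. Only preserveOriginal comes from this thread; the other attribute values are illustrative assumptions. With preserveOriginal="1" the filter should emit the unsplit token (e.g. 'product+') alongside the parts it generates.

```xml
<!-- Sketch only: every attribute value except preserveOriginal is assumed -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        preserveOriginal="1"/>
```

This keeps 'product+' searchable as a whole token; it does not by itself make a lone '+' match inside 'product+'.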


On 14 Sep 2009, at 16:16, AHMET ARSLAN wrote:


--- On Mon, 9/14/09, Paul Forsyth <p...@ez.no> wrote:

From: Paul Forsyth <p...@ez.no>

Subject: Re: Searching for the '+' character
To: solr-user@lucene.apache.org
Date: Monday, September 14, 2009, 5:55 PM
With words like 'product+' i'd expect
a search for '+' to return results like any other character
or word, so '+' would be found within 'product+' or similar
text.

I've tried removing the worddelimiter from the query
analyzer, restarting and reindexing but i get the same
result. Nothing is found. I assume one of the filters could
be adjusted to keep the '+'.

Weird thing is that i tried to remove all filters from the
analyzer and i get the same result.

Paul


When you remove all filters '+' is kept, but '+' still won't match
'product+', because you want to search inside a token.

If the + sign is always at the end of your text, and you want to search
only the last character of your text, EdgeNGramFilterFactory can do that
with the settings side="back" maxGramSize="1" minGramSize="1".

The fieldType below will match '+' to 'product+'

<fieldType name="textx" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ISOLatin1AccentFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
     <!-- keep only the last character of each token, e.g. '+' from 'product+' -->
     <filter class="solr.EdgeNGramFilterFactory" side="back"
             maxGramSize="1" minGramSize="1"/>
   </analyzer>
   <!-- no n-gram filter at query time: the query term is matched as typed -->
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ISOLatin1AccentFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
   </analyzer>
 </fieldType>

But this time 'product+' will be reduced to only '+'. You won't be able to
search for it in other ways, for example product*. If, along with the last
character, you want to keep the original word itself, you can set maxGramSize
to 512. By doing this, the token 'product+' will produce 8 tokens (and the
queries product* or product+ will return it):

+ word
t+ word
ct+ word
uct+ word
duct+ word
oduct+ word
roduct+ word
product+ word
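
The index-side change described above might be sketched as follows. This is an illustration rather than a tested configuration: the fieldType name textx_suffix is made up, and the stemming and accent filters from the earlier fieldType are left out for brevity. The query analyzer deliberately has no n-gram filter, so the query term is compared against the indexed suffixes as typed.

```xml
<!-- Hypothetical fieldType; the name and the reduced filter chain are assumptions -->
<fieldType name="textx_suffix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index every suffix up to 512 characters: "+", "t+", ..., "product+" -->
    <filter class="solr.EdgeNGramFilterFactory" side="back"
            minGramSize="1" maxGramSize="512"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```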

If the + sign can be anywhere inside the text you can use NGramTokenFilter.
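
A rough sketch of that variant, assuming the solr.NGramFilterFactory filter with its minGramSize and maxGramSize attributes. The maxGramSize value of 15 is a placeholder chosen for illustration, and only the index analyzer is shown.

```xml
<!-- Hypothetical index analyzer; it belongs inside a <fieldType> element -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- minGramSize="1" so that a single "+" becomes a token of its own -->
  <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
</analyzer>
```

Indexing every substring makes the index noticeably larger, so it is probably worth keeping maxGramSize small.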

Hope this helps.

Best regards,

Paul Forsyth

mail: p...@ez.no
skype: paulforsyth


