--- On Mon, 9/14/09, Paul Forsyth <p...@ez.no> wrote:

> From: Paul Forsyth <p...@ez.no>
> Subject: Re: Searching for the '+' character
> To: solr-user@lucene.apache.org
> Date: Monday, September 14, 2009, 5:55 PM
> With words like 'product+' i'd expect
> a search for '+' to return results like any other character
> or word, so '+' would be found within 'product+' or similar
> text.
> 
> I've tried removing the worddelimiter from the query
> analyzer, restarting and reindexing but i get the same
> result. Nothing is found. I assume one of the filters could
> be adjusted to keep the '+'.
> 
> Weird thing is that i tried to remove all filters from the
> analyzer and i get the same result.
> 
> Paul

When you remove all filters, '+' is kept, but '+' still won't match 'product+',
because you are trying to match part of a token and a term query only matches
whole tokens.

If the + sign is always at the end of your text, and you only want to search
on the last character of the text, EdgeNGramFilterFactory can do that with the
settings side="back" maxGramSize="1" minGramSize="1".

The fieldType below will match '+' to 'product+':

<fieldType name="textx" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.EdgeNGramFilterFactory" side="back" maxGramSize="1" minGramSize="1"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>


Note that with these settings 'product+' is reduced to just '+', so you won't
be able to find it any other way, for example with product*. If you want to
keep the original word along with its last character, you can set maxGramSize
to 512. The token 'product+' will then produce 8 tokens (and a query for
product* or product+ will return it):

+ word
t+ word
ct+ word
uct+ word
duct+ word
oduct+ word
roduct+ word
product+ word
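
As a sketch, the only change needed in the index analyzer above would be the
gram sizes on the last filter (512 is just a ceiling comfortably larger than
any expected token, not a required value):

```xml
<!-- index analyzer: keep every suffix of the token, from the last character up to the whole word -->
<filter class="solr.EdgeNGramFilterFactory" side="back" minGramSize="1" maxGramSize="512"/>
```

Reindex after changing it, since the grams are produced at index time.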

If the + sign can appear anywhere inside the text, you can use NGramTokenFilter
(solr.NGramFilterFactory in schema.xml).
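
A rough, untested sketch of such a field type, assuming the same whitespace
tokenizer as above. The name textngram and the gram sizes are placeholders
picked for illustration, not recommended values:

```xml
<fieldType name="textngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits every substring of 1 to 15 characters, so '+' is indexed wherever it occurs -->
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keep in mind that this produces many more tokens per word than the edge
version, so the index grows accordingly.
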
Hope this helps. 


      
