Hi Erick,
In this specific case my client does have a new product with a '+' at
the end. It's just one of those odd ones!
Customers are expected to put + into the search box, so I have to have
results to show.
I hear your concerns though. Originally I thought I would need to
transform the + into something else, and do this back and forth to
get a match!
Hopefully this will be a standard Solr install, but with this tweak
for escaped chars....
Paul
On 14 Sep 2009, at 17:01, Erick Erickson wrote:
Before you go too much further with this, I've just got to ask whether
the use case for searching "product+" really serves your customers.
If you mess around with analyzers to make things include the "+",
what does that mean for "&"? "*"? "."? any other weird character
you can think of?
Would it be a bad thing for "product" to match "product+" and vice
versa? Would it be more or less confusing for your users to have
"product" FAIL to match "product+"?
Of course only you really know your problem space, but think carefully
about this issue before you take on the work of making "product+" work
because it'll inevitably be waaaay more work than you think. Imagine the
bug reports when "product&" fails to match "product+", both of which
fail to match "product"....
I'd also get a copy of Luke and look at the index to be sure what you
*think* is in there is *actually* there. It'll also help you understand
better what analyzers do.
Don't forget that using different analyzers when indexing and querying
will lead to...er..."interesting" results.
Best
Erick
On Mon, Sep 14, 2009 at 11:38 AM, Paul Forsyth <p...@ez.no> wrote:
Thanks Ahmet,
That's excellent, thanks :) I may have to increase the gram size to take
into account other possible uses, but I can now read around these filters
to make the adjustments.
With regard to WordDelimiterFilterFactory: is there a way to configure
this filter so that I still get most of its functionality without it
absorbing the + signs? Will I lose a lot of 'good' functionality by
removing it? 'preserveOriginal' sounds promising and seems to work, but is
it a good idea to use this?
On 14 Sep 2009, at 16:16, AHMET ARSLAN wrote:
--- On Mon, 9/14/09, Paul Forsyth <p...@ez.no> wrote:
From: Paul Forsyth <p...@ez.no>
Subject: Re: Searching for the '+' character
To: solr-user@lucene.apache.org
Date: Monday, September 14, 2009, 5:55 PM
With words like 'product+' I'd expect
a search for '+' to return results like any other character
or word, so '+' would be found within 'product+' or similar
text.
I've tried removing the word delimiter filter from the query
analyzer, restarting and reindexing, but I get the same
result. Nothing is found. I assume one of the filters could
be adjusted to keep the '+'.
Weird thing is that I tried to remove all filters from the
analyzer and I get the same result.
Paul
When you remove all filters the '+' is kept, but '+' still won't match
'product+', because you are trying to search inside a token.
If the + sign is always at the end of your text, and you only want to
search on the last character of your text, EdgeNGramFilterFactory can do
that with the settings side="back" maxGramSize="1" minGramSize="1".
The fieldType below will match '+' to 'product+':
<fieldType name="textx" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <!-- index only the last character of each token -->
    <filter class="solr.EdgeNGramFilterFactory" side="back"
            maxGramSize="1" minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
But this time 'product+' will be reduced to only '+'. You won't be able to
search for it in other ways, for example product*. If you want to keep the
original word itself along with the last character, you can set
maxGramSize to 512. With that setting the token 'product+' will produce
8 tokens (and the queries product* or product+ will return it):
+ word
t+ word
ct+ word
uct+ word
duct+ word
oduct+ word
roduct+ word
product+ word
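To make the token list above easier to reason about, here is a rough Python sketch of what a back-edge n-gram filter does to a single token. It is only an illustration, not Solr's code: the function name back_edge_ngrams and its parameters are made up for this example, and it assumes the filter simply emits every suffix whose length lies between minGramSize and maxGramSize.

```python
def back_edge_ngrams(token, min_gram=1, max_gram=512):
    """Return the suffixes of `token` with length min_gram..max_gram.

    Hypothetical helper that mimics side="back" edge n-grams;
    it is not part of Solr or Lucene.
    """
    longest = min(max_gram, len(token))
    return [token[-size:] for size in range(min_gram, longest + 1)]


# maxGramSize="1": only the last character survives
print(back_edge_ngrams("product+", 1, 1))

# maxGramSize="512": every suffix up to the whole token is kept
for gram in back_edge_ngrams("product+", 1, 512):
    print(gram)
```

With maxGramSize="1" the only indexed term is '+', which is why product* stops matching; with 512 the full token 'product+' is indexed as well.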
If the + sign can be anywhere inside the text, you can use
NGramTokenFilter (solr.NGramFilterFactory in schema.xml).
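As a rough illustration of why plain n-grams cover a '+' in any position, the Python sketch below lists every substring of a token within a size range. Again this is an assumption-laden sketch, not Solr's implementation: the name ngrams, the sample token 'pro+duct', and the order in which the grams are produced are all chosen just for this example.

```python
def ngrams(token, min_gram=1, max_gram=2):
    """Return all substrings of `token` with length min_gram..max_gram.

    Hypothetical helper for illustration; not part of Solr or Lucene.
    """
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


# The '+' becomes a term of its own wherever it sits in the token
print(ngrams("pro+duct", 1, 1))
print("+" in ngrams("pro+duct", 1, 1))
```

The price is a larger index, since every token is expanded into many more terms than with the edge variant.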
Hope this helps.
Best regards,
Paul Forsyth
mail: p...@ez.no
skype: paulforsyth