Re: SnowballPorterFilterFactory and protected words

Hausherr, Jens Mon, 28 Jan 2008 04:23:22 -0800

Hi Daniele,

I do not know if the Italian analyzer also has this feature, but the German 
analyzer supports a list of words that are not stemmed, kept in a file protwords.txt.

Stopwords are probably the wrong direction because that would remove the 
configured words from indexing altogether...
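
To make the difference concrete, here is a minimal sketch in plain Python. It is not Solr or Snowball code: toy_stem is a made-up stand-in for a real stemmer, and analyze, its stopwords and protected arguments are hypothetical names used only for illustration. It uses the "sole"/"solo" pair from Daniele's message quoted below.

```python
# Toy analysis chain for illustration only; NOT Solr or Snowball code.

def toy_stem(token):
    """Crude stand-in for a stemmer: strip one trailing vowel (assumption)."""
    if len(token) > 3 and token[-1] in "aeio":
        return token[:-1]
    return token

def analyze(tokens, stopwords=frozenset(), protected=frozenset()):
    """Return the terms that would be indexed for the given tokens."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in stopwords:
            continue          # dropped: the term never reaches the index
        if tok in protected:
            out.append(tok)   # kept exactly as written, not stemmed
        else:
            out.append(toy_stem(tok))
    return out

tokens = ["sole", "solo"]
print(analyze(tokens))                       # both collapse to the same stem
print(analyze(tokens, protected={"sole"}))   # "sole" survives unstemmed
print(analyze(tokens, stopwords={"sole"}))   # "sole" is gone entirely
```

With the protected set, "sole" stays searchable as its own term; with the stopword set it can no longer be found at all.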

@hoss: recall means the percentage of documents the search engine retrieves 
from a defined set of documents when compared to the same query "executed" by a 
human expert. So everybody loves high recall - everybody dislikes false hits :) 
(which the default stemming cited by Daniele results in)
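
As a small worked example, assuming the usual IR definitions (recall = relevant documents retrieved / all relevant documents, precision = relevant documents retrieved / all retrieved documents), the sketch below shows how false hits affect the numbers. The function name and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: the expert judged 1-4 relevant; the engine returned
# three of them plus three false hits (7, 8, 9).
relevant = {1, 2, 3, 4}
retrieved = {1, 2, 3, 7, 8, 9}
print(precision_recall(retrieved, relevant))
```

Under these definitions the false hits lower precision, while recall only depends on how many of the expert's documents were found.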

Regards,
Jens

-----Original Message-----
From: Daniele Salvatico [mailto:[EMAIL PROTECTED] 
Sent: Monday, 28 January 2008 11:35
To: solr-user@lucene.apache.org
Subject: Re: SnowballPorterFilterFactory and protected words

hossman wrote:
> 
> 
> I'm not sure i understand what you mean.  Why would protected words 
> (in regards to the stemmer) reduce recall ? ... i guess it depends on 
> the words you are protecting right ... but why would you want to 
> reduce recall?  isn't the goal usually to increase recall while 
> keeping precision high?
> 
> (disclaimer: i'm not very smart when it comes to theoretical IR, i'm 
> more of a hands on "practicalist" .. i try stuff, i draw on past 
> experience to decide if it's "better" and then i deploy 
> it and if my user satisfaction numbers go down i roll back.)
> 
> : It could be that a parallel approach using dismax boosting for fields such
> : as "product name" and "category" will, besides increasing precision, also
> : reduce false hit recall?
> 
> Hmmm... i think it's safe to say that intelligent choice of qf, pf, 
> bf, and bq values (based on inherent knowledge of the corpus) can 
> increase precision; but unless you use prohibitive fq clauses, i don't 
> know that you will actually be reducing your false hit rate ... you're 
> just making their scores very small relative to the top scoring docs.  a 
> strict "mm" is your best bet for reducing the number of "false hits" 
> (because things that don't match "enough" of the input terms will be 
> weeded out)
> 
> 
> 
> 
> -Hoss
> 
> 
> 

Yes, i'd like to protect some words (at least among the most queried) from 
being stemmed, but since this requires some custom work at the Lucene Java class 
level (as said before for Italian), i was looking for possible alternative 
approaches.

A practical example: the word "sole" ("sun", but also "lonely") and the word 
"solo" ("only", but also "alone") both stem to "sol". I'd like to protect "sole" 
from being stemmed. Maybe a solution would be to add the latter to 
stopwords.txt?

I think i still have to tune and play with all the dismax parameters and set 
the "mm" as strict as possible, as you said. I'm trying to find a good 
balance of boosting options.
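
A rough sketch of why a stricter "mm" weeds out false hits, in plain Python rather than Solr: matches_mm, the percentage handling (rounded down, with a floor of one term) and the sample terms are all assumptions made for illustration, not the actual dismax implementation.

```python
import math

def matches_mm(query_terms, doc_terms, mm_percent):
    """True if at least mm_percent of the distinct query terms occur in the doc."""
    query_terms = set(query_terms)
    # Rounding down and requiring at least one term are choices of this toy.
    required = max(math.floor(len(query_terms) * mm_percent / 100), 1)
    matched = len(query_terms & set(doc_terms))
    return matched >= required

query = ["crema", "sole", "viso"]
good_doc = ["crema", "sole", "viso", "spf"]   # matches all three terms
weak_doc = ["sole"]                           # matches only one term

for mm in (66, 67, 100):
    print(mm, matches_mm(query, good_doc, mm), matches_mm(query, weak_doc, mm))
```

With a loose setting the one-term document still matches (and is only pushed down by scoring); once two or more terms are required it drops out of the result set.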

Anyway in most of the cases Solr works very well.

I'll let u know!

--
View this message in context: 
http://www.nabble.com/SnowballPorterFilterFactory-and-protected-words-tp15042758p15132455.html
Sent from the Solr - User mailing list archive at Nabble.com.
