Hi Daniele,

I do not know whether the Italian analyzer also has this feature, but the German analyzer supports a list of words that are not stemmed, kept in a file called protwords.txt.
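To illustrate the protected-words idea, here is a minimal Python sketch. It is not the Snowball Italian algorithm and not Solr code: `toy_stem` is a made-up stand-in that only strips one trailing vowel, and `protected_words` is a hypothetical stand-in for the contents of protwords.txt. It uses the "sole"/"solo" example from the quoted message below.

```python
# Toy illustration only: NOT the Snowball Italian stemmer and NOT Solr code.
VOWELS = "aeiou"


def toy_stem(word):
    """Strip one trailing vowel; a crude stand-in for a real stemmer."""
    if len(word) > 3 and word[-1] in VOWELS:
        return word[:-1]
    return word


def stem_with_protection(word, protected):
    """Return the lowercased word unchanged if it is protected, else stem it."""
    lowered = word.lower()
    if lowered in protected:
        return lowered
    return toy_stem(lowered)


# Hypothetical contents of a protected-words file such as protwords.txt.
protected_words = {"sole"}

for term in ("sole", "solo"):
    print(term, "->", stem_with_protection(term, protected_words))
```

With "sole" in the protected set it is indexed as-is, while "solo" is still reduced by the toy stemmer, so the two words no longer collapse to the same term.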
Stopwords are probably the wrong direction, because that would remove the configured words from the index altogether...

@hoss: recall means the percentage of documents the search engine retrieves from a defined set of documents, compared with the same query "executed" by a human expert. So everybody loves high recall, and everybody dislikes false hits :) (which the default stemming cited by Daniele results in).

Regards,
Jens

-----Original Message-----
From: Daniele Salvatico [mailto:[EMAIL PROTECTED]
Sent: Monday, 28 January 2008 11:35
To: solr-user@lucene.apache.org
Subject: Re: SnowballPorterFilterFactory and protected words

hossman wrote:
>
> I'm not sure I understand what you mean. Why would protected words
> (in regards to the stemmer) reduce recall? ... I guess it depends on
> the words you are protecting, right ... but why would you want to
> reduce recall? Isn't the goal usually to increase recall while
> keeping precision high?
>
> (disclaimer: I'm not very smart when it comes to theoretical IR, I'm
> more of a hands-on "practicalist" ... I try stuff, I draw on past
> experience to decide if it's "better", and then I deploy it, and if
> my user satisfaction numbers go down I roll back.)
>
> : It could be that a parallel approach using dismax boosting for fields such
> : as "product name" and "category" will, besides increasing precision, also
> : reduce false hit recall?
>
> Hmmm... I think it's safe to say that an intelligent choice of qf, pf,
> bf, and bq values (based on inherent knowledge of the corpus) can
> increase precision; but unless you use prohibitive fq clauses, I don't
> know that you will actually be reducing your false hit rate ... you're
> just making their scores very small relative to the top scoring docs.
> A strict "mm" is your best bet for reducing the number of "false hits"
> (because things that don't match "enough" of the input terms will be
> weeded out).
>
> -Hoss
>

Yes, I'd like to protect some words (at least among the most queried) from being stemmed, but since this requires some custom work at the Lucene Java class level (as said before for Italian), I was looking for possible alternative approaches.

A practical example: the word "sole" ("sun", but also "lonely") and the word "solo" ("only", but also "alone") both stem to "sol". I'd like to protect "sole" from being stemmed. Maybe a solution would be to add the latter to stopwords.txt?

I think I still have to tune and play with all the dismax parameters and set "mm" as strict as possible, as you said. I'm trying to find a good balance of boosting options.

Anyway, in most cases Solr works very well. I'll let you know!

--
View this message in context: http://www.nabble.com/SnowballPorterFilterFactory-and-protected-words-tp15042758p15132455.html
Sent from the Solr - User mailing list archive at Nabble.com.
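
For readers following the thread, the recall and precision terms discussed above can be made concrete with a small Python sketch. The document IDs are invented for illustration; `relevant` stands for the set a human expert would pick for a query, and `retrieved` for what the search engine actually returns.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two sets of doc IDs."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # documents that are both returned and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments for a single query.
relevant = {"d1", "d2", "d3", "d4"}   # chosen by the human expert
retrieved = {"d1", "d2", "d5"}        # returned by the engine; "d5" is a false hit

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case the engine finds half of the relevant documents, and one of its three results is a false hit; over-aggressive stemming tends to add more entries like "d5" to the retrieved set.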