Hi Wunder,

Thanks for your reply!
I take your point. It has to be appropriate to your content... In the cases I deal with, using stop words wouldn't be a big deal because the documents we handle are usually proper articles (although titles could still be affected).

I based my stop words on the most frequent terms I could find in my index after I indexed my whole database. I'm sure that list will change over time, but idf would deal with the rest.

I'm inclined to keep it like this, but some tests and real query analysis would be good. I will let you know if any interesting patterns emerge.

Cheers,
Daniel

-----Original Message-----
From: Walter Underwood [mailto:wunderw...@netflix.com]
Sent: 16 July 2009 17:15
To: solr-user@lucene.apache.org
Subject: Re: Word frequency count in the index

I haven't researched old versions of Lucene, but I think it has always been a vector space, tf.idf engine. I don't see any hint of probabilistic scoring.

A bit of background about stop words and idf. They are two versions of the same thing.

Stop words are a manual, on/off decision about what words are important. That decision is high risk and easy to get wrong. We have a movie titled "To be and to have". Oops.

Inverse document frequency (idf) replaces that on/off control with a proportional weight calculated from the index. For Netflix, that means that "weeds: season 2" has a high weight for "weeds" and lower weights for "season" and "2".

In my control theory course, my professor told me to only use proportional control when on/off didn't work. Well, stop words don't work and idf does.

For a longer list of movie titles entirely made of stop words, go here:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

wunder

On 7/16/09 8:50 AM, "Daniel Alheiros" <daniel.alhei...@bbc.co.uk> wrote:

> Hi Walter,
>
> Has it always been there? Which version of Lucene are we talking about?
>
> Regards,
> Daniel
>
> -----Original Message-----
> From: Walter Underwood [mailto:wunderw...@netflix.com]
> Sent: 16 July 2009 15:04
> To: solr-user@lucene.apache.org
> Subject: Re: Word frequency count in the index
>
> Lucene uses a tf.idf relevance formula, so it automatically finds
> common words (stop words) in your documents and gives them lower
> weight. I recommend not removing stop words at all and letting Lucene
> handle the weighting.
>
> wunder
>
> On 7/16/09 3:29 AM, "Pooja Verlani" <pooja.verl...@gmail.com> wrote:
>
>> Hi,
>>
>> Is there any way in SOLR to know the count of each word indexed in
>> Solr?
>> I want to find out the different word frequencies to figure out
>> 'application specific stop words'.
>>
>> Please let me know if it's possible.
>>
>> Thank you,
>> Regards,
>> Pooja
>
>
> http://www.bbc.co.uk/
> This e-mail (and any attachments) is confidential and may contain
> personal views which are not the views of the BBC unless specifically stated.
> If you have received it in error, please delete it from your system.
> Do not use, copy or disclose the information in any way nor act in
> reliance on it and notify the sender immediately.
> Please note that the BBC monitors e-mails sent or received.
> Further communication will signify your consent to this.
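
To make the stop words versus idf point in this thread concrete, here is a minimal sketch in Python. It is not Lucene code and does not query Solr. The toy titles, the function names and the smoothed formula 1 + ln(N / (df + 1)) are all assumptions chosen for illustration. It counts document frequencies (the "most frequent terms" a hand-made stop list would be built from) and shows how a proportional idf weight treats the same terms.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for text in docs:
        # set() so a term repeated inside one document is counted once
        df.update(set(text.lower().split()))
    return df


def idf(doc_freq, num_docs):
    """Smoothed inverse document frequency (assumed form): 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical toy corpus, not real catalogue data.
titles = [
    "weeds season 1",
    "weeds season 2",
    "lost season 1",
    "lost season 2",
    "heroes season 2",
    "to be and to have",
]

df = document_frequencies(titles)
weights = {term: idf(count, len(titles)) for term, count in df.items()}

# Most frequent terms first: these are the usual stop-word candidates.
for term, count in df.most_common():
    print(f"{term:8s} df={count} idf={weights[term]:.3f}")
```

In this toy corpus "season" appears in five of six titles and gets the lowest weight, "2" sits in the middle, and "weeds" scores higher, which matches the "weeds: season 2" example above. Nothing is dropped outright, so a title such as "to be and to have" stays searchable.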