I haven't researched old versions of Lucene, but I think it has always been a vector space, tf.idf engine. I don't see any hint of probabilistic scoring.
A bit of background about stop words and idf. They are two versions of the same thing.

Stop words are a manual, on/off decision about what words are important. That decision is high risk and easy to get wrong. We have a movie titled "To be and to have". Oops.

Inverse document frequency (idf) replaces that on/off control with a proportional weight calculated from the index. For Netflix, that means that "weeds: season 2" has a high weight for "weeds" and lower weights for "season" and "2".

In my control theory course, my professor told me to only use proportional control when on/off didn't work. Well, stop words don't work and idf does.

For a longer list of movie titles entirely made of stop words, go here:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

wunder

On 7/16/09 8:50 AM, "Daniel Alheiros" <daniel.alhei...@bbc.co.uk> wrote:

> Hi Walter,
>
> Has it always been there? Which version of Lucene are we talking about?
>
> Regards,
> Daniel
>
> -----Original Message-----
> From: Walter Underwood [mailto:wunderw...@netflix.com]
> Sent: 16 July 2009 15:04
> To: solr-user@lucene.apache.org
> Subject: Re: Word frequency count in the index
>
> Lucene uses a tf.idf relevance formula, so it automatically finds common
> words (stop words) in your documents and gives them lower weight. I
> recommend not removing stop words at all and letting Lucene handle the
> weighting.
>
> wunder
>
> On 7/16/09 3:29 AM, "Pooja Verlani" <pooja.verl...@gmail.com> wrote:
>
>> Hi,
>>
>> Is there any way in SOLR to know the count of each word indexed in the
>> solr ?
>> I want to find out the different word frequencies to figure out
>> 'application specific stop words'.
>>
>> Please let me know if its possible.
>>
>> Thank you,
>> Regards,
>> Pooja
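
P.S. To make the idf point concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not Lucene's actual scoring code: the title list is a made-up mini-catalog, the helper names (`tokenize`, `document_frequencies`, `idf_weights`) are hypothetical, and the weight is the textbook log(N/df) form rather than the exact formula Lucene uses.

```python
import math
from collections import Counter

# Hypothetical mini-catalog of titles, invented for illustration only.
TITLES = [
    "weeds: season 1",
    "weeds: season 2",
    "lost: season 1",
    "lost: season 2",
    "heroes: season 1",
    "heroes: season 2",
    "to be and to have",
]


def tokenize(text):
    """Lowercase, drop colons, and split on whitespace."""
    return text.lower().replace(":", " ").split()


def document_frequencies(docs):
    """Count how many documents each term appears in (not total occurrences)."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def idf_weights(docs):
    """Textbook idf: log(N / df). Terms found in many documents get small weights."""
    n = len(docs)
    return {term: math.log(n / count)
            for term, count in document_frequencies(docs).items()}


weights = idf_weights(TITLES)

# Print the proportional weight of each term in the example query.
for term in tokenize("weeds: season 2"):
    print(f"{term:8s} {weights[term]:.3f}")
```

In this toy catalog "season" appears in almost every title, so its weight is close to zero, while "weeds" appears in only two and keeps a much higher weight. Nothing is switched off by hand, which is the difference from a stop word list.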