I haven't researched old versions of Lucene, but I think it has always been
a vector space, tf.idf engine. I don't see any hint of probabilistic
scoring.

A bit of background about stop words and idf. They are two answers to the
same question: how much should common words count when matching?

Stop words are a manual, on/off decision about what words are important.
That decision is high risk and easy to get wrong. We have a movie titled
"To be and to have". Oops.

Inverse document frequency (idf) replaces that on/off control with a
proportional weight calculated from the index. For Netflix, that means
that "weeds: season 2" has a high weight for "weeds" and lower weights
for "season" and "2".
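
A rough sketch of that idea in Python may help. It uses the textbook
idf = ln(N / df), which is not necessarily the exact formula Lucene
applies, and the tiny catalog and function names are made up purely for
illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy catalog; these titles are illustrative, not real index data.
TITLES = [
    "weeds: season 2",
    "heroes: season 2",
    "lost: season 1",
    "lost: season 2",
    "to be and to have",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def idf_weights(docs):
    """Return textbook idf, ln(N / df), for every term in the collection.

    Assumption: this is the plain textbook formula, not Lucene's own.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


idf = idf_weights(TITLES)
for term in tokenize("weeds: season 2"):
    print(f"{term}: {idf[term]:.3f}")
```

In this toy catalog "weeds" appears in one title and gets the largest
weight, while "season" and "2" appear in most titles and get small ones.
Nothing is switched off, so a title made only of common words still
matches.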

In my control theory course, my professor told me to only use proportional
control when on/off didn't work. Well, stop words don't work and idf does.

For a longer list of movie titles entirely made of stop words, go here:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

wunder

On 7/16/09 8:50 AM, "Daniel Alheiros" <daniel.alhei...@bbc.co.uk> wrote:

> Hi Walter,
> 
> Has it always been there? Which version of Lucene are we talking about?
> 
> Regards,
> Daniel
> 
> -----Original Message-----
> From: Walter Underwood [mailto:wunderw...@netflix.com]
> Sent: 16 July 2009 15:04
> To: solr-user@lucene.apache.org
> Subject: Re: Word frequency count in the index
> 
> Lucene uses a tf.idf relevance formula, so it automatically finds common
> words (stop words) in your documents and gives them lower weight. I
> recommend not removing stop words at all and letting Lucene handle the
> weighting.
> 
> wunder
> 
> On 7/16/09 3:29 AM, "Pooja Verlani" <pooja.verl...@gmail.com> wrote:
> 
>> Hi,
>> 
>> Is there any way in SOLR to know the count of each word indexed in
>> Solr?
>> I want to find out the different word frequencies to figure out
>> 'application specific stop words'.
>> 
>> Please let me know if it's possible.
>> 
>> Thank you,
>> Regards,
>> Pooja
> 
> 
