RE: Word frequency count in the index

Daniel Alheiros Mon, 20 Jul 2009 06:45:23 -0700

Hi Wunder,

Thanks for your reply!

I take your point. It has to be appropriate to your content... In the
cases I deal with, using stop words wouldn't be a big deal because the
documents we handle are usually proper articles (although titles could
still be affected).

I based my stop words on the most frequent terms I could find in my
index when I indexed my whole database. I'm sure the list will change
over time, but idf would deal with the rest. I'm inclined to keep it
like this, but some tests and real query analysis would be good. I will
let you know if any interesting patterns emerge.
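
For anyone who wants to try the same approach, here is a rough sketch in
Python of picking stop word candidates by document frequency. It assumes
the documents are available as plain strings and uses naive whitespace
tokenisation. The function name and the sample texts are made up for
illustration, and this is not what Solr or Lucene does internally.

```python
from collections import Counter


def stop_word_candidates(docs, top_n=2):
    """Return the top_n terms by document frequency.

    Each result is (term, fraction of documents containing the term).
    Hypothetical helper for illustration only.
    """
    doc_freq = Counter()
    for text in docs:
        # Use a set so a term counts once per document, however often
        # it repeats inside that document.
        doc_freq.update(set(text.lower().split()))
    n_docs = len(docs)
    return [(term, count / n_docs)
            for term, count in doc_freq.most_common(top_n)]


# Made-up sample texts standing in for indexed articles.
sample_docs = [
    "the history of the bbc",
    "the future of radio",
    "the rise of online news",
    "the weather today",
]

candidates = stop_word_candidates(sample_docs)
for term, fraction in candidates:
    print(term, fraction)
```

The fraction makes it easier to pick a cut-off by eye, and the list is
only a starting point that still needs checking against real queries.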

Cheers,
Daniel 

-----Original Message-----
From: Walter Underwood [mailto:wunderw...@netflix.com] 
Sent: 16 July 2009 17:15
To: solr-user@lucene.apache.org
Subject: Re: Word frequency count in the index

I haven't researched old versions of Lucene, but I think it has always
been a vector space, tf.idf engine. I don't see any hint of
probabilistic scoring.

A bit of background about stop words and idf. They are two versions of
the same thing.

Stop words are a manual, on/off decision about what words are important.
That decision is high risk and easy to get wrong. We have a movie titled
"To be and to have". Oops.

Inverse document frequency (idf) replaces that on/off control with a
proportional weight calculated from the index. For Netflix, that means
that "weeds: season 2" has a high weight for "weeds" and lower weights
for "season" and "2".
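
As a rough illustration of that proportional weight, here is a small
Python sketch using the textbook form idf = log(N / df). Lucene's own
formula may differ in its details, so treat this as an assumption and
not as its implementation. The titles and the function name are made up
for the example.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return {term: idf} using the textbook form log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs:
        # Count each term once per document.
        doc_freq.update(set(text.lower().split()))
    return {term: math.log(n_docs / count)
            for term, count in doc_freq.items()}


# Made-up titles; "season" appears in every one, "weeds" in only two.
titles = [
    "weeds season 1",
    "weeds season 2",
    "lost season 1",
    "lost season 2",
    "heroes season 2",
    "the wire season 2",
]

weights = idf_weights(titles)
for term in ("weeds", "season", "2"):
    print(term, round(weights[term], 3))
```

With these titles the rarer term "weeds" gets a higher weight than "2",
and "season", which is in every title, gets a weight of zero without
anyone having to put it on a stop word list.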

In my control theory course, my professor told me to only use
proportional control when on/off didn't work. Well, stop words don't
work and idf does.

For a longer list of movie titles entirely made of stop words, go here:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

wunder

On 7/16/09 8:50 AM, "Daniel Alheiros" <daniel.alhei...@bbc.co.uk> wrote:

> Hi Walter,
> 
> Has it always been there? Which version of Lucene are we talking about?
> 
> Regards,
> Daniel
> 
> -----Original Message-----
> From: Walter Underwood [mailto:wunderw...@netflix.com]
> Sent: 16 July 2009 15:04
> To: solr-user@lucene.apache.org
> Subject: Re: Word frequency count in the index
> 
> Lucene uses a tf.idf relevance formula, so it automatically finds 
> common words (stop words) in your documents and gives them lower 
> weight. I recommend not removing stop words at all and letting Lucene 
> handle the weighting.
> 
> wunder
> 
> On 7/16/09 3:29 AM, "Pooja Verlani" <pooja.verl...@gmail.com> wrote:
> 
>> Hi,
>> 
>> Is there any way in Solr to know the count of each word indexed in
>> Solr?
>> I want to find out the different word frequencies to figure out
>> 'application-specific stop words'.
>> 
>> Please let me know if it's possible.
>> 
>> Thank you,
>> Regards,
>> Pooja
> 
> 
> http://www.bbc.co.uk/
> This e-mail (and any attachments) is confidential and may contain 
> personal views which are not the views of the BBC unless specifically
> stated.
> If you have received it in error, please delete it from your system.
> Do not use, copy or disclose the information in any way nor act in 
> reliance on it and notify the sender immediately.
> Please note that the BBC monitors e-mails sent or received.
> Further communication will signify your consent to this.
> 
