Re: Keyword extraction

Aleksander M. Stensby Wed, 26 Nov 2008 04:04:28 -0800

I do not agree with you at all. The concept of MoreLikeThis is based on the fundamental idea of TF-IDF weighting, and not term frequency alone.
Please take a look at:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
As you can see, it is possible to use cut-off thresholds to significantly reduce the number of unimportant terms, and to generate highly suitable queries based on the tf-idf weight of each term. As you point out, high-frequency terms alone tend to be useless for querying, but taking the document frequency into account gives a drastically better measure of a term's importance!
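To make the weighting concrete, here is a minimal Python sketch of tf-idf term selection with cut-off thresholds. It is not the Lucene implementation: the function name, the toy corpus, and the default threshold values are made up for illustration, and the idf formula (log(numDocs / (docFreq + 1)) + 1) is an assumption modeled on the classic Lucene-style idf.

```python
import math
from collections import Counter


def interesting_terms(source_tokens, corpus, min_tf=2, min_df=2, max_terms=5):
    """Rank the terms of one document by tf * idf, applying cut-off thresholds.

    Hypothetical helper for illustration only, not Lucene's MoreLikeThis code.
    """
    num_docs = len(corpus)

    # Document frequency: number of corpus documents that contain each term.
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))

    scored = []
    for term, tf in Counter(source_tokens).items():
        df = doc_freq[term]
        # Cut-offs: drop terms that are too rare in the source document
        # or that occur in too few documents of the corpus.
        if tf < min_tf or df < min_df:
            continue
        # Assumed idf formula; terms found in many documents get a low idf.
        idf = math.log(num_docs / (df + 1)) + 1.0
        scored.append((term, tf * idf))

    # Highest score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:max_terms]


# Toy corpus (made up): each document is a list of tokens.
corpus = [
    "the lucene index stores the term vectors".split(),
    "the solr server wraps the lucene index".split(),
    "the cat sat on the mat".split(),
    "the dog and the cat".split(),
]
source = "the lucene index and the lucene term vectors in the index".split()

for term, score in interesting_terms(source, corpus):
    print(f"{term}: {score:.3f}")
```

In this toy example "the" has the highest raw term frequency in the source document, yet it ranks below "lucene" and "index" because it occurs in every document of the corpus.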

In Solr, use parameters to manipulate your desired results:
http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c

For instance:

mlt.mintf - Minimum Term Frequency - the frequency below which terms will be ignored in the source doc.
mlt.mindf - Minimum Document Frequency - the frequency at which words will be ignored which do not occur in at least this many docs.

You can also set thresholds for term length etc.
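
As a rough sketch of how these thresholds could be passed on a request, the Python snippet below assembles a query URL. The helper name, host, document id, and threshold values are illustrative assumptions, not recommendations; mlt.minwl (minimum word length) is the term-length threshold referred to above.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, query, **mlt_options):
    """Build a Solr select URL with MoreLikeThis enabled.

    Hypothetical helper: each keyword argument is sent as an "mlt."-prefixed
    request parameter, e.g. mintf=2 becomes mlt.mintf=2.
    """
    params = {"q": query, "mlt": "true"}
    for name, value in mlt_options.items():
        params["mlt." + name] = value
    # urlencode percent-escapes reserved characters such as ":" in the query.
    return base_url + "?" + urlencode(params)


# Illustrative values only: host, id, and thresholds are assumptions.
url = build_mlt_url(
    "http://localhost:8080/solr/select",
    "id:18477975",
    fl="text",   # field(s) to take the terms from
    mintf=2,     # ignore terms occurring fewer than 2 times in the source doc
    mindf=5,     # ignore terms occurring in fewer than 5 docs
    minwl=4,     # ignore words shorter than 4 characters
)
print(url)
```

Raising mlt.mintf and mlt.mindf makes the generated query stricter; which values work well depends on the size of the index and the length of the documents.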


Hope this gives you a better idea of things.
- Aleks

On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie <[EMAIL PROTECTED]> wrote:

Dear Patrick, I had the same problem with the MoreLikeThis function.
After briefly reading and analyzing the source code of the MoreLikeThis function in Solr, I concluded:
MoreLikeThis uses term vectors to rank all the terms from a document
by their frequency. According to this ranking, it will start to generate
queries, artificially, and search for documents.
So, MoreLikeThis will retrieve related documents by artificially generating queries based on the most frequent terms.
There's a big problem with the "most frequent terms" from documents. The most frequent words are usually meaningless: they are so-called function words or, as people from Information Retrieval like to call them, stopwords. However, ignoring technical problems of implementation of the MoreLikeThis function, this approach is very dangerous, since queries are generated artificially based on a given document.
Writing queries for retrieving a document is a human task, and it assumes some knowledge (the user knows what document he wants).
I advise using other approaches, depending on your expectations. For example, you can extract similar documents just by searching for documents with a similar title (MoreLikeThis doesn't work in this case).
I hope it helps,
Best Regards,
Vitalie Scurtu
--- On Wed, 11/26/08, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:
From: Plaatje, Patrick <[EMAIL PROTECTED]>
Subject: RE:  Keyword extraction
To: [email protected]
Date: Wednesday, November 26, 2008, 10:52 AM

Hi All,
as an addition to my previous post, no interestingTerms are returned
when I execute the following URL:
http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true
I get a moreLikeThis list though, any thoughts?
Best,
Patrick




--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no
