Re: Keyword extraction

Aleksander M. Stensby Wed, 26 Nov 2008 05:44:13 -0800

I'm sure that for certain problems and cases you will need to do quite a bit of tweaking to make it work (to suit your needs), but I responded to your statement because you made it sound like the MoreLikeThis component does not work at all for its purpose, while it actually does work as intended and can be of great aid in constructing queries to retrieve same-topic documents etc.


- Aleksander

On Wed, 26 Nov 2008 14:10:57 +0100, Scurtu Vitalie <[EMAIL PROTECTED]> wrote:

Yes, I totally understand, and agree. 
MoreLikeThis uses TF-IDF to rank terms, then it generates queries based on the top-ranked terms. In any case, I wasn't able to make it work after many attempts.
Finally, I used a different method for query generation, and it works better, or at least gives some results, while with MoreLikeThis the results were poor or there were no results at all.
I should mention that my index was composed of short documents, therefore the intersection between the top-ranked terms by TF-IDF was the empty set. MoreLikeThis works better when you have long documents.
Yes, I've changed the thresholds for min TFIDF and max TFIDF, and other parameters.
I've also used the "mlt.maxqt" parameter to increase the number of terms used in query generation, but it still didn't work well, since generating queries from the terms with the highest TF-IDF score doesn't produce a representative query for the document. I wasn't able to tune it. For a low value such as mlt.maxqt=3 or 4, results were poor, while for mlt.maxqt=5, 6 and higher it gave too many irrelevant results.
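
A minimal Python sketch of why short documents can leave a term-frequency cut-off with nothing to query on. This is not Lucene's or Solr's code: the helper name candidate_terms, the whitespace tokenizer and the sample texts are made up for illustration, and the cut-off of 2 is only an assumed stand-in for a threshold such as mlt.mintf.

```python
from collections import Counter


def candidate_terms(text, min_tf=2):
    """Return terms whose in-document frequency is at least min_tf.

    Hypothetical helper: a toy stand-in for a minimum term frequency
    cut-off, using naive lowercase whitespace tokenization.
    """
    counts = Counter(text.lower().split())
    return sorted(term for term, tf in counts.items() if tf >= min_tf)


short_doc = "solr keyword extraction question"
long_doc = "solr keyword extraction with solr term vectors and keyword ranking"

print(candidate_terms(short_doc))            # no term repeats in the short text
print(candidate_terms(long_doc))             # only repeated terms survive
print(candidate_terms(short_doc, min_tf=1))  # a cut-off of 1 keeps every term
```

In this toy setting, lowering the cut-off to 1 is what lets a short document contribute any query terms at all.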
Thank you,
Best Wishes,
Vitalie Scurtu
--- On Wed, 11/26/08, Aleksander M. Stensby <[EMAIL PROTECTED]> wrote:
From: Aleksander M. Stensby <aleksander.[EMAIL PROTECTED]>
Subject: Re:  Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 1:03 PM
I do not agree with you at all. The concept of MoreLikeThis is based on the
fundamental idea of TF-IDF weighting, and not term frequency alone.
Please take a look at:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
As you can see, it is possible to use cut-off thresholds to significantly
reduce the number of unimportant terms, and generate highly suitable queries
based on the tf-idf weight of the term, since as you point out, high
frequency terms alone tend to be useless for querying, but taking the document
frequency into account drastically increases the importance of the term!
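
To illustrate the difference, here is a small Python sketch that ranks the terms of one document by raw frequency and by tf * idf. It is a toy, not Lucene's scoring: the corpus, the function names and the plain idf = log(N / df) formula are assumptions chosen for the example.

```python
import math
from collections import Counter

# Toy corpus, made up for this example.
corpus = [
    "the index of the solr server",
    "the query and the results",
    "the solr morelikethis handler and the solr index",
    "the weather and the news",
    "the price of the ticket",
]


def top_by_tf(doc, n=3):
    """Top n terms of doc by raw term frequency (ties broken by name)."""
    tf = Counter(doc.split())
    return sorted(tf, key=lambda t: (-tf[t], t))[:n]


def top_by_tfidf(doc, docs, n=3):
    """Top n terms of doc by tf * log(N / df) (ties broken by name)."""
    df = Counter()
    for other in docs:
        df.update(set(other.split()))
    tf = Counter(doc.split())
    # max(..., 1) guards against terms that never occur in docs.
    scores = {t: tf[t] * math.log(len(docs) / max(df[t], 1)) for t in tf}
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


doc = corpus[2]
print(top_by_tf(doc))             # function words rank high on raw frequency
print(top_by_tfidf(doc, corpus))  # terms found in every doc score zero
```

A term such as "the" occurs in every document of this corpus, so its idf is log(1) = 0 and it drops to the bottom of the tf-idf ranking regardless of how often it appears.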

In solr, use parameters to manipulate your desired results:
http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c
For instance:
mlt.mintf - Minimum Term Frequency - the frequency below which terms will be
ignored in the source doc.
mlt.mindf - Minimum Document Frequency - words that do not occur in at least
this many docs will be ignored.
You can also set thresholds for term length etc.
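
As a rough sketch, such a request can be assembled in Python like this. The host and path are copied from the URL quoted further down in this thread; the helper name mlt_url, the id:18477975 query syntax and all threshold values are assumptions for illustration only, not recommended settings.

```python
from urllib.parse import urlencode

# Base URL as quoted in this thread; adjust for your own installation.
SOLR_SELECT = "http://localhost:8080/solr/select/"


def mlt_url(query, fields, min_tf=1, min_df=1, min_word_len=3, max_query_terms=10):
    """Build a select URL with MoreLikeThis parameters (illustrative values)."""
    params = {
        "q": query,
        "mlt": "true",
        "mlt.fl": fields,                # field(s) used for similarity
        "mlt.mintf": min_tf,             # minimum term frequency
        "mlt.mindf": min_df,             # minimum document frequency
        "mlt.minwl": min_word_len,       # minimum word length
        "mlt.maxqt": max_query_terms,    # maximum number of query terms
        "mlt.interestingTerms": "list",  # ask for the selected terms
    }
    return SOLR_SELECT + "?" + urlencode(params)


print(mlt_url("id:18477975", "text"))
```

Low values for mlt.mintf and mlt.mindf are used here because a small or short-document index may otherwise filter out every term.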

Hope this gives you a better idea of things.
- Aleks

On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie <[EMAIL PROTECTED]> wrote:
Dear Patrick, I had the same problem with the MoreLikeThis function.

After briefly reading and analyzing the source code of the moreLikeThis
function in Solr, I concluded:
MoreLikeThis uses term vectors to rank all the terms of a document
by frequency. According to this ranking, it will start to generate
queries, artificially, and search for documents.

So, moreLikeThis will retrieve related documents by artificially
generating queries based on the most frequent terms.
There's a big problem with the "most frequent terms" of
documents. The most frequent words are usually meaningless: so-called function
words, or, as people from Information Retrieval like to call them, stopwords.
However, ignoring the technical problems of the moreLikeThis
implementation, this approach is very dangerous, since queries are generated
artificially based on a given document.
Writing queries for retrieving a document is a human task, and it assumes
some knowledge (the user knows what document he wants).
I advise using other approaches, depending on your expectations. For
example, you can extract similar documents just by searching for documents with
a similar title (MoreLikeThis doesn't work in this case).
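
One possible reading of the title-based approach, sketched in Python. The field name "title", the small stopword list and the helper name title_query are hypothetical; the output is an ordinary field:(a OR b) query string that would still have to be sent to the search server.

```python
import re

# Tiny illustrative stopword list, not a complete one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def title_query(title, field="title"):
    """Build an OR query over the distinct non-stopword terms of a title."""
    terms = re.findall(r"[a-z0-9]+", title.lower())
    keep = []
    for term in terms:
        if term not in STOPWORDS and term not in keep:
            keep.append(term)
    if not keep:
        return None  # nothing left to search for
    return "{}:({})".format(field, " OR ".join(keep))


print(title_query("Keyword extraction with Solr and the MoreLikeThis handler"))
```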
I hope it helps,
Best Regards,
Vitalie Scurtu
--- On Wed, 11/26/08, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:
From: Plaatje, Patrick <[EMAIL PROTECTED]>
Subject: RE:  Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 10:52 AM

Hi All,
as an addition to my previous post, no interestingTerms are returned
when I execute the following URL:
http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true
I get a moreLikeThis list though, any thoughts?
Best,
Patrick
--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no
