Re: Keyword extraction

Scurtu Vitalie Wed, 26 Nov 2008 06:49:46 -0800

Sorry for not writing clearly. 

Yes, it works well for its purpose, and I didn't mean to say that the
moreLikeThis component does not work at all.


At the same time, it's good to know the limitations and problems of the
moreLikeThis function.

What I want to point out is that query generation is one of the fundamental
problems in Information Retrieval, and independent of the implementation of
the moreLikeThis function, it can give inappropriate results.
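
To make this concrete, here is a toy sketch in Python of generating a query
from the top TF-IDF terms of a document. It is not the Lucene or Solr code;
the function name top_tfidf_terms, the smoothed idf formula, and the tiny
corpus are all assumptions made up for illustration.

```python
import math
from collections import Counter

def top_tfidf_terms(doc, corpus, max_query_terms=3):
    """Rank the terms of `doc` by tf * idf and return the top ones.

    Hypothetical helper for illustration only, not the MoreLikeThis code.
    """
    n_docs = len(corpus)
    # Document frequency: how many corpus documents contain each term.
    df = Counter(term for d in corpus for term in set(d.split()))
    # Term frequency inside the source document.
    tf = Counter(doc.split())
    scores = {
        # Smoothed idf (the +1 terms are an assumed, commonly used choice).
        term: count * (math.log((n_docs + 1) / (df[term] + 1)) + 1.0)
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_query_terms]

# Made-up corpus of very short documents.
corpus = [
    "solr keyword extraction help",
    "solr faceting help",
    "lucene keyword scoring",
]
print(top_tfidf_terms(corpus[0], corpus, max_query_terms=2))
```

With documents this short every term occurs once, so the ranking depends on
idf alone and ties are broken arbitrarily (alphabetically here): the call
above prints ['extraction', 'help'], and "keyword" is left out of the query.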

Best Wishes,
Vitalie Scurtu


--- On Wed, 11/26/08, Aleksander M. Stensby <[EMAIL PROTECTED]> wrote:
From: Aleksander M. Stensby <[EMAIL PROTECTED]>
Subject: Re:  Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 2:43 PM


I'm sure that for certain problems and cases you will need to do quite a bit
of tweaking to make it work (to suit your needs), but I responded to your
statement because you made it sound like the MoreLikeThis component does not
work at all for its purpose, while it actually does work as intended and can be
of great aid in constructing queries to retrieve same-topic documents etc.

- Aleksander

On Wed, 26 Nov 2008 14:10:57 +0100, Scurtu Vitalie <[EMAIL PROTECTED]>
wrote:

> Yes, I totally understand, and agree. 
> 
> MoreLikeThis uses TF-IDF to rank terms, then it generates queries based on
> the top-ranked terms.  In any case, I wasn't able to make it work after many
> attempts.
> 
> Finally, I used a different method for query generation, and it works
> better, or at least gives some results, while with moreLikeThis the results
> were poor or there were none at all.
> 
> I should mention that my index was composed of short documents, so the
> intersection between the terms top-ranked by TF-IDF was the empty set.
> MoreLikeThis works better when you have long documents.
> 
> Yes, I changed the thresholds for min TFIDF and max TFIDF, and other
> parameters.
> 
> I also used the "mlt.maxqt" parameter to increase the number of terms used
> in query generation, but it still didn't work well, since generating queries
> from the terms with the highest TF-IDF score doesn't produce a query that is
> representative of the document.  I wasn't able to tune it. For low values
> such as mlt.maxqt=3 or 4, results were poor, while for mlt.maxqt=5, 6 and
> above it gave too many irrelevant results.
> 
> 
> 
> Thank you,
> Best Wishes,
> Vitalie Scurtu
> 
> 
> 
> --- On Wed, 11/26/08, Aleksander M. Stensby <[EMAIL PROTECTED]> wrote:
> From: Aleksander M. Stensby <aleksander.[EMAIL PROTECTED]>
> Subject: Re:  Keyword extraction
> To: solr-user@lucene.apache.org
> Date: Wednesday, November 26, 2008, 1:03 PM
> 
> I do not agree with you at all. The concept of MoreLikeThis is based on the
> fundamental idea of TF-IDF weighting, and not term frequency alone.
> Please take a look at:
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
> As you can see, it is possible to use cut-off thresholds to significantly
> reduce the number of unimportant terms, and generate highly suitable queries
> based on the tf-idf weight of the term, since as you point out, high
> frequency terms alone tend to be useless for querying, but taking the
> document frequency into account drastically increases the importance of the
> term!
> 
> In Solr, use parameters to manipulate your desired results:
> http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c
> For instance:
> mlt.mintf - Minimum Term Frequency - the frequency below which terms will be
> ignored in the source doc.
> mlt.mindf - Minimum Document Frequency - the frequency at which words will be
> ignored which do not occur in at least this many docs.
> You can also set thresholds for term length etc.
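
As a rough sketch of how these parameters end up in a request, here is a
small Python helper that only builds the URL. The host, port, and document id
are the ones from Patrick's message in this thread; the function name
build_mlt_url, the "id:..." query form, and the threshold values are
assumptions for illustration, not recommended settings.

```python
from urllib.parse import urlencode

def build_mlt_url(base_url, doc_query, fields,
                  min_tf=2, min_df=5, max_query_terms=25):
    """Build a MoreLikeThis request URL (hypothetical helper).

    The default threshold values are placeholders, not tuned settings.
    """
    params = {
        "q": doc_query,                # query selecting the source document
        "mlt": "true",                 # switch MoreLikeThis on
        "mlt.fl": fields,              # field(s) to take terms from
        "mlt.mintf": min_tf,           # minimum term frequency in source doc
        "mlt.mindf": min_df,           # minimum document frequency
        "mlt.maxqt": max_query_terms,  # maximum number of query terms
        "mlt.interestingTerms": "list",
    }
    # urlencode percent-escapes the values, e.g. ":" becomes "%3A".
    return base_url + "?" + urlencode(params)

url = build_mlt_url("http://localhost:8080/solr/select/",
                    "id:18477975", "text", min_tf=1, min_df=1)
print(url)
```

For an index of short documents, lowering min_tf and min_df as in the call
above is one thing to try, since most terms occur only once per document.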
> 
> Hope this gives you a better idea of things.
> - Aleks
> 
> On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie <[EMAIL PROTECTED]>
> wrote:
> 
>> Dear Patrick, I had the same problem with the MoreLikeThis function.
>> 
>> After briefly reading and analyzing the source code of the moreLikeThis
>> function in Solr, I concluded:
>> 
>> MoreLikeThis uses term vectors to rank all the terms from a document
>> by their frequency. According to this ranking, it will start to generate
>> queries, artificially, and search for documents.
>> 
>> So, moreLikeThis will retrieve related documents by artificially
>> generating queries based on the most frequent terms.
>> 
>> There's a big problem with the "most frequent terms" from documents.
>> The most frequent words are usually meaningless: the so-called function
>> words, or, as people from Information Retrieval like to call them,
>> stopwords. However, even ignoring the technical problems of the
>> implementation of the moreLikeThis function, this approach is very
>> dangerous, since queries are generated artificially based on a given
>> document.
>> Writing queries for retrieving a document is a human task, and it assumes
>> some knowledge (the user knows what document he wants).
>> 
>> I advise using other approaches, depending on your expectations. For
>> example, you can extract similar documents just by searching for documents
>> with a similar title (moreLikeThis doesn't work in this case).
>> 
>> I hope it helps,
>> Best Regards,
>> Vitalie Scurtu
>> --- On Wed, 11/26/08, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:
>> From: Plaatje, Patrick <[EMAIL PROTECTED]>
>> Subject: RE:  Keyword extraction
>> To: solr-user@lucene.apache.org
>> Date: Wednesday, November 26, 2008, 10:52 AM
>> 
>> Hi All,
>> as an addition to my previous post, no interestingTerms are returned
>> when I execute the following URL:
>> 
>> http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true
>> I get a moreLikeThis list though, any thoughts?
>> Best,
>> Patrick
>> 
>> 
>> 
>> 
> 
> 
> 
> -- Aleksander M. Stensby
> Senior software developer
> Integrasco A/S
> www.integrasco.no
>
