On a general note, if you want to "really" understand how the MLT works, take a look at the wiki or read this thorough blog post: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
Regards,
Aleksander

On Wed, 26 Nov 2008 14:41:52 +0100, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:
Hi Aleksander,

This was a typo on my end; the original query used a colon instead of an equals sign. But I think the problem has to do with my field not being stored and not being defined with termVectors="true". I'm recreating the index now to see if this fixes the problem.

Best,
Patrick

-----Original Message-----
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
Sent: Wednesday 26 November 2008 14:37
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

Hi there!

Well, first of all, I think you have an error in your query, if I'm not mistaken. You say
http://localhost:8080/solr/select/?q=id=18477975...
but since you are referring to the field called "id", you must say:
http://localhost:8080/solr/select/?q=id:18477975...
(use a colon instead of the equals sign). I think that will do the trick.

If not, try adding &debugQuery=on at the end of your request URL to see debug output on how the query is parsed and if/how any documents are matched against your query.

Hope this helps.
Cheers,
Aleksander

On Wed, 26 Nov 2008 13:08:30 +0100, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:

Hi Aleksander,

Thanks for clearing this up. I am confident that this is a way to explore for me, as I'm just starting to grasp the matter. Do you know why I'm not getting any results with the query posted earlier, then? It gives me only the following:

<lst name="moreLikeThis">
  <result name="18477975" numFound="0" start="0"/>
</lst>

instead of delivering details of the interestingTerms.

Thanks in advance,
Patrick

-----Original Message-----
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]
Sent: Wednesday 26 November 2008 13:03
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

I do not agree with you at all.
The concept of MoreLikeThis is based on the fundamental idea of TF-IDF weighting, and not on term frequency alone.

Please take a look at:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html

As you can see, it is possible to use cut-off thresholds to significantly reduce the number of unimportant terms and to generate highly suitable queries based on the tf-idf weight of each term. As you point out, high-frequency terms alone tend to be useless for querying, but taking the document frequency into account drastically changes which terms are considered important!

In Solr, use parameters to manipulate your desired results:
http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c

For instance:
mlt.mintf - Minimum Term Frequency - the frequency below which terms will be ignored in the source doc.
mlt.mindf - Minimum Document Frequency - words that do not occur in at least this many docs will be ignored.

You can also set thresholds for term length etc.

Hope this gives you a better idea of things.
- Aleks

On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie <[EMAIL PROTECTED]> wrote:

Dear Patrick,

I had the same problem with the MoreLikeThis function.

After briefly reading and analyzing the source code of the moreLikeThis function in Solr, I concluded:

MoreLikeThis uses term vectors to rank all the terms from a document by their frequency. According to this ranking, it will start to generate queries, artificially, and search for documents.

So, moreLikeThis will retrieve related documents by artificially generating queries based on the most frequent terms.

There's a big problem with "most frequent terms" from documents.
The most frequent words are usually meaningless: so-called function words or, as people from Information Retrieval like to call them, stopwords.

However, ignoring the technical problems of the implementation of the moreLikeThis function, this approach is very dangerous, since queries are generated artificially based on a given document. Writing queries for retrieving a document is a human task, and it assumes some knowledge (the user knows what document he wants).

I advise using other approaches, depending on your expectations. For example, you can extract similar documents just by searching for documents with a similar title (more like this doesn't work in this case).

I hope it helps,
Best Regards,
Vitalie Scurtu

--- On Wed, 11/26/08, Plaatje, Patrick <[EMAIL PROTECTED]> wrote:
From: Plaatje, Patrick <[EMAIL PROTECTED]>
Subject: RE: Keyword extraction
To: solr-user@lucene.apache.org
Date: Wednesday, November 26, 2008, 10:52 AM

Hi All,

As an addition to my previous post, no interestingTerms are returned when I execute the following URL:

http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true

I get a moreLikeThis list though, any thoughts?

Best,
Patrick
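
For anyone following the thread, the request with the fixes suggested in the replies (q=id:... with a colon, the mlt.mintf / mlt.mindf thresholds, and debugQuery=on) could be put together roughly like this. This is only a sketch in Python: the helper name build_mlt_url and the threshold values are made up for illustration, and the host, port, document id and field name are simply the ones from Patrick's URL. It only builds the URL; whether interestingTerms actually show up in the response still depends on the schema (for example termVectors on the field) and the handler setup, which the thread leaves open.

```python
from urllib.parse import urlencode

# Base URL as it appears in the requests quoted in the thread (assumed setup).
SOLR_SELECT = "http://localhost:8080/solr/select/"


def build_mlt_url(doc_id, field="text", min_tf=2, min_df=5):
    """Build a MoreLikeThis request URL (hypothetical helper, not part of Solr).

    min_tf and min_df are illustrative threshold values, not recommendations.
    """
    params = [
        ("q", f"id:{doc_id}"),             # field:value uses a colon, not "="
        ("mlt", "true"),
        ("mlt.fl", field),                 # field used to find similar documents
        ("mlt.interestingTerms", "list"),
        ("mlt.match.include", "true"),
        ("mlt.mintf", min_tf),             # minimum term frequency in the source doc
        ("mlt.mindf", min_df),             # minimum number of docs a term must occur in
        ("debugQuery", "on"),              # shows how the query was parsed
    ]
    # safe=":" keeps the colon readable instead of encoding it as %3A
    return SOLR_SELECT + "?" + urlencode(params, safe=":")


print(build_mlt_url(18477975))
```

Raising min_tf or min_df drops more terms from the generated query, so it is probably best to start low and look at the debugQuery output before tightening them.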