One classic approach is to query with the full text of the suspect document,
together with the bigrams and trigrams (phrases) from that text, all joined
with "OR" operators. The top results will be the documents that most closely
"match" the suspect text, which gives you a set of similar results to inspect
visually. You will then have to apply a heuristic of your own to decide how
many top results to look at or what score to cut off at. Using "OR" ensures
that similar documents are found even if they do not contain 100% of the
words. Yes, "OR" guarantees that your total result count will be high, but
scoring ensures that the most relevant documents rank at the top.
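
A minimal sketch of this idea in Python, for illustration only. The field
name "content", the helper names, and the "keep anything scoring at least
half of the best hit" cutoff are all assumptions, not anything Solr
prescribes; tune them against your own data.

```python
import re


def shingles(tokens, n):
    """Return the consecutive n-word phrases (word n-grams) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_or_query(text, field="content"):
    """Build one big OR query from the words, bigrams and trigrams of text.

    "content" is a hypothetical field name. Lowercasing keeps literal words
    such as "OR" or "AND" from being read as operators, and \\w+ drops
    punctuation, so no query-syntax characters are left to escape.
    """
    tokens = re.findall(r"\w+", text.lower())
    clauses, seen = [], set()
    for term in tokens:                      # single words
        if term not in seen:
            seen.add(term)
            clauses.append(term)
    for n in (2, 3):                         # bigrams, then trigrams
        for phrase in shingles(tokens, n):
            if phrase not in seen:
                seen.add(phrase)
                clauses.append('"%s"' % phrase)
    return "%s:(%s)" % (field, " OR ".join(clauses))


def top_matches(results, ratio=0.5):
    """Keep (doc_id, score) pairs scoring at least ratio * the best score.

    This relative cutoff is just one possible heuristic (an assumption).
    """
    if not results:
        return []
    best = max(score for _, score in results)
    return [(doc, score) for doc, score in results if score >= ratio * best]


if __name__ == "__main__":
    print(build_or_query("The quick brown fox"))
    print(top_matches([("a", 10.0), ("b", 6.0), ("c", 2.0)]))
```

Note that a long post produces a very large number of clauses, so you may
need to sample phrases or raise the boolean clause limit (maxBooleanClauses)
in your Solr configuration.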
-- Jack Krupansky
-----Original Message-----
From: Furkan KAMACI
Sent: Tuesday, July 23, 2013 6:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Document Similarity Algorithm at Solr/Lucene
Actually I need a specialized algorithm. I want to use that algorithm to
detect duplicate blog posts.
2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
Hi,
You may leverage and/or improve the MoreLikeThis (MLT) component [1].
HTH,
Tommaso
[1] : http://wiki.apache.org/solr/MoreLikeThis
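
As a rough illustration of what an MLT request can look like, here is a
small Python sketch that only builds the request URL. The host, the core
name "blog", the unique key field "id" and the text field "content" are all
hypothetical; the mlt.* parameters are the ones described on the wiki page
above, and the values chosen here are arbitrary starting points.

```python
from urllib.parse import urlencode


def mlt_url(base_url, doc_id, field="content", count=5):
    """Build a /select URL asking Solr for documents similar to doc_id.

    Assumes the unique key field is called "id" and that doc_id contains
    no double quotes (this sketch does not escape them).
    """
    params = [
        ("q", 'id:"%s"' % doc_id),   # the document to find neighbours for
        ("mlt", "true"),             # turn on the MoreLikeThis component
        ("mlt.fl", field),           # field(s) used to measure similarity
        ("mlt.mintf", 1),            # minimum term frequency in the source doc
        ("mlt.mindf", 1),            # minimum document frequency in the index
        ("mlt.count", count),        # similar documents returned per result
        ("wt", "json"),
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local Solr core named "blog".
    print(mlt_url("http://localhost:8983/solr/blog", "post-42"))
```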
2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
> Hi;
>
> Sometimes a large part of a document may also appear in another document,
> as in student plagiarism or when one blog post quotes another. Do
> Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to
> detect this?
>