One classic approach is to simply use the full text of the suspect document as the query, together with bigrams and trigrams (phrases) from that text, all joined with "OR" operators. The top results will be the documents that most closely "match" the suspect text. That gives you a set of similar results that you can inspect visually. You will then have to apply some heuristic of your own to decide how many top results to look at or what score to cut off at. The use of "OR" operators ensures that similar documents will be found even if they do not contain 100% of the words. Yes, "OR" guarantees that your total result count will be high, but scoring ensures that the top results will be the more relevant ones.
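
As a rough illustration of the query construction described above, here is a minimal Python sketch. It only builds the query string; it does not talk to Solr. The field name "content", the helper names, and the simple \w+ tokenization are assumptions made for the example, not anything Solr prescribes. Real text may need the same analysis chain as your indexed field.

```python
import re


def ngrams(tokens, n):
    """Return all contiguous n-word phrases from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_or_query(text, field="content"):
    """Build an OR query from single words, bigrams, and trigrams.

    `field` is a hypothetical field name; use whatever field holds
    the document body in your own schema.
    """
    # Lowercase and keep only word characters, so no query-syntax
    # characters (quotes, colons, parentheses) reach the query string.
    tokens = re.findall(r"\w+", text.lower())

    clauses = []
    seen = set()

    def add(clause):
        # Skip repeated terms and phrases, preserving first-seen order.
        if clause not in seen:
            seen.add(clause)
            clauses.append(clause)

    for term in tokens:
        add(f"{field}:{term}")
    for n in (2, 3):
        for phrase in ngrams(tokens, n):
            # Quoted phrases only match when the words are adjacent.
            add(f'{field}:"{phrase}"')

    return " OR ".join(clauses)


if __name__ == "__main__":
    print(build_or_query("The quick brown fox"))
```

For a full blog post the resulting query can get very long, so you may need to send it as a POST request or raise the boolean clause limit on your installation. The cut-off heuristic (how many top results, or what score) is still up to you.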

-- Jack Krupansky

-----Original Message-----
From: Furkan KAMACI
Sent: Tuesday, July 23, 2013 6:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Document Similarity Algorithm at Solr/Lucene

Actually I need a specialized algorithm. I want to use that algorithm to
detect duplicate blog posts.

2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>

Hi,

You may leverage and/or improve the MoreLikeThis (MLT) component [1].

HTH,
Tommaso

[1] : http://wiki.apache.org/solr/MoreLikeThis


2013/7/23 Furkan KAMACI <furkankam...@gmail.com>

> Hi;
>
> Sometimes a huge part of a document may exist in another document, as in
> student plagiarism or the quotation of one blog post in another blog post.
> Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to
> detect it?
>

