Re: Document Similarity Algorithm at Solr/Lucene

Jack Krupansky Tue, 23 Jul 2013 06:54:01 -0700

One classic approach is to simply use the full text of the suspect text, as
well as bigrams and trigrams (phrases) from that text, joined with "OR"
operators. The top results will be the documents that most closely "match"
the subject text. That provides a visual set of similar results. You will
then have to apply some heuristic of your own for how many top results to
look at or what score to cut off at. The use of "OR" operators ensures that
similar documents will be found even if not 100% of the words are used. Yes,
"OR" guarantees that your total result count will be high, but scoring
ensures that the top results will be the more relevant ones.
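
As a rough illustration of the above, here is a minimal Python sketch that
builds such an "OR" query from the words, bigrams and trigrams of a suspect
text. The field name "content", the function names and the simple regex
tokenizer are assumptions made for the example, not anything Solr provides,
and the cutoff (keep hits scoring at least half of the top score) is just one
possible heuristic.

```python
import re


def shingles(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_or_query(text, field="content", max_n=3):
    """Build a Lucene-syntax query: single words, bigrams and trigrams ORed.

    `field` is a hypothetical field name. Tokens are restricted to word
    characters, so no query-syntax escaping is needed for them.
    """
    words = re.findall(r"\w+", text.lower())
    clauses = []
    seen = set()
    for n in range(1, max_n + 1):
        for shingle in shingles(words, n):
            if shingle in seen:
                continue  # skip repeated terms and phrases
            seen.add(shingle)
            if n == 1:
                clauses.append(f"{field}:{shingle}")
            else:
                # multi-word shingles become quoted phrase queries
                clauses.append(f'{field}:"{shingle}"')
    return " OR ".join(clauses)


def apply_cutoff(hits, ratio=0.5):
    """Keep hits scoring at least `ratio` times the top score.

    `hits` is a list of (doc_id, score) pairs sorted by descending score.
    The ratio is an arbitrary starting point to be tuned on real data.
    """
    if not hits:
        return []
    top_score = hits[0][1]
    return [hit for hit in hits if hit[1] >= ratio * top_score]


if __name__ == "__main__":
    print(build_or_query("the quick brown fox"))
```

For long documents the query can exceed Solr's default limit on boolean
clauses, so in practice you would probably query paragraph by paragraph or
sample the shingles.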


-- Jack Krupansky

-----Original Message-----
From: Furkan KAMACI
Sent: Tuesday, July 23, 2013 6:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Document Similarity Algorithm at Solr/Lucene

Actually I need a specialized algorithm. I want to use that algorithm to
detect duplicate blog posts.

2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>

Hi,

I think you may leverage and/or improve the MLT component [1].

HTH,
Tommaso

[1] : http://wiki.apache.org/solr/MoreLikeThis
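
For reference, a minimal Python sketch of a request that turns on the MLT
component for a single document. The base URL, the "content" field and the
"id" unique key are hypothetical and depend on your schema; the parameter
values are only starting points.

```python
from urllib.parse import urlencode


def mlt_url(base_url, doc_id, fields=("content",), count=10):
    """Build a /select URL that asks Solr for documents similar to doc_id.

    Assumes the unique key field is named "id" and that the fields listed
    in `fields` are indexed (ideally with term vectors stored).
    """
    params = {
        "q": f"id:{doc_id}",          # the document to find neighbours for
        "mlt": "true",                # enable the MoreLikeThis component
        "mlt.fl": ",".join(fields),   # fields used to measure similarity
        "mlt.count": count,           # similar documents returned per result
        "mlt.mintf": 2,               # ignore terms rarer than this in the source doc
        "mlt.mindf": 2,               # ignore terms found in fewer docs than this
        "fl": "id,score",
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical core name and document id
    print(mlt_url("http://localhost:8983/solr/blog", "42"))
```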


2013/7/23 Furkan KAMACI <furkankam...@gmail.com>

> Hi;
>
> Sometimes a large part of one document also appears in another, as in
> student plagiarism or one blog post quoting another. Does Solr/Lucene
> or one of its libraries (UIMA, OpenNLP, etc.) have any class to
> detect this?
>
