Re: Document Similarity Algorithm at Solr/Lucene

Shawn Heisey Tue, 23 Jul 2013 06:58:36 -0700

On 7/23/2013 3:33 AM, Furkan KAMACI wrote:
> Sometimes a large part of one document also appears in another, as in
> student plagiarism or a blog post quoted in another blog post.
> Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to
> detect this?


Solr is designed for search, not heavy analysis.  It might be possible,
as Tommaso suggested, to take the MoreLikeThis functionality from Solr
and adapt it to this use case, but this isn't really something Solr was
designed to do.

If you did use MoreLikeThis out of the box, the most it could do is show
you documents similar to a specific document, and then you'd have to do
the actual comparison yourself.  Solr would not be able to tell you
whether text was copied, only that the documents are similar.  It also
could not easily and quickly do a full comparison across a huge number
of documents.
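
As an illustration of the "own actual comparison" step, here is a minimal
sketch, assuming the candidate documents have already been fetched as
plain strings.  It is not part of Solr or Lucene.  The function names
(shingles, containment) and the shingle size of 5 words are hypothetical
choices, and word-shingle containment is just one common way to estimate
how much of one text reappears in another.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def containment(candidate, source, size=5):
    """Fraction of the candidate's shingles that also occur in source.

    1.0 means every shingle of the candidate appears in the source,
    0.0 means none do.  The threshold for "copied" is an assumption
    the caller has to pick.
    """
    cand = shingles(candidate, size)
    if not cand:
        return 0.0
    return len(cand & shingles(source, size)) / len(cand)
```

A score near 1.0 suggests the candidate is largely contained in the
source, but it only measures overlap.  It cannot decide whether a passage
is a legitimate quotation or plagiarism.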

You'd be much better off with a tool specifically designed for the
purpose.  Perhaps Solr's MoreLikeThis capability would be something you
could use in creating such a tool, but I couldn't say.
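
If MoreLikeThis were used to narrow down candidates for such a tool, the
request could be built roughly as below.  This is a sketch under
assumptions: the base URL, the core name, the "content" field, and the
"id" unique key are all hypothetical, and it relies on the MoreLikeThis
search component being available on the /select handler.  The mlt,
mlt.fl, mlt.count, mlt.mintf and mlt.mindf parameters are standard
MoreLikeThis parameters.  The values chosen here are only examples.

```python
from urllib.parse import urlencode


def mlt_url(base_url, doc_id, field, count=10):
    """Build a Solr select URL asking for documents similar to doc_id.

    base_url, doc_id and field are supplied by the caller. Nothing here
    contacts Solr; the function only assembles the query string.
    """
    params = {
        "q": f"id:{doc_id}",   # assumes the unique key field is "id"
        "mlt": "true",         # enable the MoreLikeThis component
        "mlt.fl": field,       # field(s) used to judge similarity
        "mlt.count": count,    # similar documents to return per result
        "mlt.mintf": 1,        # minimum term frequency in the source doc
        "mlt.mindf": 1,        # minimum document frequency in the index
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and core name, for illustration only.
url = mlt_url("http://localhost:8983/solr/collection1", "42", "content")
```

Document ids containing characters that are special in Solr query syntax
would need escaping before being placed in the q parameter.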

Thanks,
Shawn
