On 7/23/2013 3:33 AM, Furkan KAMACI wrote:
> Sometimes a huge part of a document may exist in another document. As like
> in student plagiarism or quotation of a blog post at another blog post.
> Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any class to
> detect it?

Solr is designed for search, not heavy analysis. It might be possible, as
Tommaso suggested, to take the MoreLikeThis functionality from Solr and
adapt it to this use case, but this isn't really something Solr was
designed to do.

If you did use MoreLikeThis out of the box, the most it could do is show
you documents similar to a specific document, but then you'd have to do
your own actual comparison. Solr would not be able to tell you whether
it's copied, just that it's similar. Also, it would not be able to easily
and quickly do a full comparison across a huge number of documents.

You'd be much better off with a tool specifically designed for the
purpose. Perhaps Solr's MoreLikeThis capability would be something you
could use in creating such a tool, but I couldn't say.

Thanks,
Shawn
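
To make the "own actual comparison" step a bit more concrete, here is a minimal sketch in Python. It is not part of Solr or Lucene. It assumes you have already fetched the text of a handful of candidate documents (for example, the ones MoreLikeThis returned) and it measures how many of the suspect document's word 5-grams ("shingles") also appear in each candidate. The function names (shingles, containment, rank_candidates) and the shingle size of 5 are illustrative choices, not anything Solr provides.

```python
import re


def shingles(text, n=5):
    """Return the set of n-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    # Texts shorter than n words produce an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(suspect, source, n=5):
    """Fraction of the suspect's shingles that also occur in the source."""
    suspect_shingles = shingles(suspect, n)
    if not suspect_shingles:
        return 0.0
    shared = suspect_shingles & shingles(source, n)
    return len(shared) / len(suspect_shingles)


def rank_candidates(suspect, candidates, n=5):
    """Sort (doc_id, text) candidates by how much of the suspect they cover.

    `candidates` is assumed to be a small list, e.g. the similar documents
    returned by a MoreLikeThis query; this is not meant for a full corpus.
    """
    scored = [(doc_id, containment(suspect, text, n)) for doc_id, text in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A score near 1.0 means most of the suspect text appears word for word in the candidate; a score near 0.0 means the two only share vocabulary. Where to put the threshold for "copied" is a judgment call, and this sketch does not catch paraphrasing.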