I also know that Solr has another mechanism for this, deduplication: http://wiki.apache.org/solr/Deduplication

I think I should add a custom signature, since this one looks the most usable for me: http://wiki.apache.org/solr/TextProfileSignature

On the other hand, are there any limitations on deduplication in SolrCloud?
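
To check that I understand the idea behind TextProfileSignature, here is a rough Python sketch of how I read it: build a token-frequency profile, quantize the counts so that rare tokens drop out, and hash what is left. This is only my own approximation, not Solr's code. The function name profile_signature is made up, and the parameters min_token_len and quant_rate (with their defaults) are my assumptions about the knobs the wiki page describes.

```python
import hashlib
import re
from collections import Counter


def profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Approximate text-profile signature (illustrative, not Solr's code)."""
    # Lowercase, split into alphanumeric runs, drop very short tokens.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step derived from the most frequent token.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each count down to a multiple of quant; tokens that fall
    # below one step are treated as noise and dropped.
    profile = {}
    for token, count in counts.items():
        quantized = (count // quant) * quant
        if quantized >= quant:
            profile[token] = quantized

    # Order by descending count, then token, so the hash is deterministic.
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    payload = "\n".join(f"{token} {count}" for token, count in ordered)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


doc_a = "solr search solr index solr cloud search index"
doc_b = doc_a + " lucene"  # near duplicate: one extra rare word
doc_c = "completely different text text about about cooking cooking"

sig_a = profile_signature(doc_a)
sig_b = profile_signature(doc_b)
sig_c = profile_signature(doc_c)
```

With these inputs, doc_a and doc_b get the same signature because the single extra word is quantized away, while doc_c gets a different one. That also shows the limit of this kind of signature: it matches only documents whose quantized profiles are identical, so it is an exact match on a fuzzy fingerprint rather than a similarity score.
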
What do you think?

2013/9/22 Furkan KAMACI <furkankam...@gmail.com>

> I want to detect near duplicate documents (for web documents). I know that
> there is an algorithm called Winnowing and there is another technique used
> by Google. However I also know that Solr has a component called
> MoreLikeThis. Google's page explains that *mirroring and plagiarism* is
> easy to detect but near duplicate detection is much more behind it.
>
> So I want to ask that what is the underlying algorithm Solr MoreLikeThis
> component uses and can I use it for such kind of purposes?
>
> Otherwise, I will implement an algorithm for near duplicate document
> detection within few days and I will be proud to contribute and adopt it
> into Solr.
>
> Thanks;
> Furkan KAMACI
>