-----Original message-----
> From: Furkan KAMACI <furkankam...@gmail.com>
> Sent: Sunday 22nd September 2013 21:15
> To: solr-user@lucene.apache.org
> Subject: Re: Near Duplicate Document Detection at Solr
>
> I also know that there is another mechanism in Solr:
> http://wiki.apache.org/solr/Deduplication I think that I should add a
> custom signature because that is the most usable one for me:
> http://wiki.apache.org/solr/TextProfileSignature

Keep in mind that its results are really bad for short documents, and it does not work for languages that do not use whitespace.

> On the other hand, are
> there any limitations for deduplication in SolrCloud?

Yes, it does not work: https://issues.apache.org/jira/browse/SOLR-3473

> What do you think?
>
> 2013/9/22 Furkan KAMACI <furkankam...@gmail.com>
>
> > I want to detect near duplicate documents (for web documents). I know that
> > there is an algorithm called Winnowing and there is another technique used
> > by Google. However, I also know that Solr has a component called
> > MoreLikeThis. Google's page explains that *mirroring and plagiarism* is
> > easy to detect but near duplicate detection is much more behind it.
> >
> > So I want to ask: what is the underlying algorithm the Solr MoreLikeThis
> > component uses, and can I use it for this kind of purpose?
> >
> > Otherwise, I will implement an algorithm for near duplicate document
> > detection within a few days and I will be proud to contribute it
> > to Solr.
> >
> > Thanks;
> > Furkan KAMACI
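
To illustrate the short-document and whitespace remarks, here is a rough Python sketch of a frequency-profile signature. It is only loosely modeled on the idea behind TextProfileSignature and is not Solr's code: the names tokenize, profile_signature, min_token_len and quant_rate, the default values, and the alphabetical tie-break in the sort are all assumptions made for this example.

```python
import hashlib
import re
from collections import Counter


def tokenize(text, min_token_len=2):
    # Assumption of this sketch: a token is a maximal run of letters/digits,
    # lowercased; tokens of min_token_len characters or fewer are discarded.
    runs = re.findall(r"[^\W_]+", text.lower())
    return [t for t in runs if len(t) > min_token_len]


def profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Hash a quantized token-frequency profile (illustrative only)."""
    counts = Counter(tokenize(text, min_token_len))
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step derived from the most frequent token.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each count down to a multiple of quant; drop tokens that fall to 0.
    profile = []
    for token, count in counts.items():
        quantized = (count // quant) * quant
        if quantized >= quant:
            profile.append((quantized, token))

    # Most frequent first; ties broken alphabetically so the result is stable.
    profile.sort(key=lambda p: (-p[0], p[1]))
    data = "".join(f"\n{q} {tok}" for q, tok in profile)
    return hashlib.md5(data.encode("utf-8")).hexdigest()


# Two different short texts: only "spam" survives quantization in both.
collide = profile_signature("spam spam eggs") == profile_signature("spam spam ham toast")

# Two near-duplicate short texts: every token counts once, so one changed
# word changes the signature.
missed = profile_signature("the quick brown fox") != profile_signature("the quick brown dog")

# Text without whitespace between words becomes a single token.
cjk_tokens = tokenize("今天天气很好")
```

In this sketch, collide and missed are both True and cjk_tokens holds a single token: short documents either collide when they are different or fail to match when they are nearly identical, and a sentence written without spaces is treated as one long word, so any one-character change produces a different profile.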