RE: Near Duplicate Document Detection at Solr

2013-09-22 Thread Markus Jelsma
-Original message-
> From: Furkan KAMACI
> Sent: Sunday 22nd September 2013 21:15
> To: solr-user@lucene.apache.org
> Subject: Re: Near Duplicate Document Detection at Solr
>
> I also know that there is another mechanism in Solr:
> http://wiki.apache.org/

Re: Near Duplicate Document Detection at Solr

2013-09-22 Thread Furkan KAMACI
I also know that there is another mechanism in Solr:
http://wiki.apache.org/solr/Deduplication
I think I should add a custom signature, because that is the most usable one for me:
http://wiki.apache.org/solr/TextProfileSignature
On the other hand, are there any limitations for deduplication at

Near Duplicate Document Detection at Solr

2013-09-22 Thread Furkan KAMACI
I want to detect near-duplicate documents (for web documents). I know that there is an algorithm called Winnowing, and there is another technique used by Google. However, I also know that Solr has a component called MoreLikeThis. Google's page explains that *mirroring and plagiarism* is easy to detec
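
For readers unfamiliar with the Winnowing algorithm named in the question above, here is a rough Python sketch of the idea: hash every k-character substring, slide a window over the hashes, and keep the minimum hash of each window as a fingerprint. This is not how Solr's Deduplication or MoreLikeThis components work, and none of it comes from the thread. The function names, the values of `k` and `w`, the MD5-based hash, the text normalization, and the Jaccard comparison are all assumptions chosen for illustration.

```python
import hashlib
import re


def kgram_hashes(text, k):
    """Hash every k-character substring of the normalized text."""
    # Assumption: lowercase and drop everything except letters and digits.
    normalized = re.sub(r"[^a-z0-9]", "", text.lower())
    return [
        int.from_bytes(hashlib.md5(normalized[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(normalized) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Return the set of (hash, position) fingerprints selected by winnowing."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    # If there are fewer hashes than the window size, use a single window.
    w = min(w, len(hashes))
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, pick the rightmost occurrence of the minimum in the window.
        offset = w - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def similarity(a, b, k=5, w=4):
    """Jaccard similarity of the fingerprint hash values of two texts."""
    fa = {h for h, _ in winnow(a, k, w)}
    fb = {h for h, _ in winnow(b, k, w)}
    if not fa and not fb:
        return 0.0  # nothing to compare
    return len(fa & fb) / len(fa | fb)
```

Two documents could then be treated as near duplicates when `similarity(doc_a, doc_b)` exceeds a threshold picked for the corpus. The threshold, like `k` and `w`, would need tuning on real web pages.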