I also know that Solr has another mechanism, Deduplication:
http://wiki.apache.org/solr/Deduplication
I think I should add a custom signature, because this one looks like the
most usable option for me:
http://wiki.apache.org/solr/TextProfileSignature
On the other hand, are there any limitations on deduplication in SolrCloud?

What do you think?
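
To make the question concrete, here is a rough Python sketch of how I
understand a TextProfileSignature-style fuzzy signature to work: count token
frequencies, quantize them, drop the rare tokens, and hash what is left. The
function name, the parameter names (min_token_len, quant_rate) and their
defaults are my own assumptions for illustration; this is not Solr's actual
code and the real class may differ in its details.

```python
import hashlib
import re
from collections import Counter


def text_profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Fuzzy signature loosely modeled on the TextProfileSignature idea.

    Illustrative only: names and defaults are assumptions, not Solr's API.
    """
    # Lowercase and keep only alphanumeric tokens of a minimum length.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step derived from the most frequent token.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant; drop rare tokens.
    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:
            profile.append((token, quantized))

    # Sort by descending quantized frequency, then by token for determinism.
    profile.sort(key=lambda item: (-item[1], item[0]))
    serialized = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()
```

With this sketch, two pages that differ only in a few rarely used words get
the same signature, because those words fall below the quantization threshold
and never reach the hash. Pages whose frequent terms differ get different
signatures.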


2013/9/22 Furkan KAMACI <furkankam...@gmail.com>

> I want to detect near-duplicate documents (for web documents). I know that
> there is an algorithm called Winnowing and that there is another technique
> used by Google. However, I also know that Solr has a component called
> MoreLikeThis. Google's page explains that *mirroring and plagiarism* are
> easy to detect, but near-duplicate detection goes well beyond that.
>
> So I want to ask: what is the underlying algorithm that Solr's MoreLikeThis
> component uses, and can I use it for this purpose?
>
> Otherwise, I will implement an algorithm for near-duplicate document
> detection within a few days, and I would be proud to contribute it to
> Solr.
>
> Thanks;
> Furkan KAMACI
>
