RE: Near Duplicate Document Detection at Solr

Markus Jelsma Sun, 22 Sep 2013 13:39:22 -0700

-----Original message-----
> From:Furkan KAMACI <furkankam...@gmail.com>
> Sent: Sunday 22nd September 2013 21:15
> To: solr-user@lucene.apache.org
> Subject: Re: Near Duplicate Document Detection at Solr
> 
> I also know that there is another mechanism in Solr:
> http://wiki.apache.org/solr/Deduplication I think that I should add a
> custom signature because that is the most usable one for me:
> http://wiki.apache.org/solr/TextProfileSignature


Keep in mind that its results are really bad for short documents, and it does
not work for languages that do not use whitespace.
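
To illustrate both problems, here is a rough Python sketch of a text-profile
style signature. It is a simplification for illustration only, not Solr's
actual code: the function name, the tokenizer regex and the parameter names
quant_rate and min_token_len (with values 0.01 and 2) are assumptions of this
sketch.

```python
import hashlib
import re
from collections import Counter


def text_profile_signature(text, quant_rate=0.01, min_token_len=2):
    """Simplified text-profile signature (illustrative sketch, not Solr's code)."""
    # Tokens are maximal runs of letters/digits, lowercased. Text written
    # without whitespace therefore collapses into very long tokens.
    tokens = [
        t for t in re.findall(r"[^\W_]+", text.lower())
        if len(t) >= min_token_len
    ]
    if not tokens:
        return hashlib.md5(b"").hexdigest()

    counts = Counter(tokens)
    max_freq = max(counts.values())

    # Quantization step derived from the most frequent token.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant and drop tokens
    # whose rounded frequency falls below quant.
    profile = []
    for token, count in counts.items():
        rounded = (count // quant) * quant
        if rounded >= quant:
            profile.append((token, rounded))

    # Sort by descending frequency, then by token, and hash the profile.
    profile.sort(key=lambda item: (-item[1], item[0]))
    payload = "\n".join(f"{token} {count}" for token, count in profile)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Short documents: only "the" survives quantization, so two different
    # sentences end up with the same profile.
    a = text_profile_signature("the cat and the dog")
    b = text_profile_signature("the car and the bus")
    print("short docs collide:", a == b)

    # No whitespace: each sentence is one token, so changing a single
    # character changes the whole signature.
    c = text_profile_signature("今天天气很好")
    d = text_profile_signature("今天天气不好")
    print("near-identical CJK docs match:", c == d)
```

In this sketch the two short English sentences share a signature even though
they differ, while the two nearly identical Chinese sentences do not, because
each one is treated as a single token.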

> On the other hand, are
> there any limitations for deduplication in SolrCloud?

Yes, it does not work:
https://issues.apache.org/jira/browse/SOLR-3473

> 
> What do you think?
> 
> 
> 2013/9/22 Furkan KAMACI <furkankam...@gmail.com>
> 
> > I want to detect near duplicate documents (for web documents). I know that
> > there is an algorithm called Winnowing and there is another technique used
> > by Google. However I also know that Solr has a component called
> > MoreLikeThis. Google's page explains that *mirroring and plagiarism* are
> > easy to detect, but near-duplicate detection is much more involved.
> >
> > So I want to ask: what is the underlying algorithm that the Solr
> > MoreLikeThis component uses, and can I use it for this kind of purpose?
> >
> > Otherwise, I will implement an algorithm for near-duplicate document
> > detection within a few days, and I will be proud to contribute it and
> > adapt it for Solr.
> >
> > Thanks;
> > Furkan KAMACI
> >
>
