Re: Document Similarity Algorithm at Solr/Lucene

Tommaso Teofili Tue, 23 Jul 2013 07:27:35 -0700

If you need a specialized algorithm for detecting blog post plagiarism /
quotations (which are different tasks, IMHO), I think you have 2 options:
1. implement a dedicated one based on your features / metrics / domain
2. try to fine-tune an existing algorithm that is flexible enough


If I were to do it with Solr I'd probably do something like:
1. index "original" blogposts in Solr (possibly using Jack's suggestion
about ngrams / shingles)
2. run MLT queries using the text of the "candidate copy" blog posts
3. get the first, say, 2-3 hits
4. mark it as quote / plagiarism
5. eventually train a classifier to help you mark other texts as quote /
plagiarism
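
Steps 2-4 could be sketched roughly like this in Python. This is only a
sketch under assumptions that are not from this thread: a
MoreLikeThisHandler registered at "/mlt" in solrconfig.xml, a core named
"blogposts", a text field named "content", and an arbitrary score
threshold that would need tuning on real data. It builds the request URL
and filters an already parsed JSON response; it does not talk to Solr
itself.

```python
from urllib.parse import urlencode

# Hypothetical defaults: the core name and field name are assumptions.
DEFAULT_BASE_URL = "http://localhost:8983/solr/blogposts"
DEFAULT_FIELD = "content"


def build_mlt_url(candidate_text, base_url=DEFAULT_BASE_URL,
                  field=DEFAULT_FIELD, rows=3):
    """Build a MoreLikeThisHandler URL that sends the candidate text itself.

    Assumes the handler is mapped to "/mlt". Newer Solr releases may need
    stream.body to be enabled explicitly before this works.
    """
    params = {
        "stream.body": candidate_text,  # text of the suspected copy
        "mlt.fl": field,                # field(s) used for similarity
        "mlt.mintf": 1,                 # min term frequency in the candidate
        "mlt.mindf": 1,                 # min document frequency in the index
        "fl": "id,score",               # return only id and score
        "rows": rows,                   # "the first, say, 2-3 hits"
        "wt": "json",
    }
    return base_url + "/mlt?" + urlencode(params)


def flag_hits(response, min_score):
    """Return ids of indexed originals whose score reaches min_score.

    `response` is the parsed JSON body; the "response" -> "docs" layout is
    what the handler is assumed to return here.
    """
    docs = response.get("response", {}).get("docs", [])
    return [d["id"] for d in docs if d.get("score", 0.0) >= min_score]
```

The ids returned by flag_hits (a hypothetical helper) are the originals the
candidate would be marked against in step 4; the same scores could later
serve as features for the classifier in step 5.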

HTH,
Tommaso



2013/7/23 Furkan KAMACI <furkankam...@gmail.com>

> Actually I need a specialized algorithm. I want to use that algorithm to
> detect duplicate blog posts.
>
> 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
>
> > Hi,
> >
> > I think you may leverage and / or improve the MLT component [1].
> >
> > HTH,
> > Tommaso
> >
> > [1] : http://wiki.apache.org/solr/MoreLikeThis
> >
> >
> > 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
> >
> > > Hi;
> > >
> > > Sometimes a huge part of a document may exist in another document, as
> > > in student plagiarism or quotation of a blog post in another blog post.
> > > Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) have any class to
> > > detect it?
> > >
> >
>
