Re: Document Similarity Algorithm at Solr/Lucene

Shashi Kant Tue, 23 Jul 2013 08:08:59 -0700

Here is a paper that I found useful:
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
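
For the ngrams / shingles idea mentioned further down the thread, here is a rough sketch of how shingle overlap could flag a post that is largely contained in another one. It is plain Python, not Solr or Lucene API; the function names (shingles, containment) and the parameter k are made up for illustration, and any cutoff for calling something a copy would have to be tuned on your own data.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles (word n-grams) found in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat it as one shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(candidate, original, k=5):
    """Fraction of the candidate's shingles that also occur in the original."""
    cand = shingles(candidate, k)
    if not cand:
        return 0.0
    return len(cand & shingles(original, k)) / len(cand)


# Hypothetical sample texts, only for demonstration.
original = "the quick brown fox jumps over the lazy dog near the river bank"
quote = "Quick brown fox jumps over the lazy dog"
unrelated = "solr is a search server built on top of lucene"

print(containment(quote, original, k=3))
print(containment(unrelated, original, k=3))
```

Containment is asymmetric on purpose: a short quotation lifted from a long post scores high against that post, whereas a symmetric measure such as Jaccard similarity would score it low.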



On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> Thanks for your comments.
>
> 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
>
>> If you need a specialized algorithm for detecting blog post plagiarism /
>> quotations (which are different tasks IMHO) I think you have 2 options:
>> 1. implement a dedicated one based on your features / metrics / domain
>> 2. try to fine tune an existing algorithm that is flexible enough
>>
>> If I were to do it with Solr I'd probably do something like:
>> 1. index "original" blogposts in Solr (possibly using Jack's suggestion
>> about ngrams / shingles)
>> 2. do MLT queries with "candidate blogposts copies" text
>> 3. get the first, say, 2-3 hits
>> 4. mark it as quote / plagiarism
>> 5. eventually train a classifier to help you mark other texts as quote /
>> plagiarism
>>
>> HTH,
>> Tommaso
>>
>>
>>
>> 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
>>
>> > Actually I need a specialized algorithm. I want to use that algorithm to
>> > detect duplicate blog posts.
>> >
>> > 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
>> >
>> > > Hi,
>> > >
>> > > I think you may leverage and / or improve the MLT component [1].
>> > >
>> > > HTH,
>> > > Tommaso
>> > >
>> > > [1] : http://wiki.apache.org/solr/MoreLikeThis
>> > >
>> > >
>> > > 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
>> > >
>> > > > Hi;
>> > > >
>> > > > Sometimes a large part of a document may exist in another document, as
>> > > > in student plagiarism or the quotation of a blog post in another blog
>> > > > post. Do Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) have any
>> > > > class to detect it?
>> > > >
>> > >
>> >
>>
