Here is a paper that I found useful: http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
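
If I read the link right, that is the winnowing paper (Schleimer, Wilkerson, Aiken, SIGMOD 2003) on document fingerprinting. Below is a minimal sketch of the idea, not the authors' code: hash every k-gram of the normalized text, then keep the minimum hash in each window of w consecutive hashes. The function names, the MD5-based hash, the ASCII-only normalization and the values k=5, w=4 are all my own assumptions for illustration.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and keep only ASCII letters and digits (an assumption)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgram_hashes(text, k):
    """Hash every k-character substring of the normalized text."""
    s = normalize(text)
    return [
        # First 32 bits of MD5; any stable hash would do here.
        int(hashlib.md5(s[i:i + k].encode("utf-8")).hexdigest()[:8], 16)
        for i in range(len(s) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Return the set of minimum hashes over all windows of w k-gram hashes."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        # Text too short for a full window: keep a single fingerprint.
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def containment(part, whole, k=5, w=4):
    """Fraction of the fingerprints of `part` that also occur in `whole`."""
    a = winnow(part, k, w)
    b = winnow(whole, k, w)
    return len(a & b) / len(a) if a else 0.0
```

A high containment(original_post, candidate_post) suggests that most of the original shows up inside the candidate. Any threshold for calling that a quote or plagiarism would have to be tuned on your own data.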

On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:

> Thanks for your comments.
>
> 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
>
>> if you need a specialized algorithm for detecting blog post plagiarism /
>> quotations (which are different tasks IMHO) I think you have 2 options:
>> 1. implement a dedicated one based on your features / metrics / domain
>> 2. try to fine tune an existing algorithm that is flexible enough
>>
>> If I were to do it with Solr I'd probably do something like:
>> 1. index "original" blogposts in Solr (possibly using Jack's suggestion
>>    about ngrams / shingles)
>> 2. do MLT queries with "candidate blogposts copies" text
>> 3. get the first, say, 2-3 hits
>> 4. mark it as quote / plagiarism
>> 5. eventually train a classifier to help you mark other texts as quote /
>>    plagiarism
>>
>> HTH,
>> Tommaso
>>
>> 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
>>
>> > Actually I need a specialized algorithm. I want to use that algorithm to
>> > detect duplicate blog posts.
>> >
>> > 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
>> >
>> > > Hi,
>> > >
>> > > You may leverage and / or improve the MLT component [1].
>> > >
>> > > HTH,
>> > > Tommaso
>> > >
>> > > [1] : http://wiki.apache.org/solr/MoreLikeThis
>> > >
>> > > 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
>> > >
>> > > > Hi;
>> > > >
>> > > > Sometimes a huge part of a document may exist in another document,
>> > > > as in student plagiarism or quotation of a blog post in another
>> > > > blog post. Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.)
>> > > > have any class to detect it?
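
Steps 1-4 of Tommaso's outline in the quoted thread can be prototyped without Solr to check whether shingles separate quotes from unrelated posts on your data. The sketch below is an in-memory stand-in, not Solr's MoreLikeThis: it scores each original by the fraction of its word shingles found in the candidate. The names shingles and top_matches, the shingle size n=3 and the cut-off top=3 are assumptions of mine.

```python
import re


def shingles(text, n=3):
    """Return the set of n-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def top_matches(candidate, originals, n=3, top=3):
    """Rank originals by the share of their shingles found in the candidate.

    `originals` maps a document id to its text and plays the role of the
    index of "original" blog posts.
    """
    cand = shingles(candidate, n)
    scored = []
    for doc_id, text in originals.items():
        orig = shingles(text, n)
        overlap = len(cand & orig) / len(orig) if orig else 0.0
        scored.append((overlap, doc_id))
    scored.sort(reverse=True)
    # Keep only the best few hits that share at least one shingle.
    return [(doc_id, score) for score, doc_id in scored[:top] if score > 0]
```

The returned (id, score) pairs would be the "first 2-3 hits" to review or to feed to a classifier later. In Solr itself the ranking would come from the MLT query rather than from this overlap score.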