Thanks for your comments.

2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>

> if you need a specialized algorithm for detecting blogposts plagiarism /
> quotations (which are different tasks IMHO) I think you have 2 options:
> 1. implement a dedicated one based on your features / metrics / domain
> 2. try to fine tune an existing algorithm that is flexible enough
>
> If I were to do it with Solr I'd probably do something like:
> 1. index "original" blogposts in Solr (possibly using Jack's suggestion
>    about ngrams / shingles)
> 2. do MLT queries with "candidate blogposts copies" text
> 3. get the first, say, 2-3 hits
> 4. mark it as quote / plagiarism
> 5. eventually train a classifier to help you mark other texts as quote /
>    plagiarism
>
> HTH,
> Tommaso
>
>
> 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
>
> > Actually I need a specialized algorithm. I want to use that algorithm to
> > detect duplicate blog posts.
> >
> > 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
> >
> > > Hi,
> > >
> > > you may leverage and / or improve the MLT component [1].
> > >
> > > HTH,
> > > Tommaso
> > >
> > > [1] : http://wiki.apache.org/solr/MoreLikeThis
> > >
> > >
> > > 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
> > >
> > > > Hi;
> > > >
> > > > Sometimes a huge part of a document may exist in another document, as
> > > > in student plagiarism or quotation of a blog post in another blog
> > > > post. Does Solr/Lucene or any of its libraries (UIMA, OpenNLP, etc.)
> > > > have a class to detect it?
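
To make the shingle idea in the quoted steps concrete, here is a rough standalone sketch in plain Python. It does not use Solr or the MLT component at all: an in-memory dict stands in for the index, and a simple containment score (the share of the candidate's word shingles that also occur in an original) stands in for the MLT ranking. The names shingles, containment and top_matches, the shingle size of 3 and the 0.5 threshold are all assumptions made for illustration, not anything from Solr or Lucene.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) found in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(candidate, original, k=3):
    """Fraction of the candidate's shingles that also occur in the original."""
    cand = shingles(candidate, k)
    if not cand:
        return 0.0
    return len(cand & shingles(original, k)) / len(cand)


def top_matches(candidate, originals, k=3, hits=3, threshold=0.5):
    """Rank originals by containment; keep the first few hits above threshold."""
    scored = sorted(
        ((containment(candidate, text, k), doc_id)
         for doc_id, text in originals.items()),
        reverse=True,
    )
    return [(doc_id, score) for score, doc_id in scored[:hits]
            if score >= threshold]


# Hypothetical data: "originals" plays the role of the indexed blog posts.
originals = {
    "post-1": "the quick brown fox jumps over the lazy dog near the river",
    "post-2": "solr is a search platform built on lucene",
}
candidate = "As someone wrote, the quick brown fox jumps over the lazy dog"

print(top_matches(candidate, originals))  # [('post-1', 0.7)]
```

Containment is used here rather than a symmetric similarity because a short quotation inside a long post would otherwise score low. The threshold would need tuning on real data, and whatever passes it could serve as labelled examples for the classifier mentioned in step 5.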