If you need a specialized algorithm for detecting plagiarism or quotations in blog posts (which are different tasks, IMHO), I think you have two options:

1. implement a dedicated one based on your features / metrics / domain
2. try to fine-tune an existing algorithm that is flexible enough
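As a rough illustration of what option 1 could look like (this is only a sketch, not something proposed in this thread and not a Solr or Lucene API), a dedicated measure might compare word shingles and report how much of one text is contained in another. The function names `shingles`, `containment` and `label`, the shingle size `k`, and the two thresholds are all hypothetical and would need tuning on real blog posts.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(candidate, original, k=5):
    """Fraction of the candidate's shingles that also occur in the original."""
    cand = shingles(candidate, k)
    if not cand:
        return 0.0
    return len(cand & shingles(original, k)) / len(cand)


def label(candidate, original, k=5, quote_at=0.1, copy_at=0.6):
    """Assumed thresholds: a small overlap is a quote, a large one a copy."""
    score = containment(candidate, original, k)
    if score >= copy_at:
        return "plagiarism"
    if score >= quote_at:
        return "quote"
    return "unrelated"
```

Containment is asymmetric on purpose: a short quotation inside a long post gives a low score for the post, while a post that is mostly copied gives a high one, which is one way to keep the two tasks apart.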
If I were to do it with Solr, I'd probably do something like this:

1. index the "original" blog posts in Solr (possibly using Jack's suggestion about ngrams / shingles)
2. run MLT queries with the text of each candidate blog post copy
3. take the first, say, 2-3 hits
4. mark the candidate as quote / plagiarism
5. eventually train a classifier to help you mark other texts as quote / plagiarism

HTH,
Tommaso

2013/7/23 Furkan KAMACI <furkankam...@gmail.com>

> Actually I need a specialized algorithm. I want to use that algorithm to
> detect duplicate blog posts.
>
> 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
>
> > Hi,
> >
> > I think you may leverage and / or improve the MLT component [1].
> >
> > HTH,
> > Tommaso
> >
> > [1] : http://wiki.apache.org/solr/MoreLikeThis
> >
> > 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
> >
> > > Hi;
> > >
> > > Sometimes a huge part of a document may exist in another document, as
> > > in student plagiarism or the quotation of a blog post in another blog
> > > post. Does Solr/Lucene or any of its libraries (UIMA, OpenNLP, etc.)
> > > have a class to detect it?