We considered MLT component to implemented a sort of "near exact duplicate detection" - which is probably very similar to your task.
http://wiki.apache.org/solr/MoreLikeThis You may think of MoreLikeThis as a two phase process (transform a document to query and run it): 1a) it tokenizes input document to words stream 1b) removes too long and too short words (mlt.minwl, mlt.maxwl parameters) and rare words based on their appearance in this document and entire corpus (mlt.mintf, mlt.mindf parameters) 1c) optionally assigns a boost(tf-idf coeff) to remaining words (mlt.boost parameter) 1d) then it takes top N=mlt.maxqt words to shorten the query 2) executes a "regular" query against fields listed in mlt.fl As you see there's no magic here. All steps are configurable and you can chose balance between performance and quality. Our near exact duplication algorithm relying of a fact that a document should be most similar to itself, so if we have D1...D5 documents, then - MLT(D2)={D2, D1, D4, D5, D4}. In order to achieve this we had to disable field Norms to avoid shorter documents being more similar to given document then itself. Therefore we can normalize document score as 100% and consider as a "near exact duplicate" every other doc which returned from MLT with score above certain threshold (90% in our case). We also added a secret sauce - filter documents by length in spot of +/- 20%. Otherwise, because of "fingerprint" nature of MLT we get incorrectly identified documents that were too long or too short. In a nutshell the flow is as described above: 1) index document - to ensure it will be returned by MLT 2) execute MLT 3) take docs which length is inside +/-20% range of this document 4) normalize all scores by score of first document (should be itself) 5) take docs which score is >90% Hope it helps, please let me know if you need more details. Alex -----Original Message----- From: Furkan KAMACI [mailto:furkankam...@gmail.com] Sent: Thursday, July 25, 2013 11:18 To: solr-user@lucene.apache.org Subject: Re: Document Similarity Algorithm at Solr/Lucene BTW, How Solr's MoreLikeThis Component works? Which algorithm does it use at underlying? 2013/7/24 Roman Chyla <roman.ch...@gmail.com> > This paper contains an excellent algorithm for plagiarism detection, > but beware the published version had a mistake in the algorithm - look > for corrections - I can't find them now, but I know they have been > published (perhaps by one of the co-authors). You could do it with > solr, to create an index of hashes, with the twist of storing position > of the original text (source of the hash) together with the token and > the solr highlighting would do the rest for you :) > > roman > > > On Tue, Jul 23, 2013 at 11:07 AM, Shashi Kant <sk...@sloan.mit.edu> wrote: > > > Here is a paper that I found useful: > > http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf > > > > > > On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI > > <furkankam...@gmail.com> > > wrote: > > > Thanks for your comments. > > > > > > 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com> > > > > > >> if you need a specialized algorithm for detecting blogposts > plagiarism / > > >> quotations (which are different tasks IMHO) I think you have 2 > options: > > >> 1. implement a dedicated one based on your features / metrics / > > >> domain 2. try to fine tune an existing algorithm that is flexible > > >> enough > > >> > > >> If I were to do it with Solr I'd probably do something like: > > >> 1. index "original" blogposts in Solr (possibly using Jack's > suggestion > > >> about ngrams / shingles) > > >> 2. do MLT queries with "candidate blogposts copies" text 3. get > > >> the first, say, 2-3 hits 4. mark it as quote / plagiarism 5. > > >> eventually train a classifier to help you mark other texts as > quote / > > >> plagiarism > > >> > > >> HTH, > > >> Tommaso > > >> > > >> > > >> > > >> 2013/7/23 Furkan KAMACI <furkankam...@gmail.com> > > >> > > >> > Actually I need a specialized algorithm. I want to use that > algorithm > > to > > >> > detect duplicate blog posts. > > >> > > > >> > 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com> > > >> > > > >> > > Hi, > > >> > > > > >> > > I you may leverage and / or improve MLT component [1]. > > >> > > > > >> > > HTH, > > >> > > Tommaso > > >> > > > > >> > > [1] : http://wiki.apache.org/solr/MoreLikeThis > > >> > > > > >> > > > > >> > > 2013/7/23 Furkan KAMACI <furkankam...@gmail.com> > > >> > > > > >> > > > Hi; > > >> > > > > > >> > > > Sometimes a huge part of a document may exist in another > > document. As > > >> > > like > > >> > > > in student plagiarism or quotation of a blog post at > > >> > > > another > blog > > >> post. > > >> > > > Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has > > >> > > > any > > class > > >> > to > > >> > > > detect it? > > >> > > > > > >> > > > > >> > > > >> > > >