We used the MLT component to implement a sort of "near exact duplicate 
detection" - which is probably very similar to your task.

http://wiki.apache.org/solr/MoreLikeThis 

You may think of MoreLikeThis as a two-phase process (transform a document 
into a query, then run that query): 
        1a) it tokenizes the input document into a stream of words
        1b) it removes words that are too long or too short (mlt.minwl, mlt.maxwl 
parameters) and words that occur too rarely in this document or in the 
entire corpus (mlt.mintf, mlt.mindf parameters)
        1c) it optionally assigns a boost (tf-idf coefficient) to the remaining 
words (mlt.boost parameter)
        1d) it then takes the top N=mlt.maxqt words to keep the query short
        2) it executes a "regular" query against the fields listed in mlt.fl
As you can see, there is no magic here. All steps are configurable, and you can 
choose your own balance between performance and quality.
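
For illustration only, here is roughly how the query side could look from SolrJ 
(Solr 4.x style), assuming a MoreLikeThisHandler registered at /mlt in 
solrconfig.xml and a text field called "content" - the names and parameter 
values are placeholders, tune them for your own corpus:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class MltQueryExample {
    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("id:D2");   // seed document
        q.setRequestHandler("/mlt");            // MoreLikeThis handler
        q.set("mlt.fl", "content");             // field(s) to mine terms from
        q.set("mlt.minwl", 3);                  // drop too-short words
        q.set("mlt.maxwl", 30);                 // drop too-long words
        q.set("mlt.mintf", 2);                  // min term frequency in the seed doc
        q.set("mlt.mindf", 5);                  // min document frequency in the corpus
        q.set("mlt.boost", true);               // keep tf-idf boosts on the query terms
        q.set("mlt.maxqt", 25);                 // top N terms that make up the query
        q.setFields("id", "score");
        q.setRows(10);

        QueryResponse rsp = solr.query(q);
        SolrDocumentList hits = rsp.getResults();
        System.out.println("similar docs found: " + hits.getNumFound());
    }
}

The same parameters can of course be set as defaults on the handler in 
solrconfig.xml instead of per request.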

Our near exact duplicate detection relies on the fact that a document should be 
most similar to itself: given documents D1...D5, MLT(D2) should return D2 as 
its top hit, followed by the other documents.
To achieve this we had to disable field norms, so that shorter documents cannot 
score as more similar to a given document than the document itself.
We can then treat the document's own score as 100% and consider every other doc 
returned by MLT with a score above a certain threshold (90% in our case) to be 
a "near exact duplicate".

We also added a bit of secret sauce - filter documents by length, within a 
window of +/- 20% of the source document. Otherwise, because of the 
"fingerprint" nature of MLT, documents that are far too long or too short get 
incorrectly identified as duplicates.
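
One possible way to apply that window (MLT itself knows nothing about document 
length) is to index a word count next to each document and attach a range 
filter query to the MLT request. A small sketch - the "wordcount" field and the 
server-side fq are illustrative, not necessarily how you have to wire it:

import org.apache.solr.client.solrj.SolrQuery;

public class LengthFilteredMlt {
    /** Builds an MLT query restricted to docs within +/-20% of the seed's word count. */
    public static SolrQuery build(String seedId, long seedWordCount) {
        long lower = Math.round(seedWordCount * 0.8);   // -20%
        long upper = Math.round(seedWordCount * 1.2);   // +20%

        SolrQuery q = new SolrQuery("id:" + seedId);
        q.setRequestHandler("/mlt");
        q.set("mlt.fl", "content");
        // hypothetical numeric field holding the document length in words
        q.addFilterQuery("wordcount:[" + lower + " TO " + upper + "]");
        return q;
    }
}

Filtering on the Solr side keeps the result list short; doing the same check on 
the client after the MLT response works just as well.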

In a nutshell, the flow is as described above (a rough code sketch follows the 
list):
1) index the document - to ensure it will be returned by MLT
2) execute MLT for it
3) keep only the docs whose length is within +/- 20% of this document's length
4) normalize all scores by the score of the first document (which should be the 
document itself)
5) keep only the docs whose normalized score is > 90%
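
To make steps 3-5 concrete, here is a rough client-side sketch in SolrJ style. 
It assumes the MLT response was obtained so that the first hit is the document 
itself, as described above (if your handler setup excludes the seed from the 
result list, you would need to obtain its score separately, e.g. via 
mlt.match.include), and that each document carries a stored "wordcount" field:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class NearDuplicateFilter {

    private static final double SCORE_THRESHOLD = 0.90;  // step 5: > 90% of the self-score
    private static final double LENGTH_WINDOW   = 0.20;  // step 3: +/- 20% of the seed length

    /** Returns ids of near exact duplicates from an MLT result whose first hit is the seed doc. */
    public static List<String> nearDuplicates(SolrDocumentList hits) {
        List<String> duplicates = new ArrayList<String>();
        if (hits == null || hits.isEmpty()) {
            return duplicates;
        }

        // step 4: the first document should be the seed itself; its score defines 100%
        SolrDocument self = hits.get(0);
        double selfScore = ((Number) self.getFieldValue("score")).doubleValue();
        long   selfLen   = ((Number) self.getFieldValue("wordcount")).longValue();

        for (int i = 1; i < hits.size(); i++) {
            SolrDocument hit = hits.get(i);
            long   len      = ((Number) hit.getFieldValue("wordcount")).longValue();
            double relScore = ((Number) hit.getFieldValue("score")).doubleValue() / selfScore;

            // step 3: keep only docs within +/-20% of the seed document's length
            boolean lengthOk = len >= selfLen * (1 - LENGTH_WINDOW)
                            && len <= selfLen * (1 + LENGTH_WINDOW);

            // step 5: keep only docs scoring above the 90% threshold
            if (lengthOk && relScore > SCORE_THRESHOLD) {
                duplicates.add((String) hit.getFieldValue("id"));
            }
        }
        return duplicates;
    }
}

Disabling norms on the similarity field (omitNorms="true" on the field 
definition in schema.xml) is what makes the "first hit is the document itself" 
assumption hold in practice.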



Hope it helps. Please let me know if you need more details.


Alex 


-----Original Message-----
From: Furkan KAMACI [mailto:furkankam...@gmail.com] 
Sent: Thursday, July 25, 2013 11:18
To: solr-user@lucene.apache.org
Subject: Re: Document Similarity Algorithm at Solr/Lucene

BTW, how does Solr's MoreLikeThis component work? Which algorithm does it use 
under the hood?


2013/7/24 Roman Chyla <roman.ch...@gmail.com>

> This paper contains an excellent algorithm for plagiarism detection, 
> but beware: the published version had a mistake in the algorithm - look 
> for the corrections - I can't find them now, but I know they have been 
> published (perhaps by one of the co-authors). You could do it with 
> Solr by creating an index of hashes, with the twist of storing the position 
> of the original text (the source of the hash) together with the token; 
> Solr highlighting would then do the rest for you :)
>
> roman
>
>
> On Tue, Jul 23, 2013 at 11:07 AM, Shashi Kant <sk...@sloan.mit.edu> wrote:
>
> > Here is a paper that I found useful:
> > http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
> >
> >
> > On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI 
> > <furkankam...@gmail.com>
> > wrote:
> > > Thanks for your comments.
> > >
> > > 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
> > >
> > >> if you need a specialized algorithm for detecting blogpost plagiarism /
> > >> quotations (which are different tasks IMHO) I think you have 2 options:
> > >> 1. implement a dedicated one based on your features / metrics / domain
> > >> 2. try to fine-tune an existing algorithm that is flexible enough
> > >>
> > >> If I were to do it with Solr I'd probably do something like:
> > >> 1. index "original" blogposts in Solr (possibly using Jack's suggestion
> > >>    about ngrams / shingles)
> > >> 2. do MLT queries with the "candidate blogpost copies" text
> > >> 3. get the first, say, 2-3 hits
> > >> 4. mark it as quote / plagiarism
> > >> 5. eventually train a classifier to help you mark other texts as quote /
> > >>    plagiarism
> > >>
> > >> HTH,
> > >> Tommaso
> > >>
> > >>
> > >>
> > >> 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
> > >>
> > >> > Actually I need a specialized algorithm. I want to use that
> > >> > algorithm to detect duplicate blog posts.
> > >> >
> > >> > 2013/7/23 Tommaso Teofili <tommaso.teof...@gmail.com>
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I think you may leverage and / or improve the MLT component [1].
> > >> > >
> > >> > > HTH,
> > >> > > Tommaso
> > >> > >
> > >> > > [1] : http://wiki.apache.org/solr/MoreLikeThis
> > >> > >
> > >> > >
> > >> > > 2013/7/23 Furkan KAMACI <furkankam...@gmail.com>
> > >> > >
> > >> > > > Hi;
> > >> > > >
> > >> > > > Sometimes a huge part of a document may exist in another document,
> > >> > > > as in student plagiarism or the quotation of a blog post in another
> > >> > > > blog post. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.)
> > >> > > > have any class to detect this?
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>
