Re: Document Similarity Algorithm at Solr/Lucene

2013-08-07 Thread Lance Norskog
Block-quoting and plagiarism are two different questions. Block-quoting is simple: break the text apart into sentences or even paragraphs and make them separate documents. Make facets of the post-analysis text. Now just pull counts of facets and block quotes will be clear. Mahout has a scala
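The sentence-level idea above can be sketched outside Solr in a few lines. This is only an illustration under stated assumptions: the splitter is a naive regex, `normalize` is a stand-in for whatever the Solr analysis chain would produce, and the names `shared_sentences`, `min_docs` and `min_words` are hypothetical. In Solr itself the counts would come from faceting on the analyzed sentence field.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence):
    # Stand-in for "post-analysis text": lowercase and keep word tokens only.
    return " ".join(re.findall(r"\w+", sentence.lower()))


def shared_sentences(docs, min_docs=2, min_words=4):
    # Map each normalized sentence to the set of documents containing it,
    # which mirrors a facet count over sentence-level documents.
    seen = defaultdict(set)
    for doc_id, text in docs.items():
        for sentence in split_sentences(text):
            key = normalize(sentence)
            if len(key.split()) >= min_words:
                seen[key].add(doc_id)
    # Keep only sentences that occur in at least min_docs documents.
    return {k: sorted(v) for k, v in seen.items() if len(v) >= min_docs}
```

Sentences that appear in two or more posts are candidate block quotes; very short sentences are skipped because they repeat by chance.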

RE: Document Similarity Algorithm at Solr/Lucene

2013-08-05 Thread Alexey Kozhemiakin
ent: Thursday, July 25, 2013 11:18
To: solr-user@lucene.apache.org
Subject: Re: Document Similarity Algorithm at Solr/Lucene
BTW, how does Solr's MoreLikeThis component work? Which algorithm does it use underneath?
2013/7/24 Roman Chyla
> This paper contains an excellent algorithm for plagiarism dete

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-25 Thread Furkan KAMACI
BTW, how does Solr's MoreLikeThis component work? Which algorithm does it use underneath?
2013/7/24 Roman Chyla
> This paper contains an excellent algorithm for plagiarism detection, but
> beware the published version had a mistake in the algorithm - look for
> corrections - I can't find them no

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-24 Thread Roman Chyla
This paper contains an excellent algorithm for plagiarism detection, but beware: the published version had a mistake in the algorithm. Look for corrections - I can't find them now, but I know they have been published (perhaps by one of the co-authors). You could do it with Solr, to create an index

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-24 Thread Otis Gospodnetic
more relevant.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Furkan KAMACI
> Sent: Tuesday, July 23, 2013 6:16 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Document Similarity Algorithm at Solr/Lucene
>
> Actually I need a specialized algorithm. I want to

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Shashi Kant
Here is a paper that I found useful: http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI wrote:
> Thanks for your comments.
>
> 2013/7/23 Tommaso Teofili
>
>> if you need a specialized algorithm for detecting blogposts plagiarism /
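Assuming the paper linked in this message is the winnowing document-fingerprinting paper (Schleimer, Wilkerson, Aiken, SIGMOD 2003), the core idea can be sketched as follows. This is a simplified illustration, not the paper's reference implementation: the MD5-based k-gram hash, the parameter values `k=5` and `w=4`, and the `containment` score are assumptions chosen for brevity.

```python
import hashlib
import re


def kgram_hashes(text, k):
    # Strip non-word characters and case so formatting changes do not matter,
    # then hash every run of k consecutive characters.
    cleaned = re.sub(r"\W+", "", text.lower())
    return [
        int.from_bytes(hashlib.md5(cleaned[i:i + k].encode("utf-8")).digest()[:8], "big")
        for i in range(len(cleaned) - k + 1)
    ]


def winnow(hashes, w):
    # Slide a window of w hashes and keep the minimum of each window,
    # breaking ties by taking the rightmost occurrence.
    selected = set()
    if not hashes:
        return selected
    for start in range(max(len(hashes) - w + 1, 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = len(window) - 1 - window[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected


def fingerprint(text, k=5, w=4):
    # Positions are dropped here; only the selected hash values are compared.
    return {h for h, _ in winnow(kgram_hashes(text, k), w)}


def containment(a, b, k=5, w=4):
    # Fraction of a's fingerprints that also occur in b (0.0 if a has none).
    fa, fb = fingerprint(a, k, w), fingerprint(b, k, w)
    return len(fa & fb) / len(fa) if fa else 0.0
```

A containment score near 1.0 suggests that most of the first text appears inside the second, which fits the "huge part of a document exists in another document" case from the original question.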

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Furkan KAMACI
Thanks for your comments.
2013/7/23 Tommaso Teofili
> if you need a specialized algorithm for detecting blogposts plagiarism /
> quotations (which are different tasks IMHO) I think you have 2 options:
> 1. implement a dedicated one based on your features / metrics / domain
> 2. try to fine tune

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Tommaso Teofili
If you need a specialized algorithm for detecting blog post plagiarism / quotations (which are different tasks IMHO), I think you have 2 options:
1. implement a dedicated one based on your features / metrics / domain
2. try to fine-tune an existing algorithm that is flexible enough
If I were to do

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Shawn Heisey
On 7/23/2013 3:33 AM, Furkan KAMACI wrote:
> Sometimes a huge part of a document may exist in another document, as
> in student plagiarism or the quotation of one blog post in another.
> Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to
> detect it?
Solr is designed

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Jack Krupansky
l result count will be high, but scoring assures that the top results will be more relevant.
-- Jack Krupansky
-----Original Message----- From: Furkan KAMACI
Sent: Tuesday, July 23, 2013 6:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Document Similarity Algorithm at Solr/Lucene
Actually I

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Furkan KAMACI
Actually, I need a specialized algorithm. I want to use that algorithm to detect duplicate blog posts.
2013/7/23 Tommaso Teofili
> Hi,
>
> You may leverage and/or improve the MLT component [1].
>
> HTH,
> Tommaso
>
> [1] : http://wiki.apache.org/solr/MoreLikeThis
>
>
> 2013/7/23 Furkan KAMACI
>

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Tommaso Teofili
Hi,
You may leverage and/or improve the MLT component [1].
HTH,
Tommaso
[1] : http://wiki.apache.org/solr/MoreLikeThis
2013/7/23 Furkan KAMACI
> Hi;
>
> Sometimes a huge part of a document may exist in another document, as
> in student plagiarism or the quotation of a blog post in another b
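As a starting point for trying the MLT component mentioned above, a request can be built like this. It is a hedged sketch: the base URL, the core name `blog`, the document id and the field names `title` and `body` are hypothetical, and the `mlt.*` values are arbitrary starting points that would need tuning for duplicate detection. It also assumes the MoreLikeThis search component is available on the `/select` handler.

```python
from urllib.parse import urlencode


def mlt_url(base_url, doc_id, fields, count=5, mintf=2, mindf=5):
    # Ask Solr for documents similar to the one matched by q,
    # comparing on the given fields.
    params = {
        "q": f"id:{doc_id}",
        "mlt": "true",
        "mlt.fl": ",".join(fields),   # fields used to judge similarity
        "mlt.count": count,           # similar documents to return per result
        "mlt.mintf": mintf,           # minimum term frequency in the source doc
        "mlt.mindf": mindf,           # minimum document frequency in the index
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical core and document id, for illustration only.
    print(mlt_url("http://localhost:8983/solr/blog", "post-42", ["title", "body"]))
```

The similarity fields generally need stored content or term vectors for MLT to extract terms from them.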