Block-quoting and plagiarism are two different questions.
Block-quoting is simple: split the text into sentences or paragraphs and
index each one as a separate document. Facet on the post-analysis text,
then pull the facet counts: any sentence or paragraph with a count above
one appears in more than one place, so block quotes will be clear.
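A rough in-memory sketch of that idea, not a Solr configuration: the splitter and the normalize() step below are simplified stand-ins for a real analysis chain, and all function names are made up for illustration. In Solr, each sentence would be its own document and a facet on the analyzed field would supply the counts.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence):
    """Stand-in for analysis: lowercase, drop punctuation, collapse spaces."""
    return " ".join(re.findall(r"\w+", sentence.lower()))


def shared_sentences(docs, min_docs=2):
    """Map each normalized sentence to the ids of the documents containing it,
    keeping only sentences that occur in at least min_docs documents."""
    seen = {}
    for doc_id, text in docs.items():
        for sentence in split_sentences(text):
            key = normalize(sentence)
            if key:
                seen.setdefault(key, set()).add(doc_id)
    return {k: sorted(v) for k, v in seen.items() if len(v) >= min_docs}


docs = {
    "post-a": "Solr is a search server. It is built on Lucene. I like it.",
    "post-b": "A friend wrote this. It is built on Lucene! That seems right.",
}
result = shared_sentences(docs)
print(result)
```

Here the sentence shared by both posts is the only one reported. Paragraph-level splitting works the same way with a different splitter.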
Mahout has a scala
Sent: Thursday, July 25, 2013 11:18
To: solr-user@lucene.apache.org
Subject: Re: Document Similarity Algorithm at Solr/Lucene
BTW, how does Solr's MoreLikeThis component work? Which algorithm does it
use under the hood?
2013/7/24 Roman Chyla
> This paper contains an excellent algorithm for plagiarism detection, but
> beware the published version had a mistake in the algorithm - look for
> corrections - I can't find them no
This paper contains an excellent algorithm for plagiarism detection, but
beware the published version had a mistake in the algorithm - look for
corrections - I can't find them now, but I know they have been published
(perhaps by one of the co-authors). You could do it with solr, to create an
index
Here is a paper that I found useful:
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
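For readers who want a feel for the approach before opening the PDF: as far as I can tell the link points to the winnowing paper on document fingerprinting (Schleimer, Wilkerson, Aiken). Below is a minimal sketch of winnowing under my own assumptions: CRC32 as the hash, character k-grams over text with non-word characters removed, and small k and window values picked only for the example. The function names are hypothetical, and this is not code from the paper.

```python
import re
import zlib


def kgram_hashes(text, k=5):
    """Hash every k-character substring of the lowercased text
    after removing non-word characters."""
    cleaned = re.sub(r"\W+", "", text.lower())
    return [zlib.crc32(cleaned[i:i + k].encode("utf-8"))
            for i in range(len(cleaned) - k + 1)]


def winnow(hashes, window=4):
    """Keep the minimum hash of each window of consecutive hashes.
    On ties, keep the rightmost occurrence. Returns (hash, position) pairs."""
    fingerprints = set()
    for start in range(len(hashes) - window + 1):
        win = hashes[start:start + window]
        smallest = min(win)
        # Index of the rightmost occurrence of the minimum inside the window.
        offset = window - 1 - win[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def fingerprint_overlap(a, b, k=5, window=4):
    """Fraction of the smaller fingerprint set that also occurs in the other."""
    fa = {h for h, _ in winnow(kgram_hashes(a, k), window)}
    fb = {h for h, _ in winnow(kgram_hashes(b, k), window)}
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / min(len(fa), len(fb))


original = "the quick brown fox jumps over the lazy dog"
copied = "xx the quick brown fox jumps yy"
print(fingerprint_overlap(original, copied))
```

Because every window contributes a fingerprint, any shared run of at least window + k - 1 cleaned characters produces at least one common fingerprint, which is what makes the scheme usable for finding copied passages. The fingerprints could then be indexed as terms in Solr, but that part is left out here.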
On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI wrote:
> Thanks for your comments.
>
> 2013/7/23 Tommaso Teofili
>
>> if you need a specialized algorithm for detecting blogposts plagiarism /
Thanks for your comments.
2013/7/23 Tommaso Teofili
> if you need a specialized algorithm for detecting blogposts plagiarism /
> quotations (which are different tasks IMHO) I think you have 2 options:
> 1. implement a dedicated one based on your features / metrics / domain
> 2. try to fine tune
If you need a specialized algorithm for detecting blog post plagiarism /
quotations (which are different tasks IMHO), I think you have two options:
1. implement a dedicated one based on your features / metrics / domain
2. try to fine tune an existing algorithm that is flexible enough
If I were to do
On 7/23/2013 3:33 AM, Furkan KAMACI wrote:
> Sometimes a large part of one document appears in another, as in student
> plagiarism or a blog post quoted in another blog post. Does Solr/Lucene
> or one of its libraries (UIMA, OpenNLP, etc.) have a class to detect
> this?
Solr is designed
l result count will be high, but scoring assures
that the top results will be more relevant.
-- Jack Krupansky
-----Original Message-----
From: Furkan KAMACI
Sent: Tuesday, July 23, 2013 6:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Document Similarity Algorithm at Solr/Lucene
Actually I need a specialized algorithm. I want to use that algorithm to
detect duplicate blog posts.
2013/7/23 Tommaso Teofili
> Hi,
>
> You may leverage and/or improve the MLT component [1].
>
> HTH,
> Tommaso
>
> [1] : http://wiki.apache.org/solr/MoreLikeThis
>
>
> 2013/7/23 Furkan KAMACI
>
Hi,
You may leverage and/or improve the MLT component [1].
HTH,
Tommaso
[1] : http://wiki.apache.org/solr/MoreLikeThis
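For anyone trying this out, a small sketch of what a MoreLikeThis request through the standard search handler might look like. MLT works by picking the most distinctive terms of the source document (by term frequency and document frequency) and running them as a query. The core name "blog", the field names, and the "id" key below are assumptions for illustration; the mlt.* parameters are the ones documented on the wiki page above.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, fields, count=5, min_tf=2, min_df=5):
    """Build a /select URL that asks Solr for documents similar to doc_id."""
    params = {
        "q": f"id:{doc_id}",          # assumes the unique key field is "id"
        "mlt": "true",                # enable the MoreLikeThis component
        "mlt.fl": ",".join(fields),   # fields to take candidate terms from
        "mlt.count": count,           # similar documents to return per result
        "mlt.mintf": min_tf,          # ignore terms rarer than this in the source doc
        "mlt.mindf": min_df,          # ignore terms found in fewer docs than this
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host, core and field names.
url = build_mlt_url("http://localhost:8983/solr/blog", "post-42", ["title", "body"])
print(url)
```

Lowering mlt.mintf and mlt.mindf makes MLT consider more terms, which may help on short blog posts at the cost of noisier matches.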
2013/7/23 Furkan KAMACI
> Hi;
>
> Sometimes a huge part of a document may exist in another document, as
> in student plagiarism or quotation of a blog post at another b