Re: Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?

2016-02-15 Thread Jack Krupansky
Sounds a lot like multi-tenancy, where you don't want the document frequencies of one tenant to influence the query relevancy scores for other tenants. No ready solution. Although, I have thought of a simplified document scoring using just tf and leaving out df/idf. Not as good as tf*idf or BM25 s
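
To make the tf-only idea above concrete, here is a minimal sketch in plain Python. It is not Solr or Lucene code: the function names, the sqrt(tf) damping, and the smoothed idf formula are assumptions chosen for illustration only. It shows that a tf-only score for one tenant's document stays the same when another tenant adds documents, while a tf*idf score shifts.

```python
import math
from collections import Counter


def tf_only_score(query_terms, doc_tokens):
    """Score a document using term frequency alone (hypothetical helper).

    No corpus statistics are consulted, so other tenants' documents
    cannot influence the result.
    """
    counts = Counter(doc_tokens)
    return sum(math.sqrt(counts[t]) for t in query_terms)


def tf_idf_score(query_terms, doc_tokens, corpus):
    """Score a document with tf*idf over a shared corpus (hypothetical helper).

    The idf part depends on every document in the index, including
    documents that belong to other tenants.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0  # smoothed idf, an assumption
        score += math.sqrt(counts[t]) * idf
    return score


# Tenant A's documents, then the same index after tenant B adds
# many documents that mention "index".
tenant_a = [["solr", "index"], ["solr", "query"]]
shared = tenant_a + [["index", "spam"] for _ in range(5)]

doc = tenant_a[0]
before = tf_idf_score(["index"], doc, tenant_a)
after = tf_idf_score(["index"], doc, shared)
print(before, after)                    # the two values differ
print(tf_only_score(["index"], doc))    # independent of the corpus
```

The trade-off is the one noted in the message: without idf, common and rare terms count equally, so ranking quality is likely to be lower.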

Re: Near Duplicate Documents

2007-11-23 Thread Ken Krugler
ar-duplicate detection. >> > >> > Otis >> > -- >> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch >> > >> > - Original Message >> > From: Mike Klaas <[EMAIL PROTECTED]> >> > To: solr-use

Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
bh > >> > >> On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> > >> wrote: > >> > >> > >> > To whomever started this thread: look at Nutch. I believe something > >> > related to this already exists in Nutch for near-

Re: Near Duplicate Documents

2007-11-21 Thread Mike Klaas
On 21-Nov-07, at 12:29 AM, climbingrose wrote: The problem with this approach is that an MD5 hash is very sensitive: a one-letter difference will generate a completely different hash. You probably have to roll your own near-duplicate detection algorithm. My advice is to have a look at the existing literature on
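
The MD5 point above can be illustrated with a short Python sketch. The comparison technique shown here, word shingles with Jaccard similarity, is one common approach from the near-duplicate literature and is an assumption for illustration; it is not necessarily what the poster had in mind, and the helper names and the shingle size k=3 are hypothetical.

```python
import hashlib


def md5_hex(text):
    """Exact-match fingerprint: any change in the text changes the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dogs near the river bank"

# One added letter: the MD5 digests no longer match.
print(md5_hex(doc1) == md5_hex(doc2))

# The shingle sets still overlap heavily, so a similarity threshold
# (its value would need tuning) can flag the pair as near duplicates.
print(jaccard(shingles(doc1), shingles(doc2)))
```

Comparing every pair of documents this way does not scale to a large index; the literature usually pairs shingling with a compact signature scheme so that candidates can be found without pairwise comparison.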

Re: Near Duplicate Documents

2007-11-21 Thread Ken Krugler
Original Message > From: Mike Klaas <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Sunday, November 18, 2007 11:08:38 PM > Subject: Re: Near Duplicate Documents > > On 18-Nov-07, at 8:17 AM, Eswar K wrote: > > > Is there any idea impleme

Re: Near Duplicate Documents

2007-11-21 Thread Rishabh Joshi
> To whomever started this thread: look at Nutch. I believe something > > > related to this already exists in Nutch for near-duplicate detection. > > > > > > Otis > > > -- > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > >

Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
> > From: Mike Klaas <[EMAIL PROTECTED]> > > To: solr-user@lucene.apache.org > > Sent: Sunday, November 18, 2007 11:08:38 PM > > Subject: Re: Near Duplicate Documents > > > > On 18-Nov-07, at 8:17 AM, Eswar K wrote: > > > > > Is the

Re: Near Duplicate Documents

2007-11-20 Thread Rishabh Joshi
-- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > - Original Message > From: Mike Klaas <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Sunday, November 18, 2007 11:08:38 PM > Subject: Re: Near Duplicate Documents > > On 18-Nov-

Re: Near Duplicate Documents

2007-11-20 Thread Otis Gospodnetic
r-user@lucene.apache.org Sent: Sunday, November 18, 2007 11:08:38 PM Subject: Re: Near Duplicate Documents On 18-Nov-07, at 8:17 AM, Eswar K wrote: > Is there any idea implementing that feature in the upcoming releases? Not currently. Feel free to contribute something if you find a good solution

Re: Near Duplicate Documents

2007-11-18 Thread Mike Klaas
On 18-Nov-07, at 8:17 AM, Eswar K wrote: Is there any idea implementing that feature in the upcoming releases? Not currently. Feel free to contribute something if you find a good solution. -Mike On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote: On Nov 18, 2007 10:50

Re: Near Duplicate Documents

2007-11-18 Thread Ryan McKinley
Eswar K wrote: We have a scenario where we want to find out documents which are similar in content. To elaborate a little more on what we mean here, let's take an example. The example of this email chain in which we are interacting can be best used for illustrating the concept of near dupes

Re: Near Duplicate Documents

2007-11-18 Thread Eswar K
Is there any idea implementing that feature in the upcoming releases? Regards, Eswar On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote: > On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote: > > We have a scenario where we want to find out documents which are > similar in

Re: Near Duplicate Documents

2007-11-18 Thread Stuart Sierra
On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote: > We have a scenario where we want to find out documents which are similar in > content. To elaborate a little more on what we mean here, let's take an > example. > > The example of this email chain in which we are interacting can be

Re: Near Duplicate Documents

2007-11-18 Thread Eswar K
We have a scenario where we want to find out documents which are similar in content. To elaborate a little more on what we mean here, let's take an example. The example of this email chain in which we are interacting can be best used for illustrating the concept of near dupes (We are not getti

Re: Near Duplicate Documents

2007-11-18 Thread Ryan McKinley
I'm not sure I understand your question... A "near duplicate document" could mean a LOT of things depending on the context. Perhaps you just need "fuzzy searching"? http://lucene.apache.org/java/docs/queryparsersyntax.html#Fuzzy%20Searches or "proximity searches"? http://lucene.apache.org/jav

Re: Near Duplicate Documents

2007-11-18 Thread rishabh9
Can anyone help me? Rishabh rishabh9 wrote: > > Hi, > > I am evaluating "Solr 1.2" for my project and wanted to know if it can > return near duplicate documents (near dups) and how do I go about it? I am > not sure, but is "MoreLikeThisHandler" the implementation for near dups? > > Rishabh >