Re: Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?

2016-02-15 Thread Jack Krupansky
Sounds a lot like multi-tenancy, where you don't want the document frequencies of one tenant to influence the query relevancy scores for other tenants. No ready solution. Although, I have thought of a simplified document scoring using just tf and leaving out df/idf. Not as good as tf*idf or BM25 s
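
The tf-only idea mentioned above could look roughly like the sketch below. This is not Solr's or Lucene's scoring code; the function name `tf_only_score` and the whitespace tokenization are assumptions made for illustration. Because no document-frequency term is involved, one tenant's documents cannot shift the scores of another tenant's documents.

```python
from collections import Counter

def tf_only_score(query: str, document: str) -> int:
    """Hypothetical scorer: sum of raw term frequencies of the query
    terms in the document, with no df/idf component at all."""
    counts = Counter(document.lower().split())
    # Each distinct query term contributes its raw count in the document.
    return sum(counts[term] for term in set(query.lower().split()))

if __name__ == "__main__":
    doc = "solr index tuning notes for the solr cluster"
    print(tf_only_score("solr index", doc))
```

The score depends only on the document being scored, which is the property wanted here; the trade-off, as the message says, is that common and rare terms are weighted the same.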

Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?

2016-02-15 Thread Chris Morley
Hey Solr people: Suppose that we did not want to break up our document set into separate indexes, but had certain cases where many versions of a document were not relevant for certain searches. I guess this could be thought of as an "authorization" class of problem, however it is not that

Re: Near Duplicate Documents

2007-11-23 Thread Ken Krugler
there is an >> implementation to obtain de-duplicate documents/pages but none for Near >> Duplicates documents. Can you guide me a little further as to where exactly > > under Nutch I should be concentrating, regarding near duplicate documents? > > >

Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
v 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote: > >> Otis, > >> > >> Thanks for your response. > >> > > > I just gave a quick look to the Nutch Forum and find that there is an > >> implementation to obtain de-duplicate documents/pages

Re: Near Duplicate Documents

2007-11-21 Thread Mike Klaas
On 21-Nov-07, at 12:29 AM, climbingrose wrote: The problem with this approach is MD5 hash is very sensitive: one letter difference will generate completely different hash. You probably have to roll your own near duplication detection algorithm. My advice is have a look at existing literature on
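
To illustrate the point about MD5 sensitivity, here is a minimal sketch. The MD5 part uses Python's standard `hashlib`. The word-shingle Jaccard comparison is one common technique from the near-duplicate literature, swapped in here purely as an example; it is not what Nutch or Solr implement, and the helper names `shingles` and `jaccard` and the shingle size `k=3` are assumptions.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Exact-match fingerprint: any change to the text changes the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river banks"
    # The digests differ although the texts differ by one letter.
    print(md5_hex(doc_a) == md5_hex(doc_b))
    # The shingle overlap stays high for the same pair.
    print(round(jaccard(doc_a, doc_b), 3))
```

A threshold on the similarity (for example, treating pairs above some cutoff as near duplicates) would have to be tuned per collection; the right value is not something this thread specifies.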

Re: Near Duplicate Documents

2007-11-21 Thread Ken Krugler
implementation to obtain de-duplicate documents/pages but none for Near Duplicates documents. Can you guide me a little further as to where exactly > under Nutch I should be concentrating, regarding near duplicate documents? > > Regards, Rishabh On Nov 21, 2007 12:41 PM, Otis

Re: Near Duplicate Documents

2007-11-21 Thread Rishabh Joshi
> > Otis, > > > > Thanks for your response. > > > > I just gave a quick look to the Nutch Forum and find that there is an > > implementation to obtain de-duplicate documents/pages but none for Near > > Duplicates documents. Can you guide me a little furth

Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
licates documents. Can you guide me a little further as to where exactly > under Nutch I should be concentrating, regarding near duplicate documents? > > Regards, > Rishabh > > On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> > wrote: > > > > To whom

Re: Near Duplicate Documents

2007-11-20 Thread Rishabh Joshi
, regarding near duplicate documents? Regards, Rishabh On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > To whomever started this thread: look at Nutch. I believe something > related to this already exists in Nutch for near-duplicate detection. > > Otis >

Re: Near Duplicate Documents

2007-11-20 Thread Otis Gospodnetic
r-user@lucene.apache.org Sent: Sunday, November 18, 2007 11:08:38 PM Subject: Re: Near Duplicate Documents On 18-Nov-07, at 8:17 AM, Eswar K wrote: > Is there any idea implementing that feature in the up coming releases? Not currently. Feel free to contribute something if you find a good solution

Re: Near Duplicate Documents

2007-11-18 Thread Mike Klaas
On 18-Nov-07, at 8:17 AM, Eswar K wrote: Is there any idea implementing that feature in the up coming releases? Not currently. Feel free to contribute something if you find a good solution . -Mike On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote: On Nov 18, 2007 10:50

Re: Near Duplicate Documents

2007-11-18 Thread Ryan McKinley
Eswar K wrote: We have a scenario, where we want to find out documents which are similar in content. To elaborate a little more on what we mean here, lets take an example. The example of this email chain in which we are interacting on, can be best used for illustrating the concept of near dupes

Re: Near Duplicate Documents

2007-11-18 Thread Eswar K
Is there any idea implementing that feature in the up coming releases? Regards, Eswar On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote: > On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote: > > We have a scenario, where we want to find out documents which are > similar in

Re: Near Duplicate Documents

2007-11-18 Thread Stuart Sierra
On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote: > We have a scenario, where we want to find out documents which are similar in > content. To elaborate a little more on what we mean here, lets take an > example. > > The example of this email chain in which we are interacting on, can be

Re: Near Duplicate Documents

2007-11-18 Thread Eswar K
o search for other similar documents based on the results of > another query. > > ryan > > > rishabh9 wrote: > > Can anyone help me? > > > > Rishabh > > > > > > rishabh9 wrote: > >> Hi, > >> > >> I am evaluating "Solr 1.2"

Re: Near Duplicate Documents

2007-11-18 Thread Ryan McKinley
ishabh rishabh9 wrote: Hi, I am evaluating "Solr 1.2" for my project and wanted to know if it can return near duplicate documents (near dups) and how do i go about it? I am not sure, but is "MoreLikeThisHandler" the implementation for near dups? Rishabh

Re: Near Duplicate Documents

2007-11-18 Thread rishabh9
Can anyone help me? Rishabh rishabh9 wrote: > > Hi, > > I am evaluating "Solr 1.2" for my project and wanted to know if it can > return near duplicate documents (near dups) and how do i go about it? I am > not sure, but is "MoreLikeThisHandler" the impl

Near Duplicate Documents

2007-11-16 Thread Rishabh Joshi
Hi, I am evaluating "Solr 1.2" for my project and wanted to know if it can return near duplicate documents (near dups) and how do i go about it? I am not sure, but is "MoreLikeThisHandler" the implementation for near dups? Rishabh