Sounds a lot like multi-tenancy, where you don't want the document
frequencies of one tenant to influence the query relevancy scores for other
tenants.
No ready solution.
That said, I have thought of a simplified document scoring using just tf and
leaving out df/idf. Not as good as tf*idf or BM25 s
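
A minimal sketch of the tf-only idea above, in Python. It is not Solr or Lucene code. The function name `tf_only_score`, the square-root damping, and the sample documents are all assumptions made for illustration. The point it shows is that a score built only from term frequency depends on nothing outside the document being scored, so one tenant's documents cannot shift another tenant's scores.

```python
import math
from collections import Counter


def tf_only_score(query_terms, doc_tokens):
    """Score one document using term frequency alone (no df/idf).

    Hypothetical helper: the sqrt damping is an assumption, chosen so that
    repeated occurrences of a term give diminishing returns.
    """
    counts = Counter(doc_tokens)
    # Counter returns 0 for missing terms, so absent terms contribute nothing.
    return sum(math.sqrt(counts[term]) for term in query_terms)


# Made-up sample documents from two tenants.
docs = {
    "tenant_a_doc": "solr index solr query".split(),
    "tenant_b_doc": "lucene index query".split(),
}

query = ["solr", "query"]
scores = {name: tf_only_score(query, tokens) for name, tokens in docs.items()}
print(scores)
```

Because no corpus-wide statistic enters the formula, adding or removing documents for one tenant leaves every other document's score unchanged. The cost is that common and rare terms are weighted the same.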
ar-duplicate detection.
>> >
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> > - Original Message
>> > From: Mike Klaas <[EMAIL PROTECTED]>
>> > To: solr-use
> >>
> >> On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> >> wrote:
> >>
> >>
> >> > To whomever started this thread: look at Nutch. I believe something
> >> > related to this already exists in Nutch for near-
On 21-Nov-07, at 12:29 AM, climbingrose wrote:
The problem with this approach is that an MD5 hash is very sensitive: a
one-letter difference will generate a completely different hash. You
will probably have to roll your own near-duplicate detection algorithm.
My advice is to have a look at existing literature on
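
To illustrate the point about MD5 sensitivity, here is a small Python sketch. It compares two texts that differ by one letter, first by MD5 digest and then by Jaccard similarity over word shingles. Shingling is one common technique from the near-duplicate literature. It is used here as an assumption, not as the method anyone in this thread proposed. The helper names and the shingle size `k=3` are hypothetical.

```python
import hashlib


def md5_hex(text):
    """Exact-match fingerprint: any change to the text changes the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the two texts' shingle sets, between 0 and 1."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


original = "We want to find documents which are similar in content"
edited = "We want to find documents which are similar in contents"

same_hash = md5_hex(original) == md5_hex(edited)
similarity = jaccard(original, edited)
print(same_hash)   # the digests differ, so this is False
print(similarity)  # 7 shared shingles out of 9 distinct ones
```

The hash comparison only answers "identical or not", while the shingle overlap gives a graded score that a threshold can be applied to. Choosing that threshold and the shingle size is application-specific.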
Original Message
> From: Mike Klaas <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Sunday, November 18, 2007 11:08:38 PM
> Subject: Re: Near Duplicate Documents
>
> On 18-Nov-07, at 8:17 AM, Eswar K wrote:
>
> > Is there any idea impleme
> > > To whomever started this thread: look at Nutch. I believe something
> > > related to this already exists in Nutch for near-duplicate detection.
> > >
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
>
> > From: Mike Klaas <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Sunday, November 18, 2007 11:08:38 PM
> > Subject: Re: Near Duplicate Documents
> >
> > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> >
> > > Is the
--
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message
> From: Mike Klaas <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Sunday, November 18, 2007 11:08:38 PM
> Subject: Re: Near Duplicate Documents
>
> On 18-Nov-
r-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents
On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> Is there any plan to implement that feature in the upcoming
> releases?
Not currently. Feel free to contribute something if you find a good
solution
On 18-Nov-07, at 8:17 AM, Eswar K wrote:
Is there any plan to implement that feature in the upcoming releases?
Not currently. Feel free to contribute something if you find a good
solution.
-Mike
On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
On Nov 18, 2007 10:50
Eswar K wrote:
We have a scenario where we want to find documents that are similar in
content. To elaborate a little more on what we mean here, let's take an
example.
This email chain, in which we are interacting, can best be used to
illustrate the concept of near dupes
Is there any plan to implement that feature in the upcoming releases?
Regards,
Eswar
On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> > We have a scenario, where we want to find out documents which are
> similar in
On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> We have a scenario where we want to find documents that are similar in
> content. To elaborate a little more on what we mean here, let's take an
> example.
>
> The example of this email chain in which we are interacting on, can be
We have a scenario where we want to find documents that are similar in
content. To elaborate a little more on what we mean here, let's take an
example.
This email chain, in which we are interacting, can best be used to
illustrate the concept of near dupes (We are not getti
I'm not sure I understand your question...
A "near duplicate document" could mean a LOT of things depending on the
context.
Perhaps you just need "fuzzy searching"?
http://lucene.apache.org/java/docs/queryparsersyntax.html#Fuzzy%20Searches
or "proximity searches"?
http://lucene.apache.org/jav
Can anyone help me?
Rishabh
rishabh9 wrote:
>
> Hi,
>
> I am evaluating "Solr 1.2" for my project and wanted to know if it can
> return near duplicate documents (near dups) and how do I go about it? I am
> not sure, but is "MoreLikeThisHandler" the implementation for near dups?
>
> Rishabh
>