Sounds a lot like multi-tenancy, where you don't want the document
frequencies of one tenant to influence the query relevancy scores for other
tenants.
No ready solution.
That said, I have thought of a simplified document scoring that uses just tf
and leaves out df/idf. Not as good as tf*idf or BM25 s
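A minimal sketch of the tf-only idea mentioned above, assuming whitespace tokenization and a raw count of query terms. The function name `tf_score` is hypothetical and this is not Lucene's or Solr's scoring API. It only shows that a score built from tf alone depends on the scored document and nothing else in the index, which is the property wanted for multi-tenancy.

```python
from collections import Counter


def tf_score(query: str, document: str) -> float:
    """Score a document by summing the raw term frequency of each distinct query term.

    No document frequency (df) or inverse document frequency (idf) is used,
    so documents belonging to other tenants cannot change this score.
    """
    counts = Counter(document.lower().split())
    # Use distinct query terms so a repeated query word is not counted twice.
    return float(sum(counts[term] for term in set(query.lower().split())))


score = tf_score("solr index", "Solr solr index nutch")
```

The trade-off is the one the post already names: without idf, common words weigh as much as rare ones, so ranking quality is lower than with tf*idf or BM25.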
Hey Solr people:
Suppose that we did not want to break up our document set into separate
indexes, but had certain cases where many versions of a document were not
relevant for certain searches.
I guess this could be thought of as an "authorization" class of problem,
however it is not that
there is an
>> implementation to obtain de-duplicate documents/pages but none for Near
>> Duplicates documents. Can you guide me a little further as to where exactly
>> under Nutch I should be concentrating, regarding near duplicate documents?
>>
v 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> >> Otis,
> >>
> >> Thanks for your response.
> >>
> >> I just gave a quick look to the Nutch Forum and find that there is an
> >> implementation to obtain de-duplicate documents/pages
On 21-Nov-07, at 12:29 AM, climbingrose wrote:
The problem with this approach is that an MD5 hash is very sensitive: a
one-letter difference will generate a completely different hash. You
probably have to roll your own near duplication detection algorithm.
My advice is to have a look at existing literature on
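To illustrate the point about MD5 sensitivity, here is a small sketch that compares an exact hash against word shingling with Jaccard similarity. Shingling is one commonly described near-duplicate technique, swapped in here for illustration only; it is not what Nutch or Solr implements, and the helper names, the shingle size `k=3`, and the sample sentences are all assumptions.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Exact fingerprint: any change to the text changes the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = ("I am evaluating Solr 1.2 for my project and wanted to know "
         "if it can return near duplicate documents")
doc_b = ("I am evaluating Solr 1.2 for my project and wanted to know "
         "if it can return near duplicate document")  # one letter removed

same_hash = md5_hex(doc_a) == md5_hex(doc_b)               # exact match fails
similarity = jaccard(shingles(doc_a), shingles(doc_b))     # still close to 1
```

Two documents could then be treated as near duplicates when `similarity` exceeds a chosen threshold; the threshold is a tuning decision and is not specified anywhere in this thread.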
implementation to obtain de-duplicate documents/pages but none for Near
Duplicates documents. Can you guide me a little further as to where exactly
> under Nutch I should be concentrating, regarding near duplicate documents?
>
> Regards,
Rishabh
On Nov 21, 2007 12:41 PM, Otis
> > Otis,
> >
> > Thanks for your response.
> >
> > I just gave a quick look to the Nutch Forum and find that there is an
> > implementation to obtain de-duplicate documents/pages but none for Near
> > Duplicates documents. Can you guide me a little furth
licates documents. Can you guide me a little further as to where exactly
> under Nutch I should be concentrating, regarding near duplicate documents?
>
> Regards,
> Rishabh
>
> On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> wrote:
>
>
> > To whom
, regarding near duplicate documents?
Regards,
Rishabh
On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:
> To whomever started this thread: look at Nutch. I believe something
> related to this already exists in Nutch for near-duplicate detection.
>
> Otis
>
r-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents
On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> Is there any idea implementing that feature in the upcoming releases?
Not currently. Feel free to contribute something if you find a good
solution.
-Mike
On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
On Nov 18, 2007 10:50
Eswar K wrote:
We have a scenario where we want to find documents which are similar in
content. To elaborate a little more on what we mean here, let's take an
example.
The example of this email chain, in which we are interacting, can best be
used to illustrate the concept of near dupes
Is there any idea implementing that feature in the upcoming releases?
Regards,
Eswar
On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> > We have a scenario, where we want to find out documents which are
> similar in
On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> We have a scenario where we want to find documents which are similar in
> content. To elaborate a little more on what we mean here, let's take an
> example.
>
> The example of this email chain in which we are interacting on, can be
o search for other similar documents based on the results of
> another query.
>
> ryan
>
>
> rishabh9 wrote:
> > Can anyone help me?
> >
> > Rishabh
> >
> >
> > rishabh9 wrote:
> >> Hi,
> >>
> >> I am evaluating "Solr 1.2"
ishabh
rishabh9 wrote:
Hi,
I am evaluating "Solr 1.2" for my project and wanted to know whether it can
return near duplicate documents (near dups) and, if so, how I should go about
it. I am not sure, but is "MoreLikeThisHandler" the implementation for near dups?
Rishabh
Can anyone help me?
Rishabh
rishabh9 wrote:
>
> Hi,
>
> I am evaluating "Solr 1.2" for my project and wanted to know if it can
> return near duplicate documents (near dups) and how do i go about it? I am
> not sure, but is "MoreLikeThisHandler" the impl