Re: Near Duplicate Documents

Eswar K Sun, 18 Nov 2007 07:51:09 -0800

We have a scenario where we want to find documents that are similar in
content. To elaborate a little on what we mean, let's take an example.

This email thread itself is a good illustration of near dupes. (We are not
confusing near dupes with threads; they are two different things.) The system
treats each email in the thread as a document. A reply to the original mail
also includes the original mail, so it becomes a near duplicate of the
original mail (depending on the percentage of similarity), and the same holds
for each later reply. Near dupes need not be limited to emails.
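
To make "percentage of similarity" concrete, here is a minimal sketch in
Python, independent of Solr. It assumes word shingles (overlapping windows of
k words) and two common set measures: Jaccard similarity and containment. The
function names, the shingle size k=3, and the sample strings are illustrative
choices only, not anything Solr provides.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Shared shingles divided by all distinct shingles of both texts."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


def containment(original, reply, k=3):
    """Fraction of the original's shingles that also appear in the reply."""
    so, sr = shingles(original, k), shingles(reply, k)
    if not so:
        return 0.0
    return len(so & sr) / len(so)


original = "can solr return near duplicate documents"
reply = "yes it can " + original  # a reply that quotes the original in full

print(containment(original, reply))  # share of the original found in the reply
print(jaccard(original, reply))      # overall overlap between the two mails
```

Containment suits the reply-quotes-original case, since a long reply can
contain the whole original and still score low on Jaccard. Any threshold for
calling two documents near dupes would have to be tuned on real data.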

If we want such a capability in Solr, can we use MoreLikeThisHandler, or is
there a more appropriate handler? What is the best way to achieve this
functionality?
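
For reference, a rough sketch of what a MoreLikeThisHandler request might
look like, built in Python. It assumes the handler is registered at the path
/mlt in solrconfig.xml, and that the index has a text field named "body" and
a key field named "id"; the base URL, field names, document id, and the
helper build_mlt_url are all hypothetical. Note that MoreLikeThis ranks
documents by shared interesting terms, so it does not by itself report a
percentage of overlap.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_query, fields, rows=10):
    """Build a MoreLikeThis request URL (assumes the handler lives at /mlt)."""
    params = {
        "q": doc_query,              # query selecting the source document
        "mlt.fl": ",".join(fields),  # fields to compare for similarity
        "mlt.mintf": 1,              # minimum term frequency in the source doc
        "mlt.mindf": 1,              # minimum document frequency in the index
        "rows": rows,                # number of similar documents to return
    }
    return base_url.rstrip("/") + "/mlt?" + urlencode(params)


# Hypothetical Solr location and document id, for illustration only.
print(build_mlt_url("http://localhost:8983/solr", "id:42", ["body"]))
```

The results would still need a second pass (for example a shingle or
fingerprint comparison) to decide which of the similar documents are close
enough to count as near dupes.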

Regards,
Eswar

On Nov 18, 2007 9:06 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:

> I'm not sure I understand your question...
>
> A "near duplicate document" could mean a LOT of things depending on the
> context.
>
> perhaps you just need "fuzzy searching"?
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Fuzzy%20Searches
>
> or "proximity searches"?
>
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches
>
>
> MoreLikeThisHandler (added in 1.3-dev) may be able to help, but it is
> used to search for other similar documents based on the results of
> another query.
>
> ryan
>
>
> rishabh9 wrote:
> > Can anyone help me?
> >
> > Rishabh
> >
> >
> > rishabh9 wrote:
> >> Hi,
> >>
> >> I am evaluating "Solr 1.2" for my project and wanted to know if it can
> >> return near duplicate documents (near dups) and how do i go about it? I am
> >> not sure, but is "MoreLikeThisHandler" the implementation for near dups?
> >>
> >> Rishabh
> >>
> >>
> >
>
>
