Re: Near Duplicate Documents

Ryan McKinley Sun, 18 Nov 2007 08:19:15 -0800

Eswar K wrote:

We have a scenario where we want to find documents that are similar in
content. To elaborate a little more on what we mean, let's take an
example.


This email chain is a good illustration of the concept of near dupes (not
to be confused with threads; they are two different things). Each email in
this thread is treated as a document by the system. A reply to the original
mail also includes the original mail, in which case it becomes a near
duplicate of the original mail (depending on the percentage of similarity).
The same holds for each subsequent reply. Near dupes need not be limited to
emails.

If we want such a capability using Solr, can we use MoreLikeThisHandler,
or is there another handler in Solr that is more appropriate? What is the
best way to achieve this functionality?

Mess around with the MoreLikeThisHandler and see if it gives you what you are looking for.


Check:
http://wiki.apache.org/solr/MoreLikeThis

For your example, you would want to make sure that the 'type' field ("email") is in the mlt.fl param. Perhaps: mlt.fl=type,content
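
As a rough sketch of what such a request could look like, the Python snippet below only builds the query URL and does not contact a server. The host, the /mlt handler path (it depends on how the handler is registered in your solrconfig.xml), the document id "email-42", and the helper name build_mlt_url are all assumptions for illustration; 'type' and 'content' are the field names suggested above and need to match your schema.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_query, fields=("type", "content"), rows=5):
    """Build a MoreLikeThis request URL (hypothetical helper, no network call)."""
    params = {
        "q": doc_query,              # query selecting the source document
        "mlt.fl": ",".join(fields),  # fields compared for similarity
        "mlt.mintf": 1,              # minimum term frequency in the source doc
        "mlt.mindf": 1,              # minimum document frequency in the index
        "rows": rows,                # how many similar documents to return
        "wt": "json",                # response format
    }
    # Assumption: the MoreLikeThisHandler is registered at /mlt.
    return f"{base_url}/mlt?{urlencode(params)}"


# Assumed host and document id, for illustration only.
url = build_mlt_url("http://localhost:8983/solr", "id:email-42")
print(url)
```

Raising mlt.mintf and mlt.mindf above 1 should make the match stricter, which may be closer to what you want for near dupes than for loosely related documents.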
