We have a scenario where we want to find documents that are similar in content. To elaborate on what we mean, let's take an example.
This email chain itself is a good illustration of near dupes (not to be confused with threads, which are a different thing). Each email in this thread is treated as a document by the system. A reply to the original mail also includes the original mail, so it becomes a near duplicate of the original (depending on the percentage of similarity), and the same holds for each later reply. Near dupes need not be limited to emails.

If we want this capability in Solr, can we use MoreLikeThisHandler, or is there another, more appropriate handler? What is the best way to achieve this?

Regards,
Eswar

On Nov 18, 2007 9:06 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:

> I'm not sure I understand your question...
>
> A "near duplicate document" could mean a LOT of things depending on the
> context.
>
> perhaps you just need "fuzzy searching"?
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Fuzzy%20Searches
>
> or "proximity searches"?
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches
>
> MoreLikeThisHandler (added in 1.3-dev) may be able to help, but it is
> used to search for other similar documents based on the results of
> another query.
>
> ryan
>
> rishabh9 wrote:
> > Can anyone help me?
> >
> > Rishabh
> >
> > rishabh9 wrote:
> >> Hi,
> >>
> >> I am evaluating "Solr 1.2" for my project and wanted to know if it can
> >> return near duplicate documents (near dups) and how do i go about it? I am
> >> not sure, but is "MoreLikeThisHandler" the implementation for near dups?
> >>
> >> Rishabh
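
For reference, the "percentage of similarity" we have in mind could be measured roughly as in the sketch below. This is only an illustration in Python, not something Solr provides and not necessarily how a real system would do it: it splits each document into overlapping word sequences (shingles) and computes the Jaccard overlap of the two sets. The function names, the shingle size of 3, and the 0.5 threshold are all our own assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short to form a full shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity (0.0 to 1.0) between the shingle sets of two texts."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are considered identical
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.5, k=3):
    """Hypothetical rule: near duplicates share at least `threshold` of their shingles."""
    return jaccard(a, b, k) >= threshold


# A reply that quotes the original mail in full, as in this thread.
original = "Can Solr return near duplicate documents and how do I go about it"
reply = "I am not sure. " + original
unrelated = "Completely unrelated text about something else entirely"
```

With these inputs, `is_near_duplicate(original, reply)` holds because the reply contains every shingle of the original, while `unrelated` shares none. Comparing every pair of documents this way would not scale to a large index, which is why we are asking whether Solr has a handler suited to this.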