Have you checked out the deduplication process that's available at
indexing time? It includes a fuzzy hash algorithm.

http://wiki.apache.org/solr/Deduplication
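
To illustrate why an exact digest is fragile and a fuzzy measure is not, here is a rough Python sketch. It is not how Solr implements its signatures; the function names, the 3-word shingle size, and the use of Jaccard similarity are my own assumptions for illustration, and the 0.95 threshold is taken from the "95% similar" figure mentioned below.

```python
import hashlib


def exact_digest(text: str) -> str:
    """MD5 of the raw text; any byte-level change yields a different digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text: str, size: int = 3) -> set:
    """Overlapping word n-grams, after lowercasing and whitespace normalization."""
    words = text.lower().split()
    if len(words) < size:
        # Too short to form a full shingle: treat the whole text as one.
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a: str, b: str, threshold: float = 0.95) -> bool:
    """True when the documents are at least `threshold` similar (hypothetical cutoff)."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    doc_a = "Solr can remove duplicate documents at indexing time"
    doc_b = "Solr can remove  duplicate documents at indexing time"  # one extra space

    print(exact_digest(doc_a) == exact_digest(doc_b))  # the digests differ
    print(similarity(doc_a, doc_b))                    # the shingle sets are identical
    print(is_near_duplicate(doc_a, doc_b))
```

The extra space changes the MD5 digest but not the shingle sets, so the two texts still count as duplicates. Comparing every pair of documents this way does not scale to a large index, which is the reason for computing a fuzzy signature once per document at indexing time instead.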

-Simon

On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
> This approach would definitely work if the two documents are *exactly* the
> same, but it is very fragile: even one extra space changes the whole
> hash. What I am really looking for is some percentage similarity between
> documents, so that I can remove those which are more than 95% similar.
>
> *Pranav Prakash*
>
> "temet nosce"
>
> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> Google <http://www.google.com/profiles/pranny>
>
>
> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
>
>> What you need to do is calculate a hash of each document (using any message
>> digest algorithm you want: MD5, SHA-1, and so on), then read up on Solr's
>> field collapsing capabilities. It should not be too complicated.
>>
>> *Omri Cohen*
>>
>> ---------- Forwarded message ----------
>> From: Pranav Prakash <pra...@gmail.com>
>> Date: Thu, Jun 23, 2011 at 12:26 PM
>> Subject: Removing duplicate documents from search results
>> To: solr-user@lucene.apache.org
>>
>>
>> How can I remove very similar documents from search results?
>>
>> My scenario is that the index contains documents which are nearly
>> identical (people submitting the same content multiple times, or different
>> people submitting the same content). When a search is performed for
>> "keyword", the same document quite frequently comes up multiple times in
>> the top N results. I want to remove those duplicate (or possibly duplicate)
>> documents, much as Google does when it says "In order to show you most
>> relevant result, duplicates have been removed". How can I achieve this
>> using Solr? Does Solr have a built-in feature or a plugin that could
>> help me with it?
>>
>>
>> *Pranav Prakash*
>>
>> "temet nosce"
>>
>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
>> Google <http://www.google.com/profiles/pranny>
>>
>
