Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Yeah, I read the overview, which suggests that duplicates can be prevented from entering the index, and scanned the rest; it does not look like you can actually drop the document entirely. Maybe I am missing something here. François On Jun 28, 2011, at 9:14 AM, Mohammad Shariq wrote: > Hey Franç

Re: Removing duplicate documents from search results

2011-06-28 Thread Paul Libbrecht
Mohammad, just in case you meant it, I would like to discourage you from trying to deduplicate *the search result*. Many things go wrong if you do that; we had it in one version of the ActiveMath search environment (which uses Lucene): - paging is inappropriate - total count is wrong u
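The two problems named above (paging and total count) can be illustrated with a small sketch. It is not the ActiveMath code; it assumes a hypothetical result list in which each hit carries a `sig` field identifying its content, and it deduplicates each page after the engine has already cut the pages.

```python
def page(results, start, rows):
    """Return one page of hits, the way a search engine would slice them."""
    return results[start:start + rows]


def dedupe(docs):
    """Keep the first hit for each signature, preserving order."""
    seen, out = set(), []
    for doc in docs:
        if doc["sig"] not in seen:
            seen.add(doc["sig"])
            out.append(doc)
    return out


# Hypothetical hits: "a" and "c" each occur twice.
results = [{"id": i, "sig": s} for i, s in enumerate(["a", "a", "b", "c", "c", "d"])]

total_reported = len(results)          # what the engine reports as the hit count
total_distinct = len(dedupe(results))  # what the user can actually see

page1 = dedupe(page(results, 0, 3))    # asked for 3 rows, fewer survive
page2 = dedupe(page(results, 3, 3))

print(total_reported, total_distinct, len(page1), len(page2))
```

Each page comes back shorter than the requested three rows, and the reported total no longer matches the number of distinct documents, which is why removing duplicates at index time is usually the safer route.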

Re: Removing duplicate documents from search results

2011-06-28 Thread Mohammad Shariq
Hey François, thanks for your suggestion. I followed the same link (http://wiki.apache.org/solr/Deduplication); the solutions there either make the hash the uniqueKey or overwrite on duplicate, and I don't need either. I need discard on duplicate. > > > I have not used it but it looks like it will

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Indeed, take a look at this: http://wiki.apache.org/solr/Deduplication I have not used it but it looks like it will do the trick. François On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote: > I found the deduplication thing really useful. Although I have not yet > started to wo

Re: Removing duplicate documents from search results

2011-06-28 Thread Pranav Prakash
I found the deduplication thing really useful, although I have not yet started to work on it, as there is some other low-hanging fruit I have to pick first. Will share my thoughts soon. Pranav Prakash "temet nosce"

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Maybe there is a way to get Solr to reject documents that already exist in the index, but I doubt it; maybe someone else can chime in here. You could do a search for each document prior to indexing it to see if it is already in the index, but that is probably non-optimal; maybe it is easiest t

Re: Removing duplicate documents from search results

2011-06-28 Thread Mohammad Shariq
I am making the hash from the URL, but I can't use it as the uniqueKey because I am using a UUID as the uniqueKey. Since I am using Solr as the index engine only and Riak (key-value storage) as the storage engine, I don't want to overwrite on duplicate. I just need to discard the duplicates. 2011/6/28 F
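The "discard on duplicate" behaviour asked for here can be sketched outside Solr. This is a minimal in-memory stand-in, not a Solr feature or API: the class `DiscardingIndexer`, its `add` method, and the `url_hash` field name are all hypothetical. The UUID stays the unique key, and the URL hash is only used to decide whether a document is indexed at all.

```python
import hashlib
import uuid


class DiscardingIndexer:
    """In-memory stand-in for an index that discards documents with a known URL."""

    def __init__(self):
        self.docs = {}            # UUID (unique key) -> document
        self.seen_hashes = set()  # URL hashes already indexed

    def add(self, url, body):
        """Index the document unless its URL was seen before.

        Returns the new UUID, or None if the document was discarded.
        """
        url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if url_hash in self.seen_hashes:
            return None  # duplicate: discard, the existing document is untouched
        doc_id = str(uuid.uuid4())
        self.seen_hashes.add(url_hash)
        self.docs[doc_id] = {"id": doc_id, "url": url, "url_hash": url_hash, "body": body}
        return doc_id


indexer = DiscardingIndexer()
first = indexer.add("http://example.com/a", "original text")
second = indexer.add("http://example.com/a", "re-crawled text")
print(first is not None, second, len(indexer.docs))
```

Because the first document keeps its UUID, any record stored under that key in an external store stays valid, which an overwrite-on-duplicate policy would not guarantee.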

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Create a hash from the URL and use that as the unique key; MD5 or SHA-1 would probably be good enough. Cheers François On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote: > I also have the problem of duplicate docs. > I am indexing news articles, Every news article will have the source URL, > I
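A minimal sketch of the URL-hash idea, using Python's standard `hashlib`. The helper name `url_key` is hypothetical, and stripping surrounding whitespace before hashing is an added assumption, not something the thread specifies.

```python
import hashlib


def url_key(url, algorithm="sha1"):
    """Return a hex digest of the URL, suitable as a document key."""
    normalized = url.strip()  # assumption: ignore stray surrounding whitespace
    return hashlib.new(algorithm, normalized.encode("utf-8")).hexdigest()


a = url_key("http://example.com/news/1")
b = url_key("http://example.com/news/1 ")
c = url_key("http://example.com/news/2")

print(a == b, a == c, len(a), len(url_key("http://example.com/news/1", "md5")))
```

The same URL always yields the same key, so a second submission of that article maps onto the existing key instead of creating a new document.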

Re: Removing duplicate documents from search results

2011-06-28 Thread Mohammad Shariq
I also have the problem of duplicate docs. I am indexing news articles, and every news article has a source URL. If two news articles have the same URL, only one needs to be indexed, so I need removal of duplicates at index time. On 23 June 2011 21:24, simon wrote: > have you checked out the deduplication pr

Re: Removing duplicate documents from search results

2011-06-23 Thread simon
Have you checked out the deduplication process that's available at indexing time? This includes a fuzzy hash algorithm. http://wiki.apache.org/solr/Deduplication -Simon On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash wrote: > This approach would definitely work is the two documents are *Exact

Re: Removing duplicate documents from search results

2011-06-23 Thread pravesh
Would you even want to index the duplicate documents? Finding duplicates in content fields is not as easy as in an untokenized/keyword field. Maybe you could do this filtering at indexing time, before sending the document to Solr. Then the question comes, which one document should go(from a

Re: Removing duplicate documents from search results

2011-06-23 Thread Pranav Prakash
This approach would definitely work if the two documents are *exactly* the same, but it is very fragile. Even if one extra space has been added, the whole hash would change. What I am really looking for is some percentage similarity between documents, and to remove those documents which are more than 95%
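One way to get a percentage similarity instead of an exact hash match is Jaccard similarity over word shingles. This is a swapped-in illustration, not the fuzzy hash that Solr's deduplication uses; the function names and the shingle size `k=3` are assumptions, and the 0.95 threshold mirrors the 95% figure above.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles; whitespace runs and case are ignored."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the two documents' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.95):
    """True when the documents are at least `threshold` similar."""
    return jaccard(a, b) >= threshold


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick  brown fox jumps over the lazy dog"  # one extra space
doc3 = "a completely different article about search engines and indexing"

print(jaccard(doc1, doc2), is_near_duplicate(doc1, doc3))
```

Because the text is split on whitespace before comparison, the extra space that would change an exact hash has no effect here. Comparing every pair of documents this way is quadratic, so a real index would need a signature-based shortcut.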

Re: Removing duplicate documents from search results

2011-06-23 Thread Omri Cohen
What you need to do is calculate a hash (using any message digest algorithm you want: MD5, SHA-1, and so on), then do some reading on Solr's field collapsing capabilities. It should not be too complicated. *Omri Cohen* Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295
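The hash-then-collapse idea can be sketched in plain Python. This only emulates what collapsing on a hash field does; it is not Solr's field collapsing API. The function names and the `text` field are hypothetical, and the input list is assumed to be sorted by relevance already.

```python
import hashlib
from collections import OrderedDict


def content_hash(text):
    """MD5 digest of the document text, used as the collapse key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def collapse(results):
    """Group hits by content hash, keeping the first (best-ranked) hit per group."""
    groups = OrderedDict()
    for doc in results:
        key = content_hash(doc["text"])
        group = groups.setdefault(key, {"doc": doc, "count": 0})
        group["count"] += 1
    return list(groups.values())


# Hypothetical hits, best match first; ids 1 and 2 have identical text.
results = [
    {"id": 1, "text": "storm hits the coast"},
    {"id": 2, "text": "storm hits the coast"},
    {"id": 3, "text": "election results announced"},
]
collapsed = collapse(results)
print([(g["doc"]["id"], g["count"]) for g in collapsed])
```

An exact digest only groups byte-identical text, so this handles exact duplicates but not near-duplicates.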