Have you considered removing them at index time? See:
http://wiki.apache.org/solr/Deduplication
Best
Erick
On Fri, Nov 25, 2011 at 3:13 PM, Ted Dunning wrote:
See http://en.wikipedia.org/wiki/Locality-sensitive_hashing
The obvious thought that I had just after hitting send was that you could
put the LSH signatures on the documents. That would let you do the scan at
low volume, and using LSH would make the duplicate scan almost as fast as
your score scan.
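For anyone unfamiliar with the idea, here is a rough sketch of what "putting LSH signatures on the documents" could look like. It uses MinHash over word shingles with banding, which is one common LSH scheme and not necessarily the one meant above. The names (shingles, minhash_signature, lsh_keys) and the sizes (32 hashes, 8 bands, 3-word shingles) are made up for illustration and are not part of Solr.

```python
import hashlib
import re

NUM_HASHES = 32  # signature length; an assumed value, tune for your corpus
BANDS = 8        # number of LSH bands; each band covers NUM_HASHES // BANDS rows


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, tokenized text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Seeded 64-bit hash, standing in for a family of hash functions."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    tokens = shingles(text)
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES))


def lsh_keys(signature):
    """Split the signature into bands and hash each band to a short key.

    Two documents that share at least one band key are candidate near-duplicates.
    """
    rows = NUM_HASHES // BANDS
    keys = []
    for band in range(BANDS):
        chunk = signature[band * rows:(band + 1) * rows]
        digest = hashlib.sha1(repr((band, chunk)).encode("utf-8")).hexdigest()
        keys.append(digest[:16])
    return keys
```

The band keys could be stored in a multi-valued string field on each document (a hypothetical field, say lsh_band), so that candidates for the duplicate scan are found by exact match on that field instead of comparing full texts.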
Thanks. I did consider post-processing and may wind up doing that; I was
hoping there was a way to have Solr do it for me! That I have to ask this
question is probably not a good sign, but what is LSH clustering?
On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning wrote:
You can do that pretty easily by just retrieving extra documents and
post-processing the results list.
You are likely to have a significant number of apparent duplicates this
way.
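A minimal sketch of that post-processing step, assuming the client has already fetched more rows than it needs and has each hit as a dict. The function name dedupe_results and the "id"/"score" keys are hypothetical. Collapsing on the score relies on the original poster's observation that duplicates share a score; unrelated documents that happen to tie would be collapsed too, so a content signature is a safer key where one is available.

```python
def dedupe_results(results, wanted, key=lambda doc: doc["score"]):
    """Keep the first hit for each key, in ranked order, up to `wanted` hits.

    `results` is the over-fetched, already ranked list of hits.
    `key` decides what counts as a duplicate; the default (the score)
    is an assumption taken from the question, not a general rule.
    """
    seen = set()
    unique = []
    for doc in results:
        k = key(doc)
        if k in seen:
            continue  # same key as an earlier, higher-ranked hit
        seen.add(k)
        unique.append(doc)
        if len(unique) == wanted:
            break
    return unique


# Example: ask for 3 unique hits out of an over-fetched list of 6.
hits = [
    {"id": "a", "score": 2.5},
    {"id": "a-redirect", "score": 2.5},
    {"id": "b", "score": 1.9},
    {"id": "c", "score": 1.2},
    {"id": "c-copy", "score": 1.2},
    {"id": "d", "score": 0.7},
]
top = dedupe_results(hits, wanted=3)
```

If fewer than `wanted` unique hits come back, the fetch was too small and the query needs to be re-run with a larger row count.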
To really get rid of duplicates in results, it might be better to remove
them from the corpus by deploying something
I have a corpus that has a lot of identical or nearly identical documents.
I'd like to return only the unique ones (excluding the "nearly identical"
which are redirects). I notice that all the identical/nearly identicals
have identical Solr scores. How can I tell Solr to throw out all the