: I try to implement NearDup detection by  SimHash

I'm not really familiar with simhash, but based on your description of it, 
I'm not sure that any of Solr's deduplication, grouping, or collapsing 
features will really help you here...

: 1) each document has a field /simhash_signature/ that stores a sequence of
: bits.
: 2) that in order to be considered NearDup, documents must have, at most, 2
: bits that differ in /simhash_signature/
: 
: *My question:*
: How can I get groups of nearDup by /simhash_signature/?

The problem here is that there is no transitive property in your 
definition of a "NearDup": as you point out in your example, B & D are 
both "NearDups" of A, but B & D are not NearDups of each other.
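
To make that concrete, here is a small Python sketch (plain Python, not 
Solr code) that checks every pair from your example. The helper names 
hamming and is_near_dup are mine, and the threshold of 2 is taken from 
your description:

```python
from itertools import combinations

# Signatures copied from the example in the original question.
SIGNATURES = {"A": "0001000", "B": "1000000", "C": "1111111", "D": "0101000"}
MAX_DIFF = 2  # "at most 2 bits that differ"


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_near_dup(a: str, b: str) -> bool:
    """True if the two signatures differ in at most MAX_DIFF bits."""
    return hamming(a, b) <= MAX_DIFF


for left, right in combinations(sorted(SIGNATURES), 2):
    dist = hamming(SIGNATURES[left], SIGNATURES[right])
    label = "NearDup" if dist <= MAX_DIFF else "not NearDup"
    print(f"{left} vs {right}: {dist} bits differ ({label})")
```

A vs B differ in 2 bits and A vs D in 1 bit, but B vs D differ in 3 bits, 
so no single group key can put B and D in A's group without also treating 
them as duplicates of each other.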

Some sort of transitive relationship (either in terms of an identical 
field value, or a function that can produce identical results for all 
documents in a group) is necessary to use Solr's de-duplication, 
collapsing, or grouping functionality.

Assuming you wanted results like those below, and you had some existing 
"query + sort" that would identify the "main" document result set (the 
"Doc A", "Doc B", "Doc C", "Doc D" list in that order), you could, in 
theory, write a custom DocTransformer that could annotate those documents 
with a list of doc IDs that had "NearDup" values for some field (possibly 
doing strdist, or some other more efficient binary bit set diff as a 
ValueSource).

If you wanted to pursue implementing a DocTransformer like this as a 
plugin, the existing ChildDocTransformerFactory might be a good starting 
point for some code to study.
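
As a rough illustration of what such a transformer would have to compute 
for each document, here is a plain Python sketch. It is not Solr plugin 
code: annotate_near_dups, bit_diff, and the near_dups key are names made 
up for this example, and it assumes the signatures are available as bit 
strings and that a brute-force comparison over one page of results is 
acceptable.

```python
def bit_diff(a: int, b: int) -> int:
    """Number of differing bits between two signatures (XOR, then popcount)."""
    return bin(a ^ b).count("1")


def annotate_near_dups(docs, max_diff=2):
    """Return copies of docs (in the same order) with a "near_dups" ID list.

    docs: list of dicts with an "id" and a "simhash_signature" bit string.
    This is O(n^2) in the number of docs, so it only suits small result pages.
    """
    # Parse each bit string once so the inner loop is a cheap integer XOR.
    sigs = [(d["id"], int(d["simhash_signature"], 2)) for d in docs]
    annotated = []
    for doc, (doc_id, sig) in zip(docs, sigs):
        near = [
            other_id
            for other_id, other_sig in sigs
            if other_id != doc_id and bit_diff(sig, other_sig) <= max_diff
        ]
        annotated.append({**doc, "near_dups": near})
    return annotated


# The "main" result set from the example, in its query + sort order.
results = [
    {"id": "A", "simhash_signature": "0001000"},
    {"id": "B", "simhash_signature": "1000000"},
    {"id": "C", "simhash_signature": "1111111"},
    {"id": "D", "simhash_signature": "0101000"},
]

for doc in annotate_near_dups(results):
    print(doc["id"], "->", doc["near_dups"])
```

Each document keeps its position in the result list and simply gains a 
list of near-duplicate IDs, which is why this fits a per-document 
annotation better than grouping or collapsing.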

: *Examples:*
:   Input:
:     Doc A = 0001000
:     Doc B = 1000000
:     Doc C = 1111111
:     Doc D = 0101000
:   Output:
:     A -> {B, D}
:     B -> {A}
:     C -> {}
:     D -> {A}


-Hoss
http://www.lucidworks.com/
