I am trying to implement near-duplicate (NearDup) detection in Solr using the
SimHash algorithm <https://moz.com/devblog/near-duplicate-detection/>.
Let's say:
1) each document has a field /simhash_signature/ that stores a sequence of
bits.
2) two documents are considered NearDup if their /simhash_signature/ values
differ in at most 2 bits (i.e. their Hamming distance is <= 2).


*My question:*
How can I get, for each document, the group of its NearDup documents based on
/simhash_signature/?

*Examples:*
  Input:
    Doc A = 0001000
    Doc B = 1000000
    Doc C = 1111111
    Doc D = 0101000
  Output:
    A -> {B, D}
    B -> {A}
    C -> {}
    D -> {A}
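
To make the expected output above concrete, here is a minimal brute-force
sketch in plain Python, outside Solr. It is not a Solr feature or query; the
names hamming_distance and near_dup_groups are hypothetical, and it assumes
every signature is a string of 0/1 characters of the same length. It compares
every pair of documents, so it only illustrates the grouping rule and would not
scale to a large index.

```python
from itertools import combinations

MAX_DIFF_BITS = 2  # threshold from the question: at most 2 differing bits


def hamming_distance(sig_a: str, sig_b: str) -> int:
    """Count the bit positions in which two equal-length bit strings differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    # XOR leaves a 1 in every position where the two signatures differ.
    return bin(int(sig_a, 2) ^ int(sig_b, 2)).count("1")


def near_dup_groups(signatures: dict, max_diff: int = MAX_DIFF_BITS) -> dict:
    """Map each document id to the set of ids within max_diff bits of it."""
    groups = {doc_id: set() for doc_id in signatures}
    # Brute force: check each unordered pair once, O(n^2) comparisons.
    for (id_a, sig_a), (id_b, sig_b) in combinations(signatures.items(), 2):
        if hamming_distance(sig_a, sig_b) <= max_diff:
            groups[id_a].add(id_b)
            groups[id_b].add(id_a)
    return groups


docs = {"A": "0001000", "B": "1000000", "C": "1111111", "D": "0101000"}
for doc_id, dups in near_dup_groups(docs).items():
    print(doc_id, "->", sorted(dups))
```

With the four example documents this yields the same groups as the Output
listed above: A and B differ in 2 bits, A and D in 1 bit, B and D in 3 bits,
and C differs from every other document in at least 5 bits.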



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Grouping-by-simhash-signature-tp4243236.html
Sent from the Solr - User mailing list archive at Nabble.com.
