On Wed, Dec 2, 2015 at 9:00 PM, Nickolay41189 <klin892...@yandex.ru> wrote:
> I try to implement NearDup detection by SimHash
> <https://moz.com/devblog/near-duplicate-detection/> algorithm in Solr.
> Let's say:
> 1) each document has a field /simhash_signature/ that stores a sequence of
> bits.
> 2) that in order to be considered NearDup, documents must have, at most, 2
> bits that differ in /simhash_signature/
>
>
> *My question:*
> How can I get groups of nearDup by /simhash_signature/?
>
> *Examples:*
> Input:
> Doc A = 0001000
> Doc B = 1000000
> Doc C = 1111111
> Doc D = 0101000
> Output:
> A -> {B, D}
> B -> {A}
> C -> {}
> D -> {A}

I'm not sure if this is the best solution (or, indeed, if it is possible at
all), but maybe you could store the bit fields as strings, then use the
strdist function to find the Levenshtein distance between the strings and
group by that.

-- 
Nikola Smolenski
University of Belgrade
University library ''Svetozar Markovic''
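
To illustrate the grouping itself, here is a minimal sketch done outside Solr, in plain Python. It does not use strdist or any Solr API. It counts differing bit positions directly (Hamming distance), which is what the "at most 2 bits differ" rule in the question describes. Levenshtein distance is not the same measure: for equal-length strings it never exceeds the Hamming distance, so it can report a smaller number (0101010 and 1010101 differ in 7 bits but are only 2 edits apart). The names hamming, near_dup_groups and max_bits are hypothetical, chosen for this example only.

```python
from itertools import combinations


def hamming(a, b):
    """Count the bit positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    # XOR leaves a 1 in every position where the two signatures disagree.
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def near_dup_groups(signatures, max_bits=2):
    """Map each document id to the ids whose signature differs in <= max_bits bits.

    `signatures` is a dict of document id -> bit string (an assumption of
    this sketch; in Solr the values would come from the simhash_signature field).
    """
    groups = {doc_id: set() for doc_id in signatures}
    # Compare every unordered pair once and record the match in both directions.
    for (id_a, sig_a), (id_b, sig_b) in combinations(signatures.items(), 2):
        if hamming(sig_a, sig_b) <= max_bits:
            groups[id_a].add(id_b)
            groups[id_b].add(id_a)
    return groups


# The example documents from the question.
docs = {"A": "0001000", "B": "1000000", "C": "1111111", "D": "0101000"}
groups = near_dup_groups(docs)
for doc_id in sorted(groups):
    print(doc_id, "->", sorted(groups[doc_id]))
```

On the four example documents this gives the groups listed in the question. The pairwise loop is quadratic in the number of documents, so it is only meant to show the grouping rule on a small set, not to scale to a full index.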