On Wed, Dec 2, 2015 at 9:00 PM, Nickolay41189 <klin892...@yandex.ru> wrote:
> I am trying to implement near-duplicate (NearDup) detection with the SimHash
> <https://moz.com/devblog/near-duplicate-detection/> algorithm in Solr.
> Let's say:
> 1) each document has a field /simhash_signature/ that stores a sequence of
> bits.
> 2) two documents are considered NearDup if their /simhash_signature/ values
> differ in at most 2 bits
>
>
> *My question:*
> How can I get groups of NearDup documents by /simhash_signature/?
>
> *Examples:*
>   Input:
>     Doc A = 0001000
>     Doc B = 1000000
>     Doc C = 1111111
>     Doc D = 0101000
>   Output:
>     A -> {B, D}
>     B -> {A}
>     C -> {}
>     D -> {A}
I'm not sure if this is the best solution (or, indeed, if it is possible
at all), but maybe you could store the bit fields as strings, then use
the strdist function to compute the Levenshtein (edit) distance between
the strings and group by that.
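
To make the expected grouping concrete, here is a minimal sketch in plain
Python (outside Solr) that reproduces the example from the question. The
names hamming and near_dup_groups are hypothetical helpers, not part of any
Solr API. Note that the sketch counts differing bit positions (Hamming
distance), which is what the question asks for. Levenshtein distance on
equal-length strings never exceeds the Hamming distance but can be smaller:
"0101" and "1010" differ in 4 bits, yet their edit distance is only 2. So
edit distance may report some pairs as near-duplicates that are not. Also,
as far as I know, strdist returns a normalized similarity score rather than
a raw edit count, so the 2-bit threshold would have to be converted.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sum(x != y for x, y in zip(a, b))


def near_dup_groups(signatures: dict, max_bits: int = 2) -> dict:
    """Map each document to the set of documents within max_bits of it.

    This compares every pair, so it is quadratic in the number of
    documents; it is meant only to illustrate the expected output.
    """
    groups = {name: set() for name in signatures}
    for (n1, s1), (n2, s2) in combinations(signatures.items(), 2):
        if hamming(s1, s2) <= max_bits:
            groups[n1].add(n2)
            groups[n2].add(n1)
    return groups


# Signatures taken from the example in the question.
docs = {
    "A": "0001000",
    "B": "1000000",
    "C": "1111111",
    "D": "0101000",
}
groups = near_dup_groups(docs)
for name in sorted(groups):
    print(name, "->", sorted(groups[name]))
```

With the four signatures above, this yields the same groups as the Output
listed in the question.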

-- 
Nikola Smolenski

University of Belgrade
University library ''Svetozar Markovic''
