On Wed, 2015-12-02 at 13:00 -0700, Nickolay41189 wrote:
> I try to implement NearDup detection by  SimHash
> <https://moz.com/devblog/near-duplicate-detection/>   algorithm in Solr. 
[...]
> How can I get groups of nearDup by /simhash_signature/?

You could follow the suggested recipe at the page you linked to and
remove the false positives as part of post-processing? Unless you have a
lot of documents that are at the edge between not-similar-enough and
similar-enough, that should be efficient.
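
A minimal sketch of that post-processing step, assuming the full SimHash
is kept as an integer next to each candidate the search returns. The
function names and the max_distance threshold of 3 bits are illustrative
assumptions, not something taken from the linked page:

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two SimHash values."""
    return bin(a ^ b).count("1")


def remove_false_positives(query_hash, candidates, max_distance=3):
    """Keep only candidates whose full hash is close enough to the query.

    candidates: iterable of (doc_id, simhash) pairs returned by the search.
    max_distance: hypothetical threshold in bits; tune it for your data.
    """
    return [doc_id for doc_id, simhash in candidates
            if hamming_distance(query_hash, simhash) <= max_distance]
```

The search only has to produce candidates; the exact bit comparison runs
on that small candidate list, not on the whole index.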


So if a SimHash consists of 4*16 bits: ABCD, you would store all
possible 2-part representations: [AB, AC, AD, BA, BC, BD, CA, CB, CD,
DA, DB, DC], either as String-binary (0/1) for easy debug or a bit more
packed with base 16 or 64.
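
As a rough illustration, the 12 terms could be generated like this at
index time. The sketch assumes a 64-bit SimHash held as an integer and
uses base 16 for the packed form; the helper names are made up for the
example:

```python
from itertools import permutations


def simhash_parts(simhash: int, parts: int = 4, bits_per_part: int = 16):
    """Split a SimHash into fixed-width hex strings, most significant first."""
    mask = (1 << bits_per_part) - 1
    width = bits_per_part // 4  # hex digits needed per part
    return [
        format((simhash >> (bits_per_part * (parts - 1 - i))) & mask, f"0{width}x")
        for i in range(parts)
    ]


def two_part_terms(simhash: int):
    """All ordered pairs of parts: AB, AC, AD, BA, ..., DC (12 terms)."""
    return [a + b for a, b in permutations(simhash_parts(simhash), 2)]


terms = two_part_terms(0xAAAABBBBCCCCDDDD)
# terms[0] is "aaaabbbb" (AB) and terms[-1] is "ddddcccc" (DC)
```

The terms would then go into a multi-valued string field on the document.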

At query time you would generate the same permutations for the SimHash
of the query document and issue a search for
AB OR AC OR AD OR BA OR BC OR BD OR CA OR CB OR CD OR DA OR DB OR DC
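
A small sketch of building that query string from the four parts. The
field name simhash_pairs is a placeholder for whatever multi-valued
field holds the 2-part terms:

```python
from itertools import permutations


def build_or_query(parts, field="simhash_pairs"):
    """Build one OR query over all ordered 2-part combinations.

    field is a hypothetical field name; the terms are assumed to be plain
    alphanumerics (binary, hex), so no query escaping is done here.
    """
    terms = [a + b for a, b in permutations(parts, 2)]
    return f"{field}:(" + " OR ".join(terms) + ")"


query = build_or_query(["a", "b", "c", "d"])
```

The resulting string can be passed as the q parameter of a normal select
request.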

It would even sorta-work with relevance ranking, as a match on 2/4 parts
of the SimHash would mean that 2/12 of the query clauses match (AB and
BA), while a match on 3/4 SimHash parts means that 6/12 query clauses
match.
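
The clause counts can be checked with a few lines. This is only a toy
model of clause overlap with single-letter parts, not of Solr's actual
scoring:

```python
from itertools import permutations


def pair_terms(parts):
    """Set of all ordered 2-part combinations of the given parts."""
    return {a + b for a, b in permutations(parts, 2)}


def matching_clauses(query_parts, doc_parts):
    """Number of query clauses that also occur among the document's terms."""
    return len(pair_terms(query_parts) & pair_terms(doc_parts))


query_parts = ["a", "b", "c", "d"]
two_of_four = matching_clauses(query_parts, ["a", "b", "x", "y"])    # 2
three_of_four = matching_clauses(query_parts, ["a", "b", "c", "z"])  # 6
```

A document sharing k parts matches k*(k-1) of the 12 clauses, so closer
documents collect more matching clauses.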

- Toke Eskildsen, State and University Library, Denmark

