On Wed, 2015-12-02 at 13:00 -0700, Nickolay41189 wrote:
> I try to implement NearDup detection by SimHash
> <https://moz.com/devblog/near-duplicate-detection/> algorithm in Solr.
[...]
> How can I get groups of nearDup by /simhash_signature/?

You could follow the suggested recipe at the page you linked to and remove the false positives as part of post-processing? Unless you have a lot of documents that are at the edge between not-similar-enough and similar-enough, that should be efficient.

So if a SimHash consists of 4*16 bits, ABCD, you would store all possible ordered 2-part representations:

[AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, DC]

either as String-binary (0/1) for easy debugging or a bit more packed with base 16 or 64.

At query time you would generate the same permutations and issue a search for

ab OR ac OR ad OR ba OR bc OR bd OR ca OR cb OR cd OR da OR db OR dc

It would even sorta-work with relevance ranking, as a match on 2/4 parts of the SimHash would mean that 2/12 of the query clauses match, while a match on 3/4 SimHash parts means that 6/12 query clauses match.

- Toke Eskildsen, State and University Library, Denmark
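
A rough sketch of the scheme described above, in Python, in case it helps to see it spelled out. Several things here are assumptions for illustration only and not something the recipe prescribes: the field name simhash_pairs, the hex encoding, the position label (AB_, AC_, ...) prefixed to each token so that a pair only matches the same pair of positions, and the 3-bit Hamming threshold used in the post-processing step.

```python
from itertools import permutations

PART_BITS = 16   # 4 parts of 16 bits each, as in the ABCD example
NUM_PARTS = 4
LABELS = "ABCD"


def split_parts(simhash):
    """Split a 64-bit SimHash into four 16-bit parts, most significant first."""
    mask = (1 << PART_BITS) - 1
    return [
        (simhash >> (PART_BITS * (NUM_PARTS - 1 - i))) & mask
        for i in range(NUM_PARTS)
    ]


def pair_tokens(simhash):
    """Return the 12 ordered 2-part tokens (AB, AC, ..., DC) as hex strings.

    The position label in front of each token is an assumption of this
    sketch; it keeps e.g. an AB token from matching a CD token by accident.
    """
    parts = split_parts(simhash)
    return [
        f"{LABELS[i]}{LABELS[j]}_{parts[i]:04x}{parts[j]:04x}"
        for i, j in permutations(range(NUM_PARTS), 2)
    ]


def or_query(simhash, field="simhash_pairs"):
    """Build the OR query over all 12 tokens (field name is hypothetical)."""
    return " OR ".join(f"{field}:{token}" for token in pair_tokens(simhash))


def hamming(a, b):
    """Number of differing bits between two SimHashes."""
    return bin(a ^ b).count("1")


def remove_false_positives(query_hash, candidate_hashes, max_distance=3):
    """Post-processing: keep only candidates within max_distance bits."""
    return [h for h in candidate_hashes if hamming(query_hash, h) <= max_distance]
```

At index time the 12 values from pair_tokens() would go into a multi-valued string field; at query time or_query() produces the matching OR expression, and remove_false_positives() is applied to the SimHashes of the returned documents.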