I am trying to find duplicate records using distance and phonetic algorithms. Can I use Solr for that? I have the following fields and conditions for identifying exact or possible duplicates.
1. Fields

   prefix
   suffix
   firstname
   lastname
   email (primary_email1, email2, email3)
   phone (primary_phone1, phone2, phone3)

2. Conditions

Two records are said to be exact duplicates if:

   1. IsExactMatchFunction(record1_prefix, record2_prefix)
      AND IsExactMatchFunction(record1_suffix, record2_suffix)
      AND IsExactMatchFunction(record1_firstname, record2_firstname)
      AND IsExactMatchFunction(record1_lastname, record2_lastname)
      AND IsExactMatchFunction(record1_primary_email, record2_primary_email)
      OR IsExactMatchFunction(record1_primary_phone, record2_primary_phone)

Two records are said to be possible duplicates if:

   1. IsExactMatchFunction(record1_prefix, record2_prefix)
      OR IsExactMatchFunction(record1_suffix, record2_suffix)
      OR IsExactMatchFunction(record1_firstname, record2_firstname)
      AND IsExactMatchFunction(record1_lastname, record2_lastname)
      AND IsExactMatchFunction(record1_primary_email, record2_primary_email)
      OR IsExactMatchFunction(record1_primary_phone, record2_primary_phone)
   ELSE
   2. IsFuzzyMatchFunction(record1_firstname, record2_firstname)
      AND IsExactMatchFunction(record1_lastname, record2_lastname)
      AND IsExactMatchFunction(record1_primary_email, record2_primary_email)
      OR IsExactMatchFunction(record1_primary_phone, record2_primary_phone)
   ELSE
   3. IsFuzzyMatchFunction(record1_firstname, record2_firstname)
      AND IsExactMatchFunction(record1_lastname, record2_lastname)
      AND IsExactMatchFunction(record1_any_email, record2_any_email)
      OR IsExactMatchFunction(record1_any_phone, record2_any_phone)

IsFuzzyMatchFunction() runs the distance and phonetic algorithms and compares their results with a predefined threshold. For example, if the threshold defined for firstname is 85, IsFuzzyMatchFunction() returns "true" if and only if at least one of the algorithms (distance or phonetic) returns a similarity score >= 85.

Can I use Solr for this job? If not, can you suggest how I should approach this problem? I have looked at Duke (a deduplication API), but I cannot use Duke out of the box.
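To make the intended behaviour of IsFuzzyMatchFunction() concrete, here is a minimal Python sketch of the threshold check. It is only an illustration under assumptions, not a settled design: the names `is_fuzzy_match`, `distance_score`, `phonetic_score` and `soundex` are made up for this example, the distance score uses the standard library's `difflib` ratio as a stand-in for whatever edit-distance metric is finally chosen, and the phonetic score is assumed to be all-or-nothing (100 when the Soundex codes agree, 0 otherwise).

```python
from difflib import SequenceMatcher

# Soundex digit for each consonant; vowels, h, w and y have no digit.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code, or "" for empty input."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same digit
        code = _CODES.get(c, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated digit is kept
    return (encoded + "000")[:4]


def distance_score(a: str, b: str) -> float:
    """Similarity on a 0..100 scale (difflib ratio, assumed stand-in metric)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def phonetic_score(a: str, b: str) -> float:
    """Assumed scoring: 100 if both Soundex codes exist and agree, else 0."""
    code_a, code_b = soundex(a), soundex(b)
    return 100.0 if code_a and code_a == code_b else 0.0


def is_fuzzy_match(a: str, b: str, threshold: float = 85.0) -> bool:
    """True if either the distance or the phonetic score reaches the threshold."""
    return max(distance_score(a, b), phonetic_score(a, b)) >= threshold


if __name__ == "__main__":
    print(is_fuzzy_match("Robert", "Rupert"))  # same Soundex code
    print(is_fuzzy_match("John", "Mary"))      # nothing in common
```

With this shape, each field could carry its own threshold (for example 85 for firstname), and the exact-match checks from the conditions above would simply be string equality on the normalized values.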