One possible way to do the "match" is Python's FuzzyWuzzy library:
https://github.com/seatgeek/fuzzywuzzy
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
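
To make the threshold idea from the question concrete, here is a minimal sketch of a fuzzy first-name check. It is only an illustration under a few assumptions: it uses the standard library's difflib as a stand-in for an edit-distance score (FuzzyWuzzy's fuzz.ratio gives a comparable 0 to 100 score), a plain Soundex for the phonetic side, and the names soundex, distance_score, phonetic_score and is_fuzzy_match are made up for this example.

```python
from difflib import SequenceMatcher

# Soundex digit for each consonant group; vowels, H, W and Y have no digit.
_SOUNDEX_CODES = {
    ch: digit
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6"))
    for ch in letters
}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code, or "" for empty input."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def distance_score(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def phonetic_score(a: str, b: str) -> float:
    """100 if both names share a Soundex code, otherwise 0."""
    code_a, code_b = soundex(a), soundex(b)
    return 100.0 if code_a and code_a == code_b else 0.0


def is_fuzzy_match(a: str, b: str, threshold: float = 85) -> bool:
    """True if either the distance or the phonetic score reaches threshold."""
    return max(distance_score(a, b), phonetic_score(a, b)) >= threshold


if __name__ == "__main__":
    print(is_fuzzy_match("Robert", "Rupert"))  # phonetic codes agree
    print(is_fuzzy_match("Alice", "Bob"))      # neither score is close
```

With this shape, "Robert" and "Rupert" pass on the phonetic side even though their character similarity is well under 85. The scoring function can be swapped for FuzzyWuzzy or any other distance measure without touching the threshold logic.
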
> Date: Sat, 3 Jan 2015 13:24:17 +0530
> Subject: De Duplication using Solr
> From: shanuu....@gmail.com
> To: solr-user@lucene.apache.org
>
> I am trying to find duplicate records based on distance and phonetic
> algorithms. Can I utilize Solr for that? I have the following fields and
> conditions to identify exact or possible duplicates.
>
> 1. Fields
>    prefix
>    suffix
>    firstname
>    lastname
>    email (primary_email1, email2, email3)
>    phone (primary_phone1, phone2, phone3)
> 2. Conditions:
>    Two records are said to be exact duplicates if
>
>    1. IsExactMatchFunction(record1_prefix, record2_prefix) AND
>       IsExactMatchFunction(record1_suffix, record2_suffix) AND
>       IsExactMatchFunction(record1_firstname,record2_firstname) AND
>       IsExactMatchFunction(record1_lastname,record2_lastname) AND
>       IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
>       IsExactMatchFunction(record1_primary_phone,record2_primary_primary)
>
>    Two records are said to be possible duplicates if
>
>    1. IsExactMatchFunction(record1_prefix, record2_prefix) OR
>       IsExactMatchFunction(record1_suffix, record2_suffix) OR
>       IsExactMatchFunction(record1_firstname,record2_firstname) AND
>       IsExactMatchFunction(record1_lastname,record2_lastname) AND
>       IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
>       IsExactMatchFunction(record1_primary_phone,record2_primary_primary)
>       ELSE
>    2. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND
>       IsExactMatchFunction(record1_lastname,record2_lastname) AND
>       IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
>       IsExactMatchFunction(record1_primary_phone,record2_primary_primary)
>       ELSE
>    3. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND
>       IsExactMatchFunction(record1_lastname,record2_lastname) AND
>       IsExactMatchFunction(record1_any_email,record2_any_email) OR
>       IsExactMatchFunction(record1_any_phone,record2_any_primary)
>
> IsFuzzyMatchFunction() will run the distance and phonetic algorithms
> and compare the result with a predefined threshold.
>
> For example:
>
> If the threshold defined for firstname is 85, IsFuzzyMatchFunction()
> returns "true" if and only if one of the algorithms (distance or
> phonetic) returns a similarity score >= 85.
>
> Can I use Solr to perform this job? Or can you suggest how I should
> approach this problem? I have seen Duke (De duplication API), but I
> cannot use Duke out of the box.