Checking for similar text (duplicates)

Cristian Bichis Thu, 09 Jan 2014 05:03:07 -0800

Hi,

I have an app whose search is currently based on something other than Solr. However, as scale, demand, and complexity grow, I am looking at Solr as a potentially better fit, including for some features currently implemented in the scripting layer (and so not part of search at the moment). I am not very familiar with Solr yet; I am at an early evaluation stage.

One of the current app features is to detect /whether there are/ records in the index similar to a potential new record, and /which records these are/. In other words, to check for duplicates (which are not necessarily identical but would be very close to the original). The comparison is made on a description field, which could contain a couple of hundred words (and the words are NOT in English) for each record. Of course, the comparison could be made more complex in the future, to compare 2-3 fields (a title, the description, additional keywords, etc.).

Currently this feature is implemented directly in PHP using similar_text, which for us has an advantage over levenshtein because it gives a straight % match score, and we can decide whether a record is a duplicate based on the % score returned by similar_text (e.g. if the match is over 80%, it is a duplicate). Having a score for each compared record helps me decide and tweak the threshold that I consider the boundary between duplicates and non-duplicates (I may decide the comparison is too strict and lower the threshold to 75%).
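
To make the threshold logic concrete, here is a rough sketch in Python rather than our actual PHP code. It assumes difflib's SequenceMatcher as a stand-in for similar_text (the two algorithms are related but do not return identical scores), and the function names and the record layout (a dict of id to description) are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 similarity score for two strings.

    Assumption: SequenceMatcher.ratio() is used as an analogue of PHP's
    similar_text percentage; the exact scores will differ.
    """
    return SequenceMatcher(None, a, b).ratio() * 100.0


def find_duplicates(new_description, existing, threshold=80.0):
    """Return (record_id, score) pairs scoring at or above the threshold.

    `existing` is a hypothetical mapping of record id -> description.
    Results are sorted with the closest match first.
    """
    matches = []
    for record_id, description in existing.items():
        score = similarity_percent(new_description, description)
        if score >= threshold:
            matches.append((record_id, score))
    return sorted(matches, key=lambda m: m[1], reverse=True)
```

Calling find_duplicates(text, records, threshold=75.0) instead of the default 80.0 is the kind of tweak I mean. Note that this compares the new record against every existing one, which is exactly the part that does not scale.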

Using levenshtein (in PHP) would require additional processing, so the performance benefit would be lost to this overhead. Also, in the longer term any PHP implementation of this feature would be a performance bottleneck, so it is not really a solution.

I am looking to move this "slow" operation into a more efficient environment, which is why I am considering moving this feature into the search layer.

I want to know if anyone has an efficient (working) solution based on Solr for this case. I am not sure whether fuzzy search would be enough; I haven't made a test case for this (yet).


Thank you,
Cristian
