Hi,

I have an app whose search is currently based on something other than Solr. However, as scale, demand, and complexity grow, I am looking at Solr as a potentially better fit, including for some features currently implemented in the scripting layer (so they are not part of search at the moment). I am not very familiar with Solr yet; I am at an early evaluation stage.

One of the app's current features is to detect /whether there are/ records in the index similar to a potential new record, and /which records these are/. In other words, to check for duplicates (which are not necessarily identical but would be very close to the original). The comparison is made on a description field, which can contain a couple of hundred words per record (and the words are NOT in English). Of course, the comparison could become more complex in the future and cover 2-3 fields (a title, the description, additional keywords, etc.).

Currently this feature is implemented directly in PHP using similar_text, which for us has an advantage over levenshtein because it gives a direct % match score, so we can decide whether a record is a duplicate based on the % score returned by similar_text (e.g. a match over 80% means it is a duplicate). Having a score for each compared record helps me tune the threshold that separates duplicates from non-duplicates (if I decide the comparison is too strict, I can lower the threshold to 75%).
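
To make the threshold logic concrete, here is a minimal sketch of it. It is written in Python rather than PHP, and difflib.SequenceMatcher is only a stand-in for similar_text (the two may score the same pair differently). The function names and the id -> description mapping are hypothetical, made up for illustration.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 match score for two strings.

    SequenceMatcher is a stand-in for PHP's similar_text, not a port of it;
    the two can give different scores for the same pair.
    """
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in long strings, which would distort scores for long text.
    return SequenceMatcher(None, a, b, autojunk=False).ratio() * 100.0


def find_duplicates(candidate, records, threshold=80.0):
    """Return (record_id, score) pairs at or above threshold, best first.

    `records` is a hypothetical mapping of record id -> description text.
    """
    scored = ((rid, similarity_percent(candidate, text))
              for rid, text in records.items())
    hits = [(rid, score) for rid, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Lowering the threshold argument (say from 80.0 to 75.0) is the only change needed to make the check less strict. Note that this still compares the candidate against every record one by one, which is exactly the cost I would like to avoid.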

Using levenshtein (in PHP) would require additional processing, so any performance benefit would be lost to that overhead. Also, in the longer term any PHP implementation of this feature would be a performance bottleneck, so it is not really a solution.
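
As an illustration of the kind of extra step involved (assuming it is the conversion of a raw edit distance into a percentage), here is a rough Python sketch. Dividing by the length of the longer string is one common normalization, not the only one, and the function names are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using the standard two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_percent(a: str, b: str) -> float:
    """Turn the raw distance into a 0-100 score (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return (1.0 - levenshtein(a, b) / longest) * 100.0
```

The distance computation itself is quadratic in the text length, so for descriptions of a few hundred words the normalization is the cheap part.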

I am looking to move this "slow" operation into a more efficient environment, which is why I am considering moving this feature into the search layer.

I would like to know whether anyone has an efficient (working) Solr-based solution for this case. I am not sure whether fuzzy search would be enough; I haven't built a test case for this (yet).

Thank you,
Cristian