Hi,
I have an application whose search is currently based on something
other than Solr. However, as scale, demand and complexity grow, I am
looking at Solr as a potentially better fit, including for some features
that are currently implemented in the scripting layer (so they are not
part of search at the moment). I am not yet familiar with Solr; I am at
an early evaluation stage.
One of the current application features is to detect /whether/ the index
contains records similar to a potential new record, and /which records
these are/. In other words, to check for duplicates (which are not
necessarily identical but would be very close to the original). The
comparison is made on a description field, which can contain a couple of
hundred words (and the words are NOT in English) for each record. Of
course the comparison could be made more complex in the future, to
compare 2-3 fields (a title, the description, additional keywords, etc.).
Currently this feature is implemented directly in PHP using
similar_text, which for us has an advantage over levenshtein because it
gives a percentage match score directly, and we can decide whether a
record is a duplicate based on that score (e.g. if the match is over
80%, it is a duplicate). Having a score for each compared record helps
me decide on and tweak the threshold that separates duplicates from
non-duplicates (for example, if I decide the comparison is too strict,
I can lower the threshold to 75%).
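
To make the current check concrete, here is a minimal sketch of a
similar_text-style percentage score combined with a duplicate threshold.
It is written in Python rather than PHP purely for illustration, and it
only approximates PHP's behaviour: it assumes the score is built by
taking the longest common substring, recursing on the left and right
remainders, and reporting 2 * matches / (len1 + len2) as a percentage.
The names similar_chars, similarity_percent and find_duplicates, and the
sample records, are hypothetical and not taken from the real application.

```python
def similar_chars(a: str, b: str) -> int:
    """Count matching characters: take the longest common substring,
    then recurse on the parts to its left and to its right."""
    if not a or not b:
        return 0
    best_len = best_a = best_b = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_len, best_a, best_b = k, i, j
    if best_len == 0:
        return 0
    left = similar_chars(a[:best_a], b[:best_b])
    right = similar_chars(a[best_a + best_len:], b[best_b + best_len:])
    return left + best_len + right


def similarity_percent(a: str, b: str) -> float:
    """Similarity as a percentage in the range 0..100."""
    if not a and not b:
        return 0.0
    return similar_chars(a, b) * 2 * 100.0 / (len(a) + len(b))


def find_duplicates(new_description: str, existing: dict, threshold: float = 80.0):
    """Return (record_id, score) pairs at or above the threshold, best first."""
    matches = []
    for record_id, description in existing.items():
        score = similarity_percent(new_description, description)
        if score >= threshold:
            matches.append((record_id, score))
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical sample data, only to show the call shape.
records = {1: "World", 2: "something unrelated"}
print(find_duplicates("Word", records, threshold=80.0))
```

Lowering the threshold argument (for example to 75.0) is the tweak
described above. Note that this sketch compares Unicode characters,
while PHP's similar_text compares bytes, so scores for non-English text
may differ somewhat between the two.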
Using levenshtein (in PHP) would require additional processing to turn
its edit distance into a comparable score, so any performance benefit
would be lost to this overhead. Also, in the longer term any PHP
implementation of this feature would be a performance bottleneck, so it
is not really a solution.
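
For comparison, the extra processing needed with an edit distance can be
sketched as follows. This is only an illustration in Python, not the
application's code: the function names are hypothetical, and dividing by
the length of the longer string is one common normalisation that is
assumed here, not something the current implementation prescribes.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_percent(a: str, b: str) -> float:
    """Convert the distance into a 0..100 score (assumed normalisation:
    divide by the length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return (1 - levenshtein(a, b) / longest) * 100.0


print(levenshtein_percent("World", "Word"))
```

The resulting percentage is not on the same scale as the similar_text
score, so a threshold such as 80% would have to be re-tuned if the
metric were swapped.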
I am looking to move this "slow" operation into a more efficient
environment, which is why I am considering moving this feature into the
search layer.
I would like to know if anyone has an efficient (working) Solr-based
solution for this case. I am not sure whether fuzzy search would be
enough; I haven't made a test case for this (yet).
Thank you,
Cristian