Hello Cristian,

Have you seen http://wiki.apache.org/solr/Deduplication ?

On Thu, Jan 9, 2014 at 5:01 PM, Cristian Bichis <cri...@imagis.ro> wrote:

> Hi,
>
> I have one app where the search part is currently based on something other
> than Solr. However, as the scale, demand and complexity grow, I am looking
> at Solr as a potentially better fit, including for some features currently
> implemented in the scripting layer (so they are not part of search at the
> moment). I am not very familiar with Solr at this point; I am at an early
> evaluation stage.
>
> One of the current app features is to detect /if there are/ records in the
> index similar to a potential new record, and /which records these are/. In
> other words, to check for duplicates (which are not necessarily identical
> but would be very close to the original). The comparison is made on a
> description field, which could contain a couple of hundred words (and the
> words are NOT in English) for each record. Of course, the comparison could
> be made more complex in the future, to compare 2-3 fields (a title, the
> description, additional keywords, etc.).
>
> Currently this feature is implemented directly in PHP using similar_text,
> which for us has an advantage over levenshtein because it gives a straight
> % match score, and we can decide whether a record is a duplicate based on
> the % score returned by similar_text (e.g. if the match is over 80%, it is
> a duplicate). Having a score for each compared record helps me tweak the
> limit I consider the boundary between duplicates and non-duplicates (I may
> decide the comparison is too strict and lower the threshold to 75%).
>
> Using levenshtein (in PHP) would require additional processing, so the
> performance benefit would be lost to this overhead. Also, in the longer
> term any PHP implementation of this feature would be a performance
> bottleneck, so this is not really a solution.
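For reference, the percentage-threshold check described above can be sketched roughly as follows. This is only an illustration in Python, not the original PHP code: difflib.SequenceMatcher is used as a stand-in for similar_text (its scores will not match similar_text exactly), and the names match_percent and find_duplicates, as well as the dictionary of existing records, are hypothetical.

```python
from difflib import SequenceMatcher


def match_percent(a: str, b: str) -> float:
    """Return a 0-100 similarity score for two strings.

    Stand-in for PHP's similar_text percentage; scores are comparable in
    spirit but not guaranteed to be identical.
    """
    # autojunk=False disables difflib's "popular element" heuristic, which
    # would otherwise ignore frequent characters in long descriptions.
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


def find_duplicates(new_text, existing, threshold=80.0):
    """Return (record_id, score) pairs at or above threshold, best match first.

    `existing` is a hypothetical mapping of record id -> description text.
    """
    scored = ((rid, match_percent(new_text, text)) for rid, text in existing.items())
    hits = [(rid, score) for rid, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Calling find_duplicates(description, records, threshold=75.0) corresponds to lowering the limit from 80% to 75%. Note that this still compares the new record against every existing one, so it has the same scaling problem as the PHP version; it only makes the scoring rule explicit.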
>
> I am looking to move this "slow" operation into a more efficient
> environment, which is why I considered moving this feature into the
> search part.
>
> I want to know if anyone has an efficient (working) solution based on Solr
> for this case. I am not sure if fuzzy search would be enough; I haven't
> made a test case for this (yet).
>
> Thank you,
> Cristian

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>