Re: Checking for similar text (duplicates)

Cristian Bichis Thu, 09 Jan 2014 05:42:50 -0800

Hi Mikhail,

I have seen the deduplication part as well, but I have some concerns:

1. Is deduplication also supposed to work for a check-only request (one that does not try to actually add a new record to the index)? That is, can I just check whether there "could be" duplicates of some text?

2. As far as I have seen, deduplication has some problems when comparing extremely similar items (e.g. just one character of difference). I can't find the pages mentioning this right now, but I am concerned it might not be reliable.


Cristian

Hello Cristian,

Have you seen http://wiki.apache.org/solr/Deduplication ?


On Thu, Jan 9, 2014 at 5:01 PM, Cristian Bichis <cri...@imagis.ro> wrote:

Hi,

I have an app whose search part is currently based on something other than
Solr. However, as the scale, demand and complexity grow, I am looking at
Solr as a potentially better fit, including for some features currently
implemented in the scripting layer (so they are not part of search today).
I am not very familiar with Solr yet; I am at an early evaluation stage.

One of the current app features is to detect /whether there are/ records
in the index similar to a potential new record, and /which records these
are/. In other words, to check for duplicates (which are not necessarily
identical but would be very close to the original). The comparison is made
on a description field, which could contain a couple of hundred words (and
the words are NOT in English) for each record. Of course the comparison
could be made more complex in the future, to compare 2-3 fields (a title,
the description, additional keywords, etc.).

Currently this feature is implemented directly in PHP using similar_text,
which for us has an advantage over levenshtein because it gives a direct
% match score, and we can decide whether a record is a duplicate based on
that score (e.g. if the match is over 80%, it is a duplicate). Having a
score for each compared record helps me tune the threshold that separates
duplicates from non-duplicates (I may decide the comparison is too strict
and lower the threshold to 75%).
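
For illustration, the threshold check described above can be sketched in
Python. This is only a rough sketch under assumptions: difflib.SequenceMatcher
stands in for PHP's similar_text (the two are related but do not give
identical scores), and the names similarity_percent and find_duplicates are
hypothetical, not part of the actual app.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 similarity score for two texts.

    Stand-in for PHP's similar_text percentage; scores will differ
    somewhat from the PHP function.
    """
    # autojunk=False disables the "popular element" heuristic, which can
    # skew scores on texts longer than about 200 characters.
    return SequenceMatcher(None, a, b, autojunk=False).ratio() * 100.0


def find_duplicates(candidate: str, records: dict, threshold: float = 80.0):
    """Return (record_id, score) pairs scoring above the threshold.

    `records` maps a record id to its description text (hypothetical shape).
    Results are sorted with the closest match first.
    """
    hits = []
    for record_id, description in records.items():
        score = similarity_percent(candidate, description)
        if score > threshold:  # "over 80%" means strictly greater
            hits.append((record_id, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits
```

Lowering the threshold argument (for example to 75.0) loosens the check
without touching the comparison itself. Note that this still compares the
candidate against every record, so it has the same scaling problem as the
PHP version.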

Using levenshtein (in PHP) would require additional processing, so any
performance benefit would be lost to this overhead. Also, in the longer
term any PHP implementation of this feature would be a performance
bottleneck, so it is not really a solution.
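
One plausible reading of the "additional processing" is the step that turns
an edit distance into a percentage comparable to the similar_text score. A
minimal Python sketch of that step follows; dividing by the length of the
longer string is one common normalization and is an assumption here, and
the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_percent(a: str, b: str) -> float:
    """Convert the edit distance into a 0-100 similarity score."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return (1.0 - levenshtein(a, b) / longest) * 100.0
```

The distance computation is quadratic in the text length, which is why it
gets expensive on descriptions of a few hundred words compared against many
records.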

I am looking to move this "slow" operation into a more efficient
environment, which is why I am considering moving this feature into the
search layer.

I want to know if anyone has an efficient (working) Solr-based solution
for this case. I am not sure whether fuzzy search would be enough; I
haven't made a test case for this (yet).

Thank you,
Cristian
