Re: Checking for similar text (duplicates)

Mikhail Khludnev Thu, 09 Jan 2014 05:29:51 -0800

Hello Cristian,

Have you seen http://wiki.apache.org/solr/Deduplication ?



On Thu, Jan 9, 2014 at 5:01 PM, Cristian Bichis <cri...@imagis.ro> wrote:

> Hi,
>
> I have an app whose search component is currently based on something other
> than Solr. However, as scale, demand, and complexity grow, I am looking at
> Solr as a potentially better fit, including for some features currently
> implemented in the scripting layer (and so not part of search today). I am
> not yet familiar with Solr; I am at an early evaluation stage.
>
> One of the app's current features is to detect /whether there are/ records
> in the index similar to a potential new record, and /which records these
> are/. In other words, it checks for duplicates (which are not necessarily
> identical but would be very close to the original). The comparison is made
> on a description field, which can contain a couple of hundred words (and
> the words are NOT in English) for each record. Of course, the comparison
> could become more complex in the future and cover 2-3 fields (a title, the
> description, additional keywords, etc.).
>
> Currently this feature is implemented directly in PHP using similar_text,
> which for us has an advantage over levenshtein because it gives a direct
> percentage match score, so we can decide whether a record is a duplicate
> based on that score (e.g. if the match is over 80%, it is a duplicate).
> Having a score for each compared record helps me tweak the threshold that
> separates duplicates from non-duplicates (I may decide the comparison is
> too strict and lower the threshold to 75%).
>
> Using levenshtein (in PHP) would require additional processing, so any
> performance benefit would be lost to this overhead. Also, in the longer
> term any PHP implementation of this feature would be a performance
> bottleneck, so it is not really a solution.
>
> I am looking to move this "slow" operation into a more efficient
> environment, which is why I am considering moving this feature into the
> search layer.
>
> I want to know if anyone has an efficient (working) Solr-based solution
> for this case. I am not sure whether fuzzy search would be enough; I
> haven't built a test case for this (yet).
>
> Thank you,
> Cristian
>
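
For reference, the percentage-threshold check described in the quoted message
could be sketched roughly as below. This is only an illustration in Python,
not the original PHP code: it uses difflib.SequenceMatcher from the standard
library as a stand-in for similar_text (the two will not return identical
scores), and the names similarity_percent, find_duplicates, and the layout of
the existing-records dict are assumptions made up for the example.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 similarity score for two strings."""
    # ratio() is 2*M/T, where M is the number of matching characters and
    # T is the combined length of both strings. autojunk is disabled so
    # that frequent characters in long descriptions are not ignored.
    return SequenceMatcher(None, a, b, autojunk=False).ratio() * 100.0


def find_duplicates(new_description, existing, threshold=80.0):
    """Return (record_id, score) pairs scoring over the threshold, best first.

    `existing` is a hypothetical dict mapping record id -> description text.
    """
    scored = (
        (record_id, similarity_percent(new_description, text))
        for record_id, text in existing.items()
    )
    matches = [(rid, score) for rid, score in scored if score > threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

Lowering the threshold argument to 75.0 corresponds to the relaxed limit
mentioned in the message. Note that this compares the new record against
every stored description, which is exactly the cost the original poster wants
to move out of the scripting layer.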



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhlud...@griddynamics.com>
