Re: using score to find high confidence duplicates

2010-10-13 Thread Matt Mitchell
No, this isn't MLT, just the standard query parser for now. I did try the heuristic approach, and I might actually stick with it. I ran the process on known duplicates and collected all of the scores, which let me see how well the query worked. The scores seemed focused to one rang
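
The calibration step described above (score known duplicates, then look at where the scores fall) could be sketched roughly as follows. This is only an illustration: the function name, the `quantile` parameter, and the idea of using a low quantile as the cutoff are assumptions, not something stated in the message.

```python
def threshold_from_known_duplicates(scores, quantile=0.05):
    """Pick a cutoff from scores observed on known duplicate pairs.

    `scores` and `quantile` are hypothetical inputs: the scores are whatever
    the query returned for pairs already known to be duplicates.
    """
    if not scores:
        raise ValueError("need at least one score")
    ordered = sorted(scores)
    # Position of the chosen lower quantile, clamped to a valid index.
    idx = min(len(ordered) - 1, max(0, int(quantile * len(ordered))))
    return ordered[idx]
```

A new match scoring at or above the returned value would then be treated as a likely duplicate. Because raw scores depend on the query and the index, the cutoff would probably need to be recomputed whenever either changes.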

Re: using score to find high confidence duplicates

2010-10-13 Thread Peter Karich
Hi, are you using moreLikeThis for that feature? I have no suggestion for a reliable threshold; I think this depends on the domain you are operating in and is IMO only solvable with a heuristic. It also depends on fields, boosts, ... It could be that there is a 'score gap' between duplicates and none
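
One way to look for the 'score gap' mentioned above is to sort the scores for a result list and find the largest drop between neighbours. This is a minimal sketch under that assumption; the function name and the sample scores in the usage note are made up for illustration.

```python
def largest_score_gap(scores):
    """Return (score_above, score_below) around the biggest drop, or None."""
    ordered = sorted(scores, reverse=True)
    if len(ordered) < 2:
        return None
    # Index i marks the gap between ordered[i] and ordered[i + 1].
    best_i = max(range(len(ordered) - 1),
                 key=lambda i: ordered[i] - ordered[i + 1])
    return ordered[best_i], ordered[best_i + 1]
```

For example, `largest_score_gap([9.1, 8.7, 8.9, 2.3, 1.9, 2.0])` returns `(8.7, 2.3)`, which suggests a cutoff somewhere between those two values. Whether such a gap exists at all depends on the data, fields, and boosts.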