No, this isn't MLT, just the standard query parser for now. I did try the heuristic approach, and I might actually stick with it. I ran the process on known duplicates and collected all of the scores, which let me see how well the query worked. The scores seemed to cluster in one range, which is promising.
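Roughly, the heuristic looks like the sketch below. This is only an illustration under some assumptions: the function names are hypothetical, the raw scores are assumed to arrive with the target's self-match first, and the 5th-percentile cutoff is an arbitrary starting point rather than a tested value.

```python
from statistics import quantiles


def normalize_scores(raw_scores):
    """Scale raw scores to 0-100, where the first score (the target
    matching itself) becomes 100."""
    if not raw_scores or raw_scores[0] <= 0:
        raise ValueError("expected the target's positive self-match score first")
    top = raw_scores[0]
    return [100.0 * s / top for s in raw_scores]


def threshold_from_known_duplicates(dup_scores, percentile=5):
    """Pick a cutoff from normalized scores of known duplicates.

    Returns the score below which roughly `percentile` percent of the
    known-duplicate scores fall (assumed heuristic, not a Solr feature).
    """
    if len(dup_scores) < 2:
        raise ValueError("need at least two known-duplicate scores")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 percentile cut points.
    cuts = quantiles(dup_scores, n=100, method="inclusive")
    return cuts[percentile - 1]


def is_probable_duplicate(score, threshold):
    """Accept a similar document when its normalized score meets the cutoff."""
    return score >= threshold
```

The idea is to collect normalized scores for pairs already known to be duplicates, derive the cutoff once, and then apply it to new similarity results. Whether the cutoff holds up depends on the fields and boosts in the query, so it would need re-checking whenever those change.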
I totally forgot about the de-duper, I'll have a look at that and see if I can get it to work.

Thanks for your help,
Matt

On Wed, Oct 13, 2010 at 3:00 PM, Peter Karich <peat...@yahoo.de> wrote:
> Hi,
>
> are you using moreLikeThis for that feature?
> I have no suggestion for a reliable threshold, I think this depends
> on the domain you are operating and is IMO only solvable with a heuristic.
> It also depends on fields, boosts, ...
> It could be that there is a 'score gap' between duplicates and none
> duplicates
> which you can try to find, but I don't know
>
> BTW: did you check: http://wiki.apache.org/solr/Deduplication
>
> If you need deduplication while querying you could determine
> a hashvalue from the procedure above and index that into a different field.
> Then you can use collapse feature on that field to remove duplicates.
>
> Regards,
> Peter.
>
>> I have a solr index full of documents that contain lots of duplicates.
>> The duplicates are not exact duplicates though. Each may vary slightly
>> in content.
>>
>> After indexing, I have a bit of code that loops through the entire
>> index just to get what I'm calling "target" documents. For each target
>> document, I then send another query to find similar documents to the
>> "target". This similarity query includes a clause to match the target
>> to itself, so I can have a normalized max score. This was the only way
>> I could figure out how to reasonably fix the scoring range. The
>> response always includes the target at the top, and similar documents
>> afterward. So I take the scores and scale to 0-100, where 100 is
>> always the target matching itself. So far so good...
>>
>> What I want to do is create a confidence score threshold, so I can
>> automatically accept similar documents that have a score above the
>> threshold. If my query *structure* never changes, but only the values
>> in the query change... is it possible to produce a reliable
>> "threshold" score that I could use?
>>
>> Hope this makes sense :)
>>
>> Matt
>>
>
>
> --
> http://jetwick.com twitter search prototype
>
>