I have a Solr index full of documents, many of which are duplicates. They are not exact duplicates, though: each may vary slightly in content.
After indexing, I have a bit of code that loops through the entire index to pick out what I'm calling "target" documents. For each target document, I send another query to find documents similar to it. This similarity query includes a clause that matches the target against itself, which gives me a maximum score to normalize against. This was the only way I could find to pin down the scoring range.

The response always has the target at the top, followed by the similar documents. I take the scores and scale them to 0-100, where 100 is always the target matching itself.

So far so good... What I want to do now is set a confidence score threshold, so I can automatically accept similar documents that score above it. If my query *structure* never changes and only the values in the query change, is it possible to come up with a reliable "threshold" score that I could use?

Hope this makes sense :)

Matt
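
P.S. In case it helps to see the normalization step concretely, here is a simplified Python sketch of the idea. It is not my actual code: the function names, the (id, score) tuple layout, and the sample scores are made up for illustration, and it assumes the raw scores have already been pulled out of the Solr response.

```python
def normalize_scores(results, target_id):
    """Scale raw scores to 0-100, where the target's self-match is 100.

    `results` is a list of (doc_id, raw_score) pairs from one similarity
    query. The target document is expected to be among them.
    """
    target_score = next(score for doc_id, score in results if doc_id == target_id)
    if target_score <= 0:
        raise ValueError("target self-match score must be positive")
    # Drop the target itself; it would always be exactly 100.
    return [
        (doc_id, 100.0 * score / target_score)
        for doc_id, score in results
        if doc_id != target_id
    ]


def accept_above(normalized, threshold):
    """Return the ids of documents whose normalized score meets the threshold."""
    return [doc_id for doc_id, pct in normalized if pct >= threshold]


# Hypothetical raw scores for one target and two candidates.
results = [("target", 8.0), ("dupA", 6.0), ("other", 2.0)]
normalized = normalize_scores(results, "target")
accepted = accept_above(normalized, threshold=70.0)
```

With these made-up numbers, "dupA" lands at 75 and "other" at 25, so a threshold of 70 would accept only "dupA". The open question is whether any fixed threshold like that stays meaningful from one target to the next.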