I have a Solr index that contains lots of duplicate documents. They
are not exact duplicates, though: each one may vary slightly in
content.

After indexing, I have a bit of code that loops through the entire
index just to get what I'm calling "target" documents. For each target
document, I then send another query to find similar documents to the
"target". This similarity query includes a clause to match the target
to itself, so I can have a normalized max score. This was the only way
I could figure out how to reasonably fix the scoring range. The
response always includes the target at the top, and similar documents
afterward. So I take the raw scores and scale them to 0-100, where 100
is always the target matching itself. So far so good...
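
To make the scaling concrete, here is a rough sketch of what I mean.
It is illustrative only: the function name and the "id"/"score" keys
are placeholders for whatever your client hands back, and the sample
scores are made up.

```python
def scale_scores(docs, target_id):
    """Scale raw scores to 0-100 relative to the target's self-match score."""
    # The target matches itself, so its score is the maximum for this query.
    target_score = next(d["score"] for d in docs if d["id"] == target_id)
    if target_score <= 0:
        raise ValueError("target self-match score must be positive")
    return [
        {"id": d["id"], "scaled": 100.0 * d["score"] / target_score}
        for d in docs
    ]


# Made-up response: the target "A" first, similar documents afterward.
docs = [
    {"id": "A", "score": 8.0},
    {"id": "B", "score": 6.0},
    {"id": "C", "score": 2.0},
]
scaled = scale_scores(docs, "A")
```

With these numbers the target scales to 100, "B" to 75 and "C" to 25.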

What I want to do is create a confidence score threshold, so I can
automatically accept similar documents that have a score above the
threshold. If my query *structure* never changes, but only the values
in the query change... is it possible to produce a reliable
"threshold" score that I could use?
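
In code, the acceptance step I have in mind would look roughly like
this. It is only a sketch: the names are hypothetical, and the
threshold of 90 is an arbitrary placeholder, since choosing that value
is exactly what I am asking about.

```python
THRESHOLD = 90.0  # arbitrary placeholder, not a recommended value


def accept_similar(scaled_docs, target_id, threshold=THRESHOLD):
    """Return ids of non-target documents whose scaled score meets the threshold."""
    return [
        d["id"]
        for d in scaled_docs
        if d["id"] != target_id and d["scaled"] >= threshold
    ]


# Made-up scaled scores (0-100) for one target and three candidates.
scaled_docs = [
    {"id": "A", "scaled": 100.0},
    {"id": "B", "scaled": 93.5},
    {"id": "C", "scaled": 90.0},
    {"id": "D", "scaled": 41.0},
]
accepted = accept_similar(scaled_docs, "A")
```

Here "B" and "C" would be accepted automatically and "D" would not.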

Hope this makes sense :)

Matt
