Re: per-fieldtype similarity not working

Robert Muir Fri, 08 Jun 2012 04:05:34 -0700

On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
> Thanks Robert,
>
> The difference in scores is clear now so it shouldn't matter as queryNorm 
> doesn't affect ranking but coord does. Can you explain why coord is left out 
> now and why it is considered to skew results and why queryNorm skews results? 
> And which specific new ranking algorithms they confuse, BM25F?


I think it's easiest to compare the two TF normalization functions.
DefaultSimilarity really needs something like coord because its
function (sqrt) keeps growing without bound as a single term repeats.
On the other hand, consider BM25's: tf/(tf+lengthNorm). It saturates
rather quickly for a single term, so when multiple terms are being
scored, huge numbers of occurrences of a single term won't dominate
the overall score.

You can see this visually here (give it a second to load, and imagine
documentLength = averageDocumentLength and k=1.2):
http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100
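
The same comparison can be sketched numerically. This is only a rough
illustration, not Lucene's actual scoring code: it assumes
documentLength = averageDocumentLength and k=1.2 as in the plot, and it
ignores idf, norms, coord and queryNorm entirely. The function names and
the two example documents are made up for the illustration.

```python
import math

K = 1.2  # assumed constant, the same value used in the plot above


def sqrt_tf(tf):
    """sqrt-style TF, as described for DefaultSimilarity."""
    return math.sqrt(tf)


def saturating_tf(tf, k=K):
    """Simplified BM25-style TF: tf / (tf + k), always below 1."""
    return tf / (tf + k)


def score(term_freqs, tf_fn):
    """Sum the per-term TF contributions for a multi-term query."""
    return sum(tf_fn(tf) for tf in term_freqs)


# Hypothetical two-term query against two hypothetical documents:
doc_a = [100, 0]  # matches only the first term, 100 times
doc_b = [2, 2]    # matches both terms, twice each

for name, fn in (("sqrt", sqrt_tf), ("saturating", saturating_tf)):
    print(f"{name:>10}: doc_a={score(doc_a, fn):.3f} "
          f"doc_b={score(doc_b, fn):.3f}")
```

With sqrt the single repeated term lets doc_a outscore doc_b, which is
the case coord compensates for. With the saturating function doc_b,
which matches both terms, comes out ahead without any coord factor.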

>
> Also, I would expect the default SchemaSimilarityFactory to behave the same 
> as DefaultSimilarity; this might raise some further confusion down the line.

That's OK: I'd rather the very expert case (per-field scoring) be
trickier than have a trap for people who try to use any algorithm
other than TFIDFSimilarity.

-- 
lucidimagination.com
