On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Thanks Robert,
>
> The difference in scores is clear now so it shouldn't matter as queryNorm
> doesn't affect ranking but coord does. Can you explain why coord is left out
> now and why it is considered to skew results and why queryNorm skews results?
> And which specific new ranking algorithms they confuse, BM25F?

I think it's easiest to compare the two TF normalization functions.
DefaultSimilarity really needs something like coord because its TF function
(sqrt) grows very fast for a single term. On the other hand, consider BM25's:
tf/(tf+lengthNorm). It saturates rather quickly for a single term, so when
multiple terms are being scored, huge numbers of occurrences of a single term
won't dominate the overall score.

You can see this visually here (give it a second to load, and imagine
documentLength = averageDocumentLength and k=1.2):
http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100

> Also, i would expect the default SchemaSimilarityFactory to behave the same
> as DefaultSimilarity this might raise some further confusion down the line.

That's ok: I'd rather the very expert case (Per-Field scoring) be trickier than
have a trap for people that try to use any algorithm other than TFIDFSimilarity.

-- 
lucidimagination.com
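
P.S. If the plot doesn't load, the same comparison can be done numerically. This is only a rough sketch, not Lucene code: the function names are made up for illustration, and it assumes documentLength = averageDocumentLength and k=1.2, so the BM25 term reduces to tf/(tf+1.2).

```python
import math

K = 1.2  # assumed BM25 k parameter, same value as in the plot


def sqrt_tf(tf):
    """TF component as described for DefaultSimilarity: sqrt(tf)."""
    return math.sqrt(tf)


def saturating_tf(tf, k=K):
    """Simplified BM25-style TF: tf / (tf + k), assuming docLen == avgDocLen."""
    return tf / (tf + k)


if __name__ == "__main__":
    # Print both curves side by side for a few term frequencies.
    for tf in (1, 2, 5, 10, 50, 100):
        print(f"tf={tf:>3}  sqrt={sqrt_tf(tf):6.3f}  bm25-like={saturating_tf(tf):.3f}")
```

The sqrt column keeps climbing (it reaches 10.0 at tf=100), while the BM25-like column stays below 1 no matter how large tf gets.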