On Thu, Jul 19, 2012 at 11:11 AM, Aaron Daubman <daub...@gmail.com> wrote:

> Apologies if I didn't clearly state my goal/concern: I am not looking for
> the exact same scoring - I am looking to explain scoring differences.
>  Deprecated components will eventually go away, time moves on, etc...
> etc... I would like to be able to run current code, and should be able to -
> the part that is sticking is being able to *explain* the difference in
> results.
>

OK: I totally missed that, sorry!

To explain why you see such a large difference:

The difference is that these length normalizations are computed at
index time and squeezed into a *single byte* by default. This keeps
RAM usage low when you have many documents and many fields with norms
(since it costs #fieldsWithNorms * #documents bytes of RAM).
So the encoding is lossy: basically you can think of there being only
256 possible values. When you increased the number of terms slightly
by changing your analysis, that happened to bump you over the edge,
rounding you up to the next value.
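
As a rough illustration (untested sketch; assumes lucene-core 3.6 on
the classpath and the classic 1/sqrt(numTerms) length norm from
DefaultSimilarity), you can print what the default single-byte encoder
does to nearby field lengths:

import org.apache.lucene.util.SmallFloat;

public class NormQuantizationDemo {
  public static void main(String[] args) {
    // For a range of field lengths, compute the classic 3.x length norm
    // (1/sqrt(numTerms)), encode it with the default single-byte encoder
    // (SmallFloat.floatToByte315), and decode it again. Nearby lengths
    // often collapse onto the same byte; adding a term or two sometimes
    // crosses a boundary and decodes to the next step instead.
    for (int numTerms = 95; numTerms <= 105; numTerms++) {
      float norm = (float) (1.0 / Math.sqrt(numTerms));
      byte encoded = SmallFloat.floatToByte315(norm);
      float decoded = SmallFloat.byte315ToFloat(encoded);
      System.out.printf("terms=%3d  norm=%.6f  byte=%3d  decoded=%.6f%n",
          numTerms, norm, encoded & 0xFF, decoded);
    }
  }
}

(The real computeNorm also folds in index-time boost and may skip
position-increment overlaps, but this shows the quantization effect.)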

More information:
http://lucene.apache.org/core/3_6_0/scoring.html
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

By the way, if you don't like this:
1. if you can still live with a single byte, maybe plug your own
Similarity class into 3.6, overriding decodeNormValue/encodeNormValue.
For example, you could use a different SmallFloat configuration that
has less range but more precision for your use case (if your docs are
all short or whatever) -- see the sketch after this list.
2. otherwise, if you feel you need more than a single byte, check out
4.0-ALPHA: you aren't limited to a single byte there.
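
For option 1, something like this untested sketch (assumes 3.6's
DefaultSimilarity and SmallFloat; floatToByte52, with 5 mantissa bits,
is just one example of trading range for precision, and the class name
is made up):

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.util.SmallFloat;

/**
 * Still stores norms in one byte, but uses a SmallFloat encoding with
 * more mantissa bits (and correspondingly less range) than the default
 * floatToByte315, so short fields get finer-grained length norms.
 * The same Similarity must be used at index time and search time,
 * otherwise the stored bytes are decoded with the wrong table.
 */
public class PreciseNormSimilarity extends DefaultSimilarity {

  @Override
  public byte encodeNormValue(float f) {
    // floatToByte52 = 5 mantissa bits, zero exponent 2:
    // finer steps than the 3-mantissa-bit default, but a smaller range.
    return SmallFloat.floatToByte52(f);
  }

  @Override
  public float decodeNormValue(byte b) {
    return SmallFloat.byte52ToFloat(b);
  }
}

You'd have to set it in both places (IndexWriterConfig.setSimilarity at
index time and IndexSearcher.setSimilarity at query time, or the
<similarity> element in schema.xml if you're on Solr) and reindex,
because the norm bytes already in the index were written with the old
encoder.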

-- 
lucidimagination.com
