On Thu, Jul 19, 2012 at 11:11 AM, Aaron Daubman <daub...@gmail.com> wrote:
> Apologies if I didn't clearly state my goal/concern: I am not looking for
> the exact same scoring - I am looking to explain scoring differences.
> Deprecated components will eventually go away, time moves on, etc...
> etc... I would like to be able to run current code, and should be able to -
> the part that is sticking is being able to *explain* the difference in
> results.

OK: I totally missed that, sorry!

To explain why you see such a large difference: these length normalizations are computed at index time and fit inside a *single byte* by default. This keeps RAM usage low when you have many documents and many fields with norms (it's #fieldsWithNorms * #documents bytes in RAM). So the encoding is lossy: basically you can think of there being only 256 possible values. When you increased the number of terms only slightly by changing your analysis, that happened to bump you over the edge, rounding you up to the next value.

more information:
http://lucene.apache.org/core/3_6_0/scoring.html
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

by the way, if you don't like this:

1. if you can still live with a single byte, maybe plug in your own Similarity class into 3.6, overriding decodeNormValue/encodeNormValue. For example, you could use a different SmallFloat configuration that has less range but more precision for your use case (if your docs are all short or whatever) - see the sketch below.

2. otherwise, if you feel you need more than a single byte, check out 4.0-ALPHA: you aren't limited to a single byte there.

-- 
lucidimagination.com
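P.S. Here is a rough, untested sketch of what option 1 could look like (the class name and the 5-mantissa-bit / zero-exponent-2 parameters are just for illustration - pick whatever fits your document length distribution):

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.util.SmallFloat;

/**
 * Example of option 1: keep single-byte norms, but encode them with more
 * mantissa bits (finer precision) and a smaller exponent range than the
 * default 3-mantissa-bit / zero-exponent-15 encoding.
 */
public class HigherPrecisionNormSimilarity extends DefaultSimilarity {

  // more precision, less range than the default encoding - adjust to taste
  private static final int MANTISSA_BITS = 5;
  private static final int ZERO_EXPONENT = 2;

  // like the default implementation, cache all 256 decoded values up front
  private static final float[] NORM_TABLE = new float[256];
  static {
    for (int i = 0; i < 256; i++) {
      NORM_TABLE[i] = SmallFloat.byteToFloat((byte) i, MANTISSA_BITS, ZERO_EXPONENT);
    }
  }

  @Override
  public byte encodeNormValue(float f) {
    return SmallFloat.floatToByte(f, MANTISSA_BITS, ZERO_EXPONENT);
  }

  @Override
  public float decodeNormValue(byte b) {
    return NORM_TABLE[b & 0xFF];
  }
}

Keep in mind you'd have to use the same Similarity at both index and search time (e.g. IndexWriterConfig.setSimilarity / IndexSearcher.setSimilarity, or the <similarity> element in Solr's schema.xml) and reindex, since any norms already in your index were encoded with the old table.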