FWIW, length for normalization is measured in terms (tokens), not
characters.

With TF-IDF similarity (the default before 6.0), the normalization is based
on the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
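
To make the formula concrete, here is a minimal standalone sketch (plain
Java, no Lucene dependency) that evaluates the same expression for a few
field lengths. The names LengthNormDemo and lengthNorm are made up for
illustration, and the boost is assumed to be 1.0. As far as I know, Lucene
also compresses the norm into a single byte at index time, so the fieldNorm
values you see in explain output are coarser than these raw numbers.

```java
// Illustrative only: mirrors the ClassicSimilarity expression quoted above
// without depending on Lucene. Class and method names are hypothetical.
class LengthNormDemo {

    // boost * 1/sqrt(numTerms), where numTerms is the token count of the field
    static float lengthNorm(float boost, int numTerms) {
        return boost * ((float) (1.0 / Math.sqrt(numTerms)));
    }

    public static void main(String[] args) {
        // Print the raw (unencoded) norm for fields of 1 to 10 terms.
        for (int numTerms = 1; numTerms <= 10; numTerms++) {
            System.out.printf("numTerms=%d lengthNorm=%.4f%n",
                    numTerms, lengthNorm(1.0f, numTerms));
        }
    }
}
```

The point to take away is that the curve is steepest for short fields:
going from 5 to 7 terms changes the raw value far more than going from 50
to 52 terms does.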

You can always write your own custom Similarity class to override that
calculation.
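
As a rough sketch of what such an override could compute, the snippet
below clamps the term count to a minimum so that every field shorter than
the threshold gets the same norm. It is plain Java with no Lucene
dependency; FlooredLengthNorm and MIN_TERMS are hypothetical names, and the
threshold of 10 is an arbitrary assumption, not a recommendation. In a real
setup the same arithmetic would go into lengthNorm(FieldInvertState) of a
ClassicSimilarity subclass, and since norms are written at index time you
would need to reindex after switching.

```java
// Illustrative only: the arithmetic a custom Similarity could use to
// flatten the norm for short fields. All names here are hypothetical.
class FlooredLengthNorm {

    // Assumed threshold: fields with fewer terms are treated as this long.
    static final int MIN_TERMS = 10;

    static float lengthNorm(float boost, int numTerms) {
        // Clamp the term count so short fields all share one norm value.
        int effectiveTerms = Math.max(numTerms, MIN_TERMS);
        return boost * ((float) (1.0 / Math.sqrt(effectiveTerms)));
    }

    public static void main(String[] args) {
        // Fields of 5 and 7 terms now get identical norms; 40 terms is lower.
        int[] lengths = {5, 7, 10, 40};
        for (int numTerms : lengths) {
            System.out.printf("numTerms=%d lengthNorm=%.4f%n",
                    numTerms, lengthNorm(1.0f, numTerms));
        }
    }
}
```

Note that the limit is expressed in terms, not characters, for the reason
given at the top of this message.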

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, <jimi.hulleg...@svensktnaringsliv.se>
wrote:

> Hi,
>
> In general I think that the fieldNorm factor in the score calculation is
> quite good. But when the text is short I think that the effect is too big.
>
> I.e. with two documents that have a short text in the same field, just a few
> extra characters in one of the documents lower the fieldNorm factor too much.
> In one test the text in document 1 is 30 characters long and has fieldNorm
> 0.4375, and in document 2 the text is 37 characters long and has fieldNorm
> 0.375. That means that the first document gets almost a 20% higher score
> simply because of the 7 character difference.
>
> What are my options if I want to change this behavior? Can I set a lower
> character limit, meaning that all fields with a length below this limit
> get the same fieldNorm value?
>
> I know I can force fieldNorm to be 1 by setting omitNorms="true" for that
> field, but I would prefer to still have it, just limit its effect on short
> texts.
>
> Regards
> /Jimi
>
>
>
