Re: Is it possible to configure a minimum field length for the fieldNorm value?

jimi.hullegard Wed, 20 Apr 2016 16:18:06 -0700

I am talking about the title field. And for the title field, a sweetspot 
interval of 1 to 50 makes very little sense. I want to have a fieldNorm value 
that differentiates between for example 2, 3, 4 and 5 terms in the title, but 
only very little.

The 20% number I got by simply calculating the difference in the title 
fieldNorm of two documents, where one title was one word longer than the other 
title. And one fieldNorm value was 20% larger than the other as a result of 
that. And since we use multiplicative scoring calculation, a 20% increase in 
the fieldNorm results in a 20% increase in the final score.

I'm not talking about "scores as percentages". I'm simply noting that this 
minor change in the text data (adding or removing one single word) causes the 
score to change by almost 20%. I noted this when I renamed a document, 
removing a word from the title, and that single change caused the document to 
move up several positions in the result list. We don't want such minor 
modifications to have such a big impact on the resulting score.

I'm not sure I can agree with you that "the effect of document length 
normalization factor is minimal". Then why does it impact our result in such a 
big way? And as I said, we don't want to disable it completely, we just want it 
to have a much lesser effect, even on really short texts.

/Jimi

________________________________________
From: Ahmet Arslan <iori...@yahoo.com.INVALID>
Sent: Thursday, April 21, 2016 12:10 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

Please define a meaningful document-length range like min=1 max=50.
By the way, you need to reindex every time you change something.

Regarding the 20% score change, I am not sure how you calculated that number, 
but I assume it is correct.
What really matters is the relative order of documents. It doesn't mean 
anything that the addition of a word decreases the initial score by x%. Please 
see:
https://wiki.apache.org/lucene-java/ScoresAsPercentages

There is an information retrieval heuristic which says that the addition of a 
non-query term should decrease the score.

Lucene's default document length normalization may favor short documents too 
much. But folks blend the score with other structural fields (popularity), or 
even completely bypass the relevancy score and order by price, production date 
etc. I mean there are many use cases where the effect of the document length 
normalization factor is minimal.

Lucene/Solr is highly pluggable and very easy to customize.

Ahmet

On Wednesday, April 20, 2016 11:05 PM, "jimi.hulleg...@svensktnaringsliv.se" 
<jimi.hulleg...@svensktnaringsliv.se> wrote:
Hi Ahmet,

SweetSpotSimilarity seems quite nice. Some simple testing by throwing some 
different values at the class gives quite good results. Setting ln_min=1, 
ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less 
what I want. At least for the title field. I'm not sure what the actual effect 
of those settings would be on longer text fields, so maybe I will use the 
SweetSpotSimilarity only for the title field to start with.
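
To see what those settings do to short titles, here is a minimal sketch in 
plain Java (no Lucene dependency). It assumes the length norm formula given in 
the SweetSpotSimilarity documentation, 
1/sqrt(steepness * (|x-min| + |x-max| - (max-min)) + 1), and it ignores both 
discountOverlaps and the lossy encoding applied when the norm is stored in the 
index. The class and method names are illustrative only.

```java
// Sketch only: reproduces the documented SweetSpotSimilarity length norm
// formula in plain Java. Not the Lucene class itself.
final class SweetSpotNormSketch {

    // numTerms is the field length in terms (tokens), not characters.
    static float lengthNorm(int numTerms, int min, int max, float steepness) {
        int distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * distance + 1.0));
    }

    public static void main(String[] args) {
        // Settings from the message: ln_min=1, ln_max=2, steepness=0.1
        for (int numTerms = 1; numTerms <= 6; numTerms++) {
            System.out.printf("terms=%d norm=%.4f%n",
                    numTerms, lengthNorm(numTerms, 1, 2, 0.1f));
        }
    }
}
```

With these parameters, titles of one and two terms get a norm of 1.0, and each 
extra term after that lowers the raw norm by a few percent rather than by the 
larger steps of 1/sqrt(numTerms).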

Of course I understand that there are many things that can be considered domain 
specific requirements, such as whether to favor or punish short, medium or long 
texts, and how. I was just wondering how many actual use cases there are where 
one wants 
a ~20% difference in score between two documents, where the only difference is 
that one of the documents has one extra word in one field. (And now I'm talking 
about an extra word that doesn't affect anything else except the fieldNorm 
value). I for one find it hard to find such a use case, and would consider it a 
very special use case, and would consider a more lenient calculation a better 
fit for most use cases (and therefore most domains). :)

/Jimi

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
Sent: Wednesday, April 20, 2016 8:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

SweetSpotSimilarity allows you to define a document length range, so that all 
documents in that range will get the same fieldNorm value.
In your case, you can say that documents from 1 word up to 100 words get no 
document length punishment, and that documents longer than 100 words get some.

By the way, favoring or punishing short, medium, or long documents is a 
domain-specific thing. You are free to decide what to do.

Ahmet

On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se" 
<jimi.hulleg...@svensktnaringsliv.se> wrote:
OK. Well, still, the fact that the score changes by almost 20% because of just 
one extra term in the field is not really reasonable if you ask me. But you 
seem to say that this is expected, reasonable and wanted behavior for most use 
cases?

I'm not sure that I feel comfortable replacing the default Similarity 
implementation with a custom one. That would just increase the complexity of 
our setup and would make future upgrades harder (we would for example have to 
remember to check if the default similarity configuration or implementation 
changes).

No, if it really is the case that most people like and want this, and there is 
no way to configure Solr/Lucene to calculate fieldNorm in a more reasonable way 
(in my book) for short field values, then I just think we are forced to set 
omitNorms="true", maybe in combination with a simple field boost for shorter 
fields.

/Jimi

-----Original Message-----
From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Wednesday, April 20, 2016 5:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

FWIW, length for normalization is measured in terms (tokens), not characters.

With TF-IDF similarity (the default before 6.0), the normalization is based on 
the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
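
A minimal sketch of how that formula leads to the fieldNorm values reported in 
this thread, in plain Java without any Lucene dependency. It assumes a field 
boost of 1 and assumes that the stored norm keeps only the sign, the exponent 
and the top two mantissa bits of the float, which is consistent with the 0.4375 
and 0.375 values quoted below. The class and method names are illustrative 
only.

```java
// Sketch only: plain Java, no Lucene dependency. Names are illustrative.
final class ClassicNormSketch {

    // Raw length norm as in the line quoted above, with a field boost of 1.
    static float rawNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Assumption: the stored norm is the float truncated to its sign bit,
    // exponent and top two mantissa bits (the low 21 bits are dropped).
    static float storedNorm(float norm) {
        return Float.intBitsToFloat(Float.floatToRawIntBits(norm) & 0xFFE00000);
    }

    public static void main(String[] args) {
        for (int numTerms = 1; numTerms <= 8; numTerms++) {
            float raw = rawNorm(numTerms);
            System.out.printf("terms=%d raw=%.4f stored=%.4f%n",
                    numTerms, raw, storedNorm(raw));
        }
    }
}
```

Under these assumptions a field with 5 terms stores 0.4375 and a field with 6 
terms stores 0.375. The raw values differ by about 9.5%, but after truncation 
the ratio is 0.4375 / 0.375, roughly 1.17, which is the "almost 20%" jump 
described in the original question.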

You can always write your own custom Similarity class to override that 
calculation.
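
One possible shape for such an override is a floor on the term count, so that 
every field shorter than the floor gets the same norm. The sketch below shows 
only the arithmetic in plain Java. MIN_TERMS and flooredNorm are hypothetical 
names, the value 10 is an arbitrary example, and the wiring into a custom 
Similarity class is not shown.

```java
// Sketch only: the arithmetic a custom length norm could use.
// MIN_TERMS and flooredNorm are hypothetical, not part of Lucene.
final class FlooredNormSketch {

    // Fields with fewer terms than this are treated as if they had this many.
    static final int MIN_TERMS = 10;

    static float flooredNorm(int numTerms, float boost) {
        int effectiveTerms = Math.max(numTerms, MIN_TERMS);
        return boost * (float) (1.0 / Math.sqrt(effectiveTerms));
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {2, 5, 6, 10, 20, 50}) {
            System.out.printf("terms=%d norm=%.4f%n",
                    numTerms, flooredNorm(numTerms, 1.0f));
        }
    }
}
```

All fields at or below the floor share one norm, so adding or removing a word 
in a short title no longer changes the score, while longer fields are still 
penalized by 1/sqrt(numTerms).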

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, <jimi.hulleg...@svensktnaringsliv.se>
wrote:

> Hi,
>
> In general I think that the fieldNorm factor in the score calculation
> is quite good. But when the text is short I think that the effect is too big.
>
> I.e., with two documents that have a short text in the same field, just a
> few extra characters in one of the documents lower the fieldNorm factor too
> much.
> In one test the text in document 1 is 30 characters long and has
> fieldNorm 0.4375, and in document 2 the text is 37 characters long and
> has fieldNorm 0.375. That means that the first document gets almost a
> 20% higher score simply because of the 7 character difference.
>
> What are my options if I want to change this behavior? Can I set a
> lower character limit, meaning that all fields with a length below
> this limit get the same fieldNorm value?
>
> I know I can force fieldNorm to be 1 by setting omitNorms="true" for
> that field, but I would prefer to still have it, just limit its effect
> on short texts.
>
> Regards
> /Jimi
>
>
>
