Yes, we do edismax per field boosting, with explicit boosting of the title 
field. So it sure makes length normalization less relevant. But not 
*completely* irrelevant, which is why I still want to have it as part of the 
scoring, just with much less impact than it currently has.
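
To illustrate why one extra word matters so much for short titles, here is a rough sketch in Python. It uses the 1/sqrt(numTerms) norm quoted further down in this thread (boost taken as 1). The quantize() step is my assumption about the lossy one-byte storage of the norm: it keeps three significant binary digits and truncates the rest. It is a simplified stand-in, not Lucene's actual encoder, and the function names are made up for this example.

```python
import math


def classic_length_norm(num_terms: int) -> float:
    """Unencoded norm from the thread: 1 / sqrt(numTerms), with boost = 1."""
    return 1.0 / math.sqrt(num_terms)


def quantize(value: float) -> float:
    """Keep three significant binary digits and truncate the rest.

    Assumption: this approximates the lossy one-byte norm storage for
    values in (0, 1]. It is a simplified stand-in, not Lucene's encoder.
    """
    # frexp gives value = mantissa * 2**exponent with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(value)
    return math.floor(mantissa * 8) / 8 * 2.0 ** exponent


def stored_norm(num_terms: int) -> float:
    """Norm as it would be seen at query time under the assumption above."""
    return quantize(classic_length_norm(num_terms))


if __name__ == "__main__":
    # Print raw and quantized norms for typical title lengths.
    for n in range(1, 11):
        print(n, round(classic_length_norm(n), 4), stored_norm(n))
```

Under these assumptions a 5-term field gets 0.4375 and a 6- or 7-term field gets 0.375, which are the two values from my original test. The ratio 0.4375 / 0.375 is about 1.17, so the coarse steps of the stored value account for most of the jump, not just the 1/sqrt curve itself.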

/Jimi
________________________________________
From: Jack Krupansky <jack.krupan...@gmail.com>
Sent: Thursday, April 21, 2016 4:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Or should this one be rated as more about NY, since it's shorter:

* New York

Another thought on length norms: with the advent of multi-field dismax with
per-field boosting, people tend to explicitly boost the title field so that
the traditional length normalization is less relevant.


-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:39 PM, Walter Underwood <wun...@wunderwood.org>
wrote:

> Sure, here are some real world examples from my time at Netflix.
>
> Is this movie twice as much about “new york”?
>
> * New York, New York
>
> Which one of these is the best match for “blade runner”:
>
> * Blade Runner: The Final Cut
> * Blade Runner: Theatrical & Director’s Cut
> * Blade Runner: Workprint
>
> http://dvd.netflix.com/Search?v1=blade+runner
>
> At Netflix (when I was there), those were shown in popularity order with a
> boost function.
>
> And for stemming, should the movie “Saw” match “see”? Maybe not.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Apr 20, 2016, at 5:28 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
> >
> > Maybe it's a cultural difference, but I can't imagine why on a query for
> > "John", any of those titles would be treated as anything other than
> equals
> > - namely, that they are all about John. Maybe the issue is that this
> seems
> > like a contrived example, and I'm asking for a realistic example. Or,
> maybe
> > you have some rule of relevance that you haven't yet shared - and I mean
> > rule that a user would comprehend and consider valuable, not simply a
> > mechanical rule.
> >
> >
> >
> > -- Jack Krupansky
> >
> > On Wed, Apr 20, 2016 at 8:10 PM, <jimi.hulleg...@svensktnaringsliv.se>
> > wrote:
> >
> >> Ok sure, I can try and give some examples :)
> >>
> >> Let's say that we have the following documents:
> >>
> >> Id: 1
> >> Title: John Doe
> >>
> >> Id: 2
> >> Title: John Doe Jr.
> >>
> >> Id: 3
> >> Title: John Lennon: The Life
> >>
> >> Id: 4
> >> Title: John Thompson's Modern Course for the Piano: First Grade Book
> >>
> >> Id: 5
> >> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> >> Youngest Member of Jackson's Staff from John Brown's Raid to the
> Hanging of
> >> Mrs. Surratt
> >>
> >>
> >> And in general, when a search word matches the title, I would like to
> have
> >> the length of the title field influence the score, so that matching
> >> documents with shorter title get a higher score than documents with
> longer
> >> title, all else considered equal.
> >>
> >> So, when a user searches for "John", I would like the results to be
> pretty
> >> much in the order presented above. Though, it is not crucial that for
> >> example document 1 comes before document 2. But I would surely want
> >> documents 1-3 to come before documents 4 and 5.
> >>
> >> In my mind, the fieldNorm is a perfect solution for this. At least in
> >> theory. In practice, the encoding of the fieldNorm seems to make this
> >> function much less useful for this use case. Unless I have missed
> something.
> >>
> >> Is there another way to achieve something like this? Note that I don't
> want
> >> a general boost on documents with short titles, I only want to boost
> them
> >> if the title field actually matched the query.
> >>
> >> /Jimi
> >>
> >> ________________________________________
> >> From: Jack Krupansky <jack.krupan...@gmail.com>
> >> Sent: Thursday, April 21, 2016 1:28 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Is it possible to configure a minimum field length for the
> >> fieldNorm value?
> >>
> >> I'm not sure I fully follow what distinction you're trying to focus on.
> I
> >> mean, traditionally length normalization has simply tried to
> distinguish a
> >> title field (rarely more than a dozen words) from a full body of text,
> or
> >> maybe an abstract, not things like exactly how many words were in a
> title.
> >> Or, as another example, a short newswire article of a few paragraphs
> vs. a
> >> feature-length article, paper, or even book. IOW, traditionally it was
> more
> >> of a boolean than a broad range of values. Sure, yes, you absolutely can
> >> define a custom similarity with a custom norm that supports a wide
> range of
> >> lengths, but you'll have to decide what you really want  to achieve to
> tune
> >> it.
> >>
> >> Maybe you could give a couple examples of field values that you feel
> should
> >> be scored differently based on length.
> >>
> >> -- Jack Krupansky
> >>
> >> On Wed, Apr 20, 2016 at 7:17 PM, <jimi.hulleg...@svensktnaringsliv.se>
> >> wrote:
> >>
> >>> I am talking about the title field. And for the title field, a
> sweetspot
> >>> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> >>> value that differentiates between for example 2, 3, 4 and 5 terms in
> the
> >>> title, but only very little.
> >>>
> >>> The 20% number I got by simply calculating the difference in the title
> >>> fieldNorm of two documents, where one title was one word longer than
> the
> >>> other title. And one fieldNorm value was 20% larger than the other as a
> >>> result of that. And since we use multiplicative scoring calculation, a
> >> 20%
> >>> increase in the fieldNorm results in a 20% increase in the final score.
> >>>
> >>> I'm not talking about "scores as percentages". I'm simply noting that
> >> this
> >>> minor change in the text data (adding or removing one single word)
> causes
> >>> the score to change by almost 20%. I noted this when I renamed a
> >>> document, removing a word from the title, and that single change caused
> >> the
> >>> document to move up several positions in the result list. We don't want
> >>> such minor modifications to have such a big impact on the resulting
> score.
> >>>
> >>> I'm not sure I can agree with you that "the effect of document length
> >>> normalization factor is minimal". Then why does it impact our result in
> >>> such a big way? And as I said, we don't want to disable it completely,
> we
> >>> just want it to have a much lesser effect, even on really short texts.
> >>>
> >>> /Jimi
> >>>
> >>> ________________________________________
> >>> From: Ahmet Arslan <iori...@yahoo.com.INVALID>
> >>> Sent: Thursday, April 21, 2016 12:10 AM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Is it possible to configure a minimum field length for the
> >>> fieldNorm value?
> >>>
> >>> Hi Jimi,
> >>>
> >>> Please define a meaningful document-length range like min=1 max=50.
> >>> By the way you need to reindex every time you change something.
> >>>
> >>> Regarding 20% score change, I am not sure how you calculated that
> number
> >>> and I assume it is correct.
> >>> What really matters is the relative order of documents. It doesn't mean
> >>> anything that the addition of a word decreases the initial score by x%. Please
> >> see:
> >>> https://wiki.apache.org/lucene-java/ScoresAsPercentages
> >>>
> >>> There is an information retrieval heuristic which says that addition
> of a
> >>> non-query term should decrease the score.
> >>>
> >>> Lucene's default document length normalization may favor short documents
> >>> too much. But folks blend score with other structural fields
> >> (popularity),
> >>> even completely bypass relevancy score and order by price, production
> >> date
> >>> etc. I mean, in many use cases the effect of document length
> >>> normalization factor is minimal.
> >>>
> >>> Lucene/Solr is highly pluggable, very easy to customize.
> >>>
> >>> Ahmet
> >>>
> >>>
> >>> On Wednesday, April 20, 2016 11:05 PM, "
> >>> jimi.hulleg...@svensktnaringsliv.se" <
> >> jimi.hulleg...@svensktnaringsliv.se>
> >>> wrote:
> >>> Hi Ahmet,
> >>>
> >>> SweetSpotSimilarity seems quite nice. Some simple testing by throwing
> >> some
> >>> different values at the class gives quite good results. Setting
> ln_min=1,
> >>> ln_max=2, steepness=0.1 and discountOverlaps=true should give me more
> or
> >>> less what I want. At least for the title field. I'm not sure what the
> >>> actual effect of those settings would be on longer text fields, so
> maybe
> >> I
> >>> will use the SweetSpotSimilarity only for the title field to start
> with.
> >>>
> >>> Of course I understand that there are many things that can be
> considered
> >>> domain specific requirements, like whether to favor/punish short/medium/long
> >>> texts, and how. I was just wondering how many actual use cases there
> are
> >>> where one wants a ~20% difference in score between two documents,
> where
> >>> the only difference is that one of the documents has one extra word in
> >> one
> >>> field. (And now I'm talking about an extra word that doesn't affect
> >>> anything else except the fieldNorm value). I for one find it hard to
> find
> >>> such a use case, and would consider it a very special use case, and
> would
> >>> consider a more lenient calculation a better fit for most use cases
> (and
> >>> therefore most domains). :)
> >>>
> >>> /Jimi
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
> >>> Sent: Wednesday, April 20, 2016 8:14 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Is it possible to configure a minimum field length for the
> >>> fieldNorm value?
> >>>
> >>> Hi Jimi,
> >>>
> >>> SweetSpotSimilarity allows you to define a document length range, so that
> >> all
> >>> documents in that range will get same fieldNorm value.
> >>> In your case, you can say that from 1 word up to 100 words do not
> employ
> >>> document length punishment. If a document is longer than 100 do some
> >>> punishment.
> >>>
> >>> By the way; favoring/punishing short, middle, or long documents is a
> >> domain
> >>> specific thing. You are free to decide what to do.
> >>>
> >>> Ahmet
> >>>
> >>>
> >>>
> >>> On Wednesday, April 20, 2016 7:46 PM, "
> >> jimi.hulleg...@svensktnaringsliv.se"
> >>> <jimi.hulleg...@svensktnaringsliv.se> wrote:
> >>> OK. Well, still, the fact that the score increases almost 20% because
> of
> >>> just one extra term in the field, is not really reasonable if you ask
> me.
> >>> But you seem to say that this is expected, reasonable and wanted
> behavior
> >>> for most use cases?
> >>>
> >>> I'm not sure that I feel comfortable replacing the default Similarity
> >>> implementation with a custom one. That would just increase the
> complexity
> >>> of our setup and would make future upgrades harder (we would for
> example
> >>> have to remember to check if the default similarity configuration or
> >>> implementation changes).
> >>>
> >>> No, if it really is the case that most people like and want this, and
> >>> there is no way to configure Solr/Lucene to calculate fieldNorm in a
> more
> >>> reasonable way (in my book) for short field values, then I just think
> we
> >>> are forced to set omitNorms="true", maybe in combination with a simple
> >>> field boost for shorter fields.
> >>>
> >>> /Jimi
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> >>> Sent: Wednesday, April 20, 2016 5:18 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Is it possible to configure a minimum field length for the
> >>> fieldNorm value?
> >>>
> >>> FWIW, length for normalization is measured in terms (tokens), not
> >>> characters.
> >>>
> >>> With TF-IDF similarity (the default before 6.0), the normalization is
> >> based
> >>> on the inverse square root of the number of terms in the field:
> >>>
> >>> return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
> >>>
> >>> That code is in ClassicSimilarity:
> >>>
> >>>
> >>
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
> >>>
> >>> You can always write your own custom Similarity class to override that
> >>> calculation.
> >>>
> >>> -- Jack Krupansky
> >>>
> >>> On Wed, Apr 20, 2016 at 10:43 AM, <jimi.hulleg...@svensktnaringsliv.se
> >
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> In general I think that the fieldNorm factor in the score calculation
> >>>> is quite good. But when the text is short I think that the effect is
> >> too
> >>> big.
> >>>>
> >>>> I.e. with two documents that have a short text in the same field, just a
> >>>> few extra characters in one of the documents lower the fieldNorm factor
> too
> >>> much.
> >>>> In one test the text in document 1 is 30 characters long and has
> >>>> fieldNorm 0.4375, and in document 2 the text is 37 characters long and
> >>>> has fieldNorm 0.375. That means that the first document gets almost a
> >>>> 20% higher score simply because of the 7 character difference.
> >>>>
> >>>> What are my options if I want to change this behavior? Can I set a
> >>>> lower character limit, meaning that all fields with a length below
> >>>> this limit get the same fieldNorm value?
> >>>>
> >>>> I know I can force fieldNorm to be 1 by setting omitNorms="true" for
> >>>> that field, but I would prefer to still have it, just limit its effect
> >>>> on short texts.
> >>>>
> >>>> Regards
> >>>> /Jimi
> >>>>
> >>>>
> >>>>
> >>>
> >>
>
>
