Re: Is it possible to configure a minimum field length for the fieldNorm value?

Jack Krupansky Wed, 20 Apr 2016 17:29:41 -0700

Maybe it's a cultural difference, but I can't imagine why on a query for
"John", any of those titles would be treated as anything other than equals
- namely, that they are all about John. Maybe the issue is that this seems
like a contrived example, and I'm asking for a realistic example. Or, maybe
you have some rule of relevance that you haven't yet shared - and I mean a
rule that a user would comprehend and consider valuable, not simply a
mechanical rule.




-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:10 PM, <[email protected]>
wrote:

> Ok sure, I can try and give some examples :)
>
> Let's say that we have the following documents:
>
> Id: 1
> Title: John Doe
>
> Id: 2
> Title: John Doe Jr.
>
> Id: 3
> Title: John Lennon: The Life
>
> Id: 4
> Title: John Thompson's Modern Course for the Piano: First Grade Book
>
> Id: 5
> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of
> Mrs. Surratt
>
>
> And in general, when a search word matches the title, I would like to have
> the length of the title field influence the score, so that matching
> documents with shorter titles get a higher score than documents with longer
> titles, all else being equal.
>
> So, when a user searches for "John", I would like the results to be pretty
> much in the order presented above. Though, it is not crucial that, for
> example, document 1 comes before document 2. But I would surely want
> documents 1-3 to come before documents 4 and 5.
>
> In my mind, the fieldNorm is a perfect solution for this. At least in
> theory. In practice, the encoding of the fieldNorm seems to make this
> function much less useful for this use case. Unless I have missed something.
>
> Is there another way to achieve something like this? Note that I don't want
> a general boost on documents with short titles, I only want to boost them
> if the title field actually matched the query.
>
> /Jimi
>
> ________________________________________
> From: Jack Krupansky <[email protected]>
> Sent: Thursday, April 21, 2016 1:28 AM
> To: [email protected]
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> I'm not sure I fully follow what distinction you're trying to focus on. I
> mean, traditionally length normalization has simply tried to distinguish a
> title field (rarely more than a dozen words) from a full body of text, or
> maybe an abstract, not things like exactly how many words were in a title.
> Or, as another example, a short newswire article of a few paragraphs vs. a
> feature-length article, paper, or even book. IOW, traditionally it was more
> of a boolean than a broad range of values. Sure, yes, you absolutely can
> define a custom similarity with a custom norm that supports a wide range of
> lengths, but you'll have to decide what you really want to achieve to tune
> it.
>
> Maybe you could give a couple examples of field values that you feel should
> be scored differently based on length.
>
> -- Jack Krupansky
>
> On Wed, Apr 20, 2016 at 7:17 PM, <[email protected]>
> wrote:
>
> > I am talking about the title field. And for the title field, a sweetspot
> > interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> > value that differentiates between for example 2, 3, 4 and 5 terms in the
> > title, but only very little.
> >
> > The 20% number I got by simply calculating the difference in the title
> > fieldNorm of two documents, where one title was one word longer than the
> > other title. And one fieldNorm value was 20% larger than the other as a
> > result of that. And since we use a multiplicative scoring calculation, a
> > 20% increase in the fieldNorm results in a 20% increase in the final
> > score.
> >
> > I'm not talking about "scores as percentages". I'm simply noting that
> > this minor change in the text data (adding or removing one single word)
> > causes the score to change by almost 20%. I noted this when I renamed a
> > document, removing a word from the title, and that single change caused
> > the document to move up several positions in the result list. We don't
> > want such minor modifications to have such a big impact on the resulting
> > score.
> >
> > I'm not sure I can agree with you that "the effect of document length
> > normalization factor is minimal". Then why does it impact our result in
> > such a big way? And as I said, we don't want to disable it completely, we
> > just want it to have a much lesser effect, even on really short texts.
> >
> > /Jimi
> >
> > ________________________________________
> > From: Ahmet Arslan <[email protected]>
> > Sent: Thursday, April 21, 2016 12:10 AM
> > To: [email protected]
> > Subject: Re: Is it possible to configure a minimum field length for the
> > fieldNorm value?
> >
> > Hi Jimi,
> >
> > Please define a meaningful document-length range like min=1 max=50.
> > By the way you need to reindex every time you change something.
> >
> > Regarding the 20% score change, I am not sure how you calculated that
> > number, but I assume it is correct.
> > What really matters is the relative order of documents. It doesn't mean
> > anything that the addition of a word decreases the initial score by x%.
> > Please see:
> > https://wiki.apache.org/lucene-java/ScoresAsPercentages
> >
> > There is an information retrieval heuristic which says that the addition
> > of a non-query term should decrease the score.
> >
> > Lucene's default document length normalization may favor short documents
> > too much. But folks blend the score with other structural fields
> > (popularity), or even completely bypass the relevancy score and order by
> > price, production date, etc. I mean there are many use cases where the
> > effect of the document length normalization factor is minimal.
> >
> > Lucene/Solr is highly pluggable, very easy to customize.
> >
> > Ahmet
> >
> >
> > On Wednesday, April 20, 2016 11:05 PM, "[email protected]"
> > <[email protected]> wrote:
> > Hi Ahmet,
> >
> > SweetSpotSimilarity seems quite nice. Some simple testing by throwing
> > some different values at the class gives quite good results. Setting
> > ln_min=1, ln_max=2, steepness=0.1 and discountOverlaps=true should give me
> > more or less what I want. At least for the title field. I'm not sure what
> > the actual effect of those settings would be on longer text fields, so
> > maybe I will use the SweetSpotSimilarity only for the title field to start
> > with.
> >
> > Of course I understand that there are many things that can be considered
> > domain specific requirements, like whether to favor/punish
> > short/medium/long texts, and how. I was just wondering how many actual use
> > cases there are where one wants a ~20% difference in score between two
> > documents, where the only difference is that one of the documents has one
> > extra word in one field. (And now I'm talking about an extra word that
> > doesn't affect anything else except the fieldNorm value). I for one find it
> > hard to find such a use case, and would consider it a very special use
> > case, and would consider a more lenient calculation a better fit for most
> > use cases (and therefore most domains). :)
> >
> > /Jimi
> >
> >
> > -----Original Message-----
> > From: Ahmet Arslan [mailto:[email protected]]
> > Sent: Wednesday, April 20, 2016 8:14 PM
> > To: [email protected]
> > Subject: Re: Is it possible to configure a minimum field length for the
> > fieldNorm value?
> >
> > Hi Jimi,
> >
> > SweetSpotSimilarity allows you to define a document length range, so that
> > all documents in that range will get the same fieldNorm value.
> > In your case, you can say that from 1 word up to 100 words do not employ
> > document length punishment. If a document is longer than 100 do some
> > punishment.
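
The plateau described above can be sketched in plain Java. This is only an illustration of the length-norm formula as documented for SweetSpotSimilarity, 1 / sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1), not the Lucene class itself. The class name SweetSpotNormDemo, the method name sweetSpotNorm and the steepness value 0.5 are assumptions made for this example; min=1 and max=100 are taken from the message above.

```java
public class SweetSpotNormDemo {

    // Length norm with a flat "sweet spot" between min and max (both inclusive).
    // Inside the range the penalty term is zero, so the norm is exactly 1.0.
    // Outside the range the norm decays, faster for larger steepness.
    static float sweetSpotNorm(int numTerms, int min, int max, float steepness) {
        float outside = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * outside + 1.0f));
    }

    public static void main(String[] args) {
        int min = 1;             // from the message: no punishment from 1 word ...
        int max = 100;           // ... up to 100 words
        float steepness = 0.5f;  // illustrative value, chosen for this sketch

        int[] lengths = {1, 2, 5, 50, 100, 101, 110, 200};
        for (int n : lengths) {
            System.out.printf("%3d terms: norm=%.4f%n", n, sweetSpotNorm(n, min, max, steepness));
        }
    }
}
```

Every length from 1 to 100 terms yields the same value, and only longer fields are penalized. The value that ends up in the index is still subject to the lossy one-byte norm encoding, so nearby raw values can collapse to the same stored norm.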
> >
> > By the way, favoring/punishing short, middle, or long documents is a
> > domain-specific thing. You are free to decide what to do.
> >
> > Ahmet
> >
> >
> >
> > On Wednesday, April 20, 2016 7:46 PM, "[email protected]"
> > <[email protected]> wrote:
> > OK. Well, still, the fact that the score increases almost 20% because of
> > just one extra term in the field, is not really reasonable if you ask me.
> > But you seem to say that this is expected, reasonable and wanted behavior
> > for most use cases?
> >
> > I'm not sure that I feel comfortable replacing the default Similarity
> > implementation with a custom one. That would just increase the complexity
> > of our setup and would make future upgrades harder (we would for example
> > have to remember to check if the default similarity configuration or
> > implementation changes).
> >
> > No, if it really is the case that most people like and want this, and
> > there is no way to configure Solr/Lucene to calculate fieldNorm in a more
> > reasonable way (in my book) for short field values, then I just think we
> > are forced to set omitNorms="true", maybe in combination with a simple
> > field boost for shorter fields.
> >
> > /Jimi
> >
> >
> >
> > -----Original Message-----
> > From: Jack Krupansky [mailto:[email protected]]
> > Sent: Wednesday, April 20, 2016 5:18 PM
> > To: [email protected]
> > Subject: Re: Is it possible to configure a minimum field length for the
> > fieldNorm value?
> >
> > FWIW, length for normalization is measured in terms (tokens), not
> > characters.
> >
> > With TFIDF similarity (the default before 6.0), the normalization is
> > based on the inverse square root of the number of terms in the field:
> >
> > return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
> >
> > That code is in ClassicSimilarity:
> >
> >
> > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
> >
> > You can always write your own custom Similarity class to override that
> > calculation.
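
To see why one extra term can move the fieldNorm from 0.4375 to 0.375, here is a small standalone sketch. It computes 1 / sqrt(numTerms) with a boost of 1.0 and then passes the result through a lossy one-byte encoding. The encode and decode methods are modeled on Lucene's SmallFloat.floatToByte315 and byte315ToFloat as I understand them, so treat them as an assumption rather than the library code. The class name FieldNormDemo and the method names are made up for this example.

```java
public class FieldNormDemo {

    // Classic length norm with an index-time boost of 1.0: 1 / sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Lossy float -> byte encoding, modeled on SmallFloat.floatToByte315
    // (assumption: reproduced here for illustration only).
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;           // drop the low 21 mantissa bits
        int zero = (63 - 15) << 3;        // offset of the smallest exponent kept
        if (small <= zero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;  // zero/negative, or underflow
        }
        if (small >= zero + 0x100) {
            return (byte) -1;             // overflow: largest representable value
        }
        return (byte) (small - zero);
    }

    // Inverse of encode; precision lost in encode is not recovered.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // The norm value as it would be read back after the one-byte round trip.
    static float storedNorm(int numTerms) {
        return decode(encode(lengthNorm(numTerms)));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.printf("%2d terms: raw=%.4f stored=%.4f%n",
                    n, lengthNorm(n), storedNorm(n));
        }
    }
}
```

Under these assumptions, 5 terms round-trip to 0.4375 and 6 terms to 0.375, which are the two values from the original question. The raw values 0.4472 and 0.4082 differ by about 10%, but the coarse encoding widens the gap to roughly 17%.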
> >
> > -- Jack Krupansky
> >
> > On Wed, Apr 20, 2016 at 10:43 AM, <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > In general I think that the fieldNorm factor in the score calculation
> > > is quite good. But when the text is short I think that the effect is
> > > too big.
> > >
> > > I.e. with two documents that have a short text in the same field, just a
> > > few extra characters in one of the documents lower the fieldNorm factor
> > > too much.
> > > In one test the text in document 1 is 30 characters long and has
> > > fieldNorm 0.4375, and in document 2 the text is 37 characters long and
> > > has fieldNorm 0.375. That means that the first document gets almost a
> > > 20% higher score simply because of the 7 character difference.
> > >
> > > What are my options if I want to change this behavior? Can I set a
> > > lower character limit, meaning that all fields with a length below
> > > this limit get the same fieldNorm value?
> > >
> > > I know I can force fieldNorm to be 1 by setting omitNorms="true" for
> > > that field, but I would prefer to still have it, just limit its effect
> > > on short texts.
> > >
> > > Regards
> > > /Jimi
> > >
> > >
> > >
> >
>
