Yes, we do edismax per-field boosting, with explicit boosting of the title field. So it sure makes length normalization less relevant. But not *completely* irrelevant, which is why I still want to have it as part of the scoring, just with much less impact than it currently has.
/Jimi
________________________________________
From: Jack Krupansky <jack.krupan...@gmail.com>
Sent: Thursday, April 21, 2016 4:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?

Or should this be higher rated about NY, since it's shorter:

* New York

Another thought on length norms: with the advent of multi-field dismax with per-field boosting, people tend to explicitly boost the title field so that the traditional length normalization is less relevant.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:39 PM, Walter Underwood <wun...@wunderwood.org> wrote:

> Sure, here are some real world examples from my time at Netflix.
>
> Is this movie twice as much about “new york”?
>
> * New York, New York
>
> Which one of these is the best match for “blade runner”:
>
> * Blade Runner: The Final Cut
> * Blade Runner: Theatrical & Director’s Cut
> * Blade Runner: Workprint
>
> http://dvd.netflix.com/Search?v1=blade+runner
>
> At Netflix (when I was there), those were shown in popularity order with a boost function.
>
> And for stemming, should the movie “Saw” match “see”? Maybe not.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Apr 20, 2016, at 5:28 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> >
> > Maybe it's a cultural difference, but I can't imagine why on a query for "John", any of those titles would be treated as anything other than equals - namely, that they are all about John. Maybe the issue is that this seems like a contrived example, and I'm asking for a realistic example. Or, maybe you have some rule of relevance that you haven't yet shared - and I mean a rule that a user would comprehend and consider valuable, not simply a mechanical rule.
> >
> > -- Jack Krupansky
> >
> > On Wed, Apr 20, 2016 at 8:10 PM, <jimi.hulleg...@svensktnaringsliv.se> wrote:
> >
> >> Ok sure, I can try and give some examples :)
> >>
> >> Let's say that we have the following documents:
> >>
> >> Id: 1
> >> Title: John Doe
> >>
> >> Id: 2
> >> Title: John Doe Jr.
> >>
> >> Id: 3
> >> Title: John Lennon: The Life
> >>
> >> Id: 4
> >> Title: John Thompson's Modern Course for the Piano: First Grade Book
> >>
> >> Id: 5
> >> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt
> >>
> >> And in general, when a search word matches the title, I would like to have the length of the title field influence the score, so that matching documents with a shorter title get a higher score than documents with a longer title, all else being equal.
> >>
> >> So, when a user searches for "John", I would like the results to be pretty much in the order presented above. Though, it is not crucial that for example document 1 comes before document 2. But I would surely want documents 1-3 to come before documents 4 and 5.
> >>
> >> In my mind, the fieldNorm is a perfect solution for this. At least in theory. In practice, the encoding of the fieldNorm seems to make this function much less useful for this use case. Unless I have missed something.
> >>
> >> Is there another way to achieve something like this? Note that I don't want a general boost on documents with short titles, I only want to boost them if the title field actually matched the query.
> >>
> >> /Jimi
> >>
> >> ________________________________________
> >> From: Jack Krupansky <jack.krupan...@gmail.com>
> >> Sent: Thursday, April 21, 2016 1:28 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?
> >>
> >> I'm not sure I fully follow what distinction you're trying to focus on. I mean, traditionally length normalization has simply tried to distinguish a title field (rarely more than a dozen words) from a full body of text, or maybe an abstract, not things like exactly how many words were in a title. Or, as another example, a short newswire article of a few paragraphs vs. a feature-length article, paper, or even book. IOW, traditionally it was more of a boolean than a broad range of values. Sure, yes, you absolutely can define a custom similarity with a custom norm that supports a wide range of lengths, but you'll have to decide what you really want to achieve to tune it.
> >>
> >> Maybe you could give a couple examples of field values that you feel should be scored differently based on length.
> >>
> >> -- Jack Krupansky
> >>
> >> On Wed, Apr 20, 2016 at 7:17 PM, <jimi.hulleg...@svensktnaringsliv.se> wrote:
> >>
> >>> I am talking about the title field. And for the title field, a sweetspot interval of 1 to 50 makes very little sense. I want to have a fieldNorm value that differentiates between for example 2, 3, 4 and 5 terms in the title, but only very little.
> >>>
> >>> The 20% number I got by simply calculating the difference in the title fieldNorm of two documents, where one title was one word longer than the other title. And one fieldNorm value was 20% larger than the other as a result of that. And since we use multiplicative scoring calculation, a 20% increase in the fieldNorm results in a 20% increase in the final score.
> >>>
> >>> I'm not talking about "scores as percentages". I'm simply noting that this minor change in the text data (adding or removing one single word) causes the score to change by almost 20%. I noted this when I renamed a document, removing a word from the title, and that single change caused the document to move up several positions in the result list. We don't want such minor modifications to have such a big impact on the resulting score.
> >>>
> >>> I'm not sure I can agree with you that "the effect of document length normalization factor is minimal". Then why does it impact our result in such a big way? And as I said, we don't want to disable it completely, we just want it to have a much lesser effect, even on really short texts.
> >>>
> >>> /Jimi
> >>>
> >>> ________________________________________
> >>> From: Ahmet Arslan <iori...@yahoo.com.INVALID>
> >>> Sent: Thursday, April 21, 2016 12:10 AM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?
> >>>
> >>> Hi Jimi,
> >>>
> >>> Please define a meaningful document-length range like min=1 max=50.
> >>> By the way you need to reindex every time you change something.
> >>>
> >>> Regarding the 20% score change, I am not sure how you calculated that number and I assume it is correct.
> >>> What really matters is the relative order of documents. It doesn't mean anything that the addition of a word decreases the initial score by x%. Please see:
> >>> https://wiki.apache.org/lucene-java/ScoresAsPercentages
> >>>
> >>> There is an information retrieval heuristic which says that the addition of a non-query term should decrease the score.
> >>>
> >>> Lucene's default document length normalization may favor short documents too much. But folks blend the score with other structural fields (popularity), or even completely bypass the relevancy score and order by price, production date etc. I mean there are many use cases where the effect of document length normalization factor is minimal.
> >>>
> >>> Lucene/Solr is highly pluggable, very easy to customize.
> >>>
> >>> Ahmet
> >>>
> >>> On Wednesday, April 20, 2016 11:05 PM, "jimi.hulleg...@svensktnaringsliv.se" <jimi.hulleg...@svensktnaringsliv.se> wrote:
> >>> Hi Ahmet,
> >>>
> >>> SweetSpotSimilarity seems quite nice. Some simple testing by throwing some different values at the class gives quite good results. Setting ln_min=1, ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less what I want. At least for the title field. I'm not sure what the actual effect of those settings would be on longer text fields, so maybe I will use the SweetSpotSimilarity only for the title field to start with.
> >>>
> >>> Of course I understand that there are many things that can be considered domain specific requirements, like whether to favor/punish short/medium/long texts, and how. I was just wondering how many actual use cases there are where one wants a ~20% difference in score between two documents, where the only difference is that one of the documents has one extra word in one field. (And now I'm talking about an extra word that doesn't affect anything else except the fieldNorm value). I for one find it hard to find such a use case, and would consider it a very special use case, and would consider a more lenient calculation a better fit for most use cases (and therefore most domains). :)
> >>>
> >>> /Jimi
> >>>
> >>> -----Original Message-----
> >>> From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
> >>> Sent: Wednesday, April 20, 2016 8:14 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?
> >>>
> >>> Hi Jimi,
> >>>
> >>> SweetSpotSimilarity allows you to define a document length range, so that all documents in that range will get the same fieldNorm value.
> >>> In your case, you can say that from 1 word up to 100 words do not employ document length punishment. If a document is longer than 100 do some punishment.
> >>>
> >>> By the way; favoring/punishing short, middle, or long documents is a domain specific thing. You are free to decide what to do.
> >>>
> >>> Ahmet
> >>>
> >>> On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se" <jimi.hulleg...@svensktnaringsliv.se> wrote:
> >>> OK. Well, still, the fact that the score increases almost 20% because of just one extra term in the field, is not really reasonable if you ask me. But you seem to say that this is expected, reasonable and wanted behavior for most use cases?
> >>>
> >>> I'm not sure that I feel comfortable replacing the default Similarity implementation with a custom one. That would just increase the complexity of our setup and would make future upgrades harder (we would for example have to remember to check if the default similarity configuration or implementation changes).
> >>>
> >>> No, if it really is the case that most people like and want this, and there is no way to configure Solr/Lucene to calculate fieldNorm in a more reasonable way (in my book) for short field values, then I just think we are forced to set omitNorms="true", maybe in combination with a simple field boost for shorter fields.
> >>>
> >>> /Jimi
> >>>
> >>> -----Original Message-----
> >>> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> >>> Sent: Wednesday, April 20, 2016 5:18 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?
> >>>
> >>> FWIW, length for normalization is measured in terms (tokens), not characters.
> >>>
> >>> With TF-IDF similarity (the default before 6.0), the normalization is based on the inverse square root of the number of terms in the field:
> >>>
> >>> return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
> >>>
> >>> That code is in ClassicSimilarity:
> >>>
> >>> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
> >>>
> >>> You can always write your own custom Similarity class to override that calculation.
> >>>
> >>> -- Jack Krupansky
> >>>
> >>> On Wed, Apr 20, 2016 at 10:43 AM, <jimi.hulleg...@svensktnaringsliv.se> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> In general I think that the fieldNorm factor in the score calculation is quite good. But when the text is short I think that the effect is too big.
> >>>>
> >>>> I.e. with two documents that have a short text in the same field, just a few characters extra in one of the documents lower the fieldNorm factor too much. In one test the text in document 1 is 30 characters long and has fieldNorm 0.4375, and in document 2 the text is 37 characters long and has fieldNorm 0.375. That means that the first document gets almost a 20% higher score simply because of the 7 character difference.
> >>>>
> >>>> What are my options if I want to change this behavior? Can I set a lower character limit, meaning that all fields with a length below this limit get the same fieldNorm value?
> >>>>
> >>>> I know I can force fieldNorm to be 1 by setting omitNorms="true" for that field, but I would prefer to still have it, just limit its effect on short texts.
> >>>>
> >>>> Regards
> >>>> /Jimi
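
For readers who want to reproduce the numbers in this thread, here is a rough standalone sketch. It is not Lucene source code; the class and method names are made up for illustration. It assumes an index-time boost of 1.0, and it only approximates the lossy one-byte norm storage by truncating the float to its two highest mantissa bits (the real encoding also clamps very small and very large values). Under those assumptions a 5-term field stores as 0.4375 and a 6- or 7-term field as 0.375, which matches the values in the original question; the ratio between them is about 1.167. The flooredNorm method shows what a "minimum field length" could look like inside a custom Similarity; it is a hypothetical idea, not an existing Solr setting.

```java
// Standalone sketch, NOT Lucene source. All names here are hypothetical.
final class FieldNormSketch {

    // Raw length norm quoted in the thread for ClassicSimilarity,
    // assuming an index-time boost of 1.0: 1 / sqrt(numTerms).
    static float rawNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Assumption: approximate the lossy one-byte norm storage by keeping the
    // sign, the exponent and only the top two mantissa bits of the float
    // (truncating, not rounding). Range clamping at the extremes is ignored.
    static float quantize(float norm) {
        int bits = Float.floatToIntBits(norm) & 0xFFE00000;
        return Float.intBitsToFloat(bits);
    }

    // Hypothetical "minimum field length": a field with fewer than minTerms
    // terms is scored as if it had exactly minTerms terms.
    static float flooredNorm(int numTerms, int minTerms) {
        return quantize(rawNorm(Math.max(numTerms, minTerms)));
    }

    public static void main(String[] args) {
        // Print raw, stored and floored norms for short fields.
        for (int n = 1; n <= 8; n++) {
            System.out.printf("terms=%d raw=%.4f stored=%.4f floored(min=6)=%.4f%n",
                    n, rawNorm(n), quantize(rawNorm(n)), flooredNorm(n, 6));
        }
    }
}
```

With a floor of 6 terms, every title of up to 6 terms gets the same norm and only longer titles are penalized. This is the opposite trade-off from omitNorms="true", which removes the length effect for all lengths.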