I'm not sure I fully follow what distinction you're trying to focus on. I mean, traditionally length normalization has simply tried to distinguish a title field (rarely more than a dozen words) from a full body of text, or maybe an abstract, not things like exactly how many words were in a title. Or, as another example, a short newswire article of a few paragraphs vs. a feature-length article, paper, or even book. IOW, traditionally it was more of a boolean than a broad range of values. Sure, yes, you absolutely can define a custom similarity with a custom norm that supports a wide range of lengths, but you'll have to decide what you really want to achieve to tune it.
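One way to picture such a custom norm is a minimal sketch in plain Java, assuming a hypothetical `offset` parameter that is not part of Lucene or Solr. It is only the arithmetic, not a Similarity implementation. With `offset = 0` it reduces to the inverse square root of the term count; a larger offset flattens the curve so that titles of 2, 3, 4, and 5 terms differ only slightly. Stored norms also appear to be coarsely rounded (the 0.4375 and 0.375 reported further down the thread are not exact inverse square roots), so very small differences may not survive encoding.

```java
public class DampenedLengthNorm {

    // Hypothetical helper, not a Lucene API.
    // Returns sqrt((1 + offset) / (numTerms + offset)), so a one-term field
    // always gets 1.0 and offset = 0 gives the plain 1 / sqrt(numTerms).
    static double dampenedNorm(int numTerms, int offset) {
        int n = Math.max(1, numTerms); // treat empty fields like one-term fields
        int k = Math.max(0, offset);   // negative offsets are clamped to zero
        return Math.sqrt((1.0 + k) / (n + k));
    }

    public static void main(String[] args) {
        int[] offsets = {0, 20}; // 20 is an arbitrary illustrative value
        for (int offset : offsets) {
            System.out.println("offset=" + offset);
            for (int n = 2; n <= 5; n++) {
                System.out.printf("  terms=%d norm=%.4f%n", n, dampenedNorm(n, offset));
            }
        }
    }
}
```

The offset is the single tuning knob here: the larger it is, the less one extra word matters, which is the "decide what you really want to achieve" part.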
Maybe you could give a couple of examples of field values that you feel should be scored differently based on length.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 7:17 PM, <jimi.hulleg...@svensktnaringsliv.se> wrote:

> I am talking about the title field. And for the title field, a sweetspot interval of 1 to 50 makes very little sense. I want to have a fieldNorm value that differentiates between, for example, 2, 3, 4 and 5 terms in the title, but only very little.
>
> The 20% number I got by simply calculating the difference in the title fieldNorm of two documents, where one title was one word longer than the other title. One fieldNorm value was 20% larger than the other as a result of that. And since we use a multiplicative scoring calculation, a 20% increase in the fieldNorm results in a 20% increase in the final score.
>
> I'm not talking about "scores as percentages". I'm simply noting that this minor change in the text data (adding or removing one single word) causes the score to change by almost 20%. I noticed this when I renamed a document, removing a word from the title, and that single change caused the document to move up several positions in the result list. We don't want such minor modifications to have such a big impact on the resulting score.
>
> I'm not sure I can agree with you that "the effect of document length normalization factor is minimal". Then why does it impact our results in such a big way? And as I said, we don't want to disable it completely, we just want it to have a much lesser effect, even on really short texts.
>
> /Jimi
>
> ________________________________________
> From: Ahmet Arslan <iori...@yahoo.com.INVALID>
> Sent: Thursday, April 21, 2016 12:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?
>
> Hi Jimi,
>
> Please define a meaningful document-length range like min=1 max=50.
> By the way, you need to reindex every time you change something.
>
> Regarding the 20% score change, I am not sure how you calculated that number, but I assume it is correct. What really matters is the relative order of documents. It doesn't mean anything that the addition of a word decreases the initial score by x%. Please see: https://wiki.apache.org/lucene-java/ScoresAsPercentages
>
> There is an information retrieval heuristic which says that the addition of a non-query term should decrease the score.
>
> Lucene's default document length normalization may favor short documents too much. But folks blend the score with other structural fields (popularity), or even completely bypass the relevancy score and order by price, production date, etc. I mean, there are many use cases where the effect of document length normalization factor is minimal.
>
> Lucene/Solr is highly pluggable and very easy to customize.
>
> Ahmet
>
>
> On Wednesday, April 20, 2016 11:05 PM, "jimi.hulleg...@svensktnaringsliv.se" <jimi.hulleg...@svensktnaringsliv.se> wrote:
> Hi Ahmet,
>
> SweetSpotSimilarity seems quite nice. Some simple testing by throwing some different values at the class gives quite good results. Setting ln_min=1, ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less what I want. At least for the title field. I'm not sure what the actual effect of those settings would be on longer text fields, so maybe I will use the SweetSpotSimilarity only for the title field to start with.
>
> Of course I understand that there are many things that can be considered domain-specific requirements, like whether to favor/punish short/medium/long texts, and how. I was just wondering how many actual use cases there are where one wants a ~20% difference in score between two documents, where the only difference is that one of the documents has one extra word in one field.
> (And now I'm talking about an extra word that doesn't affect anything else except the fieldNorm value.) I for one find it hard to think of such a use case, and would consider it a very special one. I would consider a more lenient calculation a better fit for most use cases (and therefore most domains). :)
>
> /Jimi
>
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
> Sent: Wednesday, April 20, 2016 8:14 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?
>
> Hi Jimi,
>
> SweetSpotSimilarity allows you to define a document length range, so that all documents in that range will get the same fieldNorm value. In your case, you can say that documents from 1 word up to 100 words do not incur any document length punishment. If a document is longer than 100 words, apply some punishment.
>
> By the way, favoring/punishing short, middle, or long documents is a domain-specific thing. You are free to decide what to do.
>
> Ahmet
>
>
>
> On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se" <jimi.hulleg...@svensktnaringsliv.se> wrote:
> OK. Well, still, the fact that the score increases almost 20% because of just one extra term in the field is not really reasonable if you ask me. But you seem to say that this is expected, reasonable and wanted behavior for most use cases?
>
> I'm not sure that I feel comfortable replacing the default Similarity implementation with a custom one. That would just increase the complexity of our setup and would make future upgrades harder (we would for example have to remember to check if the default similarity configuration or implementation changes).
>
> No, if it really is the case that most people like and want this, and there is no way to configure Solr/Lucene to calculate fieldNorm in a more reasonable way (in my book) for short field values, then I just think we are forced to set omitNorms="true", maybe in combination with a simple field boost for shorter fields.
>
> /Jimi
>
>
>
> -----Original Message-----
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Wednesday, April 20, 2016 5:18 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?
>
> FWIW, length for normalization is measured in terms (tokens), not characters.
>
> With TF-IDF similarity (the default before 6.0), the normalization is based on the inverse square root of the number of terms in the field:
>
>     return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
>
> That code is in ClassicSimilarity:
>
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
>
> You can always write your own custom Similarity class to override that calculation.
>
> -- Jack Krupansky
>
> On Wed, Apr 20, 2016 at 10:43 AM, <jimi.hulleg...@svensktnaringsliv.se> wrote:
>
> > Hi,
> >
> > In general I think that the fieldNorm factor in the score calculation is quite good. But when the text is short I think that the effect is too big.
> >
> > I.e., with two documents that have a short text in the same field, just a few extra characters in one of the documents lower the fieldNorm factor too much. In one test the text in document 1 is 30 characters long and has fieldNorm 0.4375, and in document 2 the text is 37 characters long and has fieldNorm 0.375. That means that the first document gets almost a 20% higher score simply because of the 7 character difference.
> >
> > What are my options if I want to change this behavior?
> > Can I set a lower character limit, meaning that all fields with a length below this limit get the same fieldNorm value?
> >
> > I know I can force fieldNorm to be 1 by setting omitNorms="true" for that field, but I would prefer to still have it, just limit its effect on short texts.
> >
> > Regards
> > /Jimi
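
For reference, the arithmetic discussed in this thread can be tabulated with a small standalone sketch. It assumes a boost of 1 and ignores any rounding applied when norms are stored, so the raw values will not exactly match the 0.4375 and 0.375 reported above. `flooredNorm` and its `minTerms` parameter are hypothetical illustrations of the "minimum field length" idea from the original question, not an existing Solr or Lucene option.

```java
public class FieldNormTable {

    // Raw classic length norm quoted in the thread: 1 / sqrt(numTerms), boost assumed to be 1.
    static double rawNorm(int numTerms) {
        return 1.0 / Math.sqrt(Math.max(1, numTerms));
    }

    // Hypothetical variant: every field shorter than minTerms gets the norm of minTerms.
    static double flooredNorm(int numTerms, int minTerms) {
        return rawNorm(Math.max(numTerms, minTerms));
    }

    // Fraction by which the first norm exceeds the second.
    static double relativeGain(double larger, double smaller) {
        return larger / smaller - 1.0;
    }

    public static void main(String[] args) {
        int minTerms = 6; // assumed floor, chosen only for illustration
        for (int n = 1; n <= 10; n++) {
            System.out.printf("terms=%2d raw=%.4f gainOverNextLength=%5.1f%% floored=%.4f%n",
                    n, rawNorm(n),
                    100 * relativeGain(rawNorm(n), rawNorm(n + 1)),
                    flooredNorm(n, minTerms));
        }
        // The two fieldNorm values reported in the original question.
        System.out.printf("0.4375 vs 0.375: %.1f%% higher%n",
                100 * relativeGain(0.4375, 0.375));
    }
}
```

The last line works out to roughly 17%, which is the "almost 20%" difference described above. With the floor in place, all fields up to `minTerms` terms share one norm, at the cost of no longer distinguishing between very short titles at all.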